Document Word List (DWL)
This is a newline-delimited JSON (NDJSON) file that distills each input text into manageable data. Each line of the DWL is a JSON string holding the filename, the segment number, and the raw number of times that each n-gram appears in the file.
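For illustration, a single DWL line might look something like the following. Note that the field names here are hypothetical, chosen to mirror the description above, and are not necessarily MEH's exact schema:

```python
import json

# One hypothetical DWL line: filename, segment number, and raw n-gram counts.
# (Field names are illustrative, not MEH's actual output format.)
dwl_line = '{"filename": "essay_001.txt", "segment": 1, "ngrams": {"the": 12, "dog park": 2}}'

record = json.loads(dwl_line)
print(record["filename"], record["segment"], record["ngrams"]["dog park"])
```

Because each line is an independent JSON object, a DWL can be processed one line at a time without loading the whole file into memory.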
By default, MEH will output an n-gram frequency table after performing the initial analysis. This file will contain a complete list of all n-grams and their corresponding frequencies in your data; this information is presented in a few different ways:
- Frequency: The number of times that each n-gram appears in your dataset.
- Docs_With_Token: The number of documents (that meet your Word Count requirements) that contain each n-gram.
- ObservationPct: The percentage of documents (that meet your Word Count requirements) that contain each n-gram.
- IDF: The inverse document frequency of each n-gram.
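As a rough sketch of how these measures relate to one another, consider a toy corpus of three documents. The IDF below uses a natural logarithm over `N / Docs_With_Token`, a common convention; MEH's exact base and formula may differ:

```python
import math

# Toy corpus: each document represented as the set of n-grams it contains.
docs = [
    {"dog", "park", "dog park"},
    {"dog", "leash"},
    {"cat"},
]

ngram = "dog"
docs_with_token = sum(1 for d in docs if ngram in d)   # 2 documents contain "dog"
observation_pct = 100.0 * docs_with_token / len(docs)  # ~66.7
idf = math.log(len(docs) / docs_with_token)            # natural log; MEH's base may differ

print(docs_with_token, round(observation_pct, 1), round(idf, 3))
```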
Note that, unlike previous versions of MEH, the frequency list is not sorted when it is saved to your hard drive. This is primarily to keep RAM allocation manageable when analyzing extremely large datasets.
Accordingly, if your frequency list is extremely long (e.g., > ~1 million n-grams), some spreadsheet software such as Excel will not be able to load it all at once (you’ll receive a warning if this is the case, such as “File not completely loaded”). This means that some common n-grams might not be displayed. An easy way around this is to view your frequency list in a program other than Excel. For the social scientists out there, you might consider using SPSS if your frequency list is extremely long. Alternatively, you could pre-sort your frequency list prior to loading it into Excel using a script like this one in a language like R.
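If you would rather not use R, the pre-sorting step can be sketched in a few lines of Python. The column names below are illustrative stand-ins for the frequency-list columns described above:

```python
import csv
import io

# Hypothetical unsorted frequency list (column names mirror those described above).
raw = """ngram,Frequency,Docs_With_Token
park,3,2
dog,10,5
cat,7,4
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Sort by raw frequency, most frequent first, so the common n-grams
# appear at the top even if Excel truncates the file.
rows.sort(key=lambda r: int(r["Frequency"]), reverse=True)
print([r["ngram"] for r in rows])  # ['dog', 'cat', 'park']
```

For a real file you would read from and write back to disk with `open()` rather than an in-memory string.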
As a complete aside, if you want to compare the word frequencies / likelihoods from two different corpora, I have written a script that calculates all of the same indices that are found on Paul Rayson’s extremely helpful page. Essentially, you can get two separate frequency lists from MEH (one for each corpus), then apply this script.
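To give a sense of the core index on that page, the log-likelihood (G²) statistic can be computed from the two frequency lists roughly as follows. The function name and structure here are my own sketch, not the script referenced above:

```python
import math

def log_likelihood(a, b, c, d):
    """Rayson-style log-likelihood for a word that appears
    a times in corpus 1 (total size c words) and
    b times in corpus 2 (total size d words)."""
    # Expected frequencies under the null hypothesis of equal relative frequency.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# A word used at identical rates in both corpora yields 0.
print(log_likelihood(5, 5, 100, 100))   # 0.0
print(round(log_likelihood(10, 5, 1000, 1000), 3))
```

Higher values indicate a larger (more surprising) frequency difference between the two corpora.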
The verbose output generated by MEH is similar to the output created by standard content coding software, such as LIWC or RIOT Scan. Observations are numbered and accompanied by filenames, along with the segment numbers of each file (where applicable). Each column contains the number of times that an n-gram appears in each file, divided by that file’s word count.
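A minimal sketch of how a single verbose score is derived, using made-up counts:

```python
# Hypothetical raw counts for one observation, plus its total word count.
counts = {"dog": 2, "park": 1}
word_count = 50

# Verbose score = raw count divided by the file's word count.
verbose = {ngram: c / word_count for ngram, c in counts.items()}
print(verbose)  # {'dog': 0.04, 'park': 0.02}
```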
The binary output is identical to the verbose output, except that scores for each n-gram are converted into simple presence/absence scores: values of 1 and 0 signify the corresponding n-gram’s presence and absence, respectively, for a given observation. As per standard recommendations (e.g., Chung & Pennebaker, 2008), the binary output is often preferred over the verbose output for the meaning extraction method. This representation is often referred to as one-hot encoding.
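Converting a verbose row to its binary equivalent amounts to thresholding at zero. A sketch with made-up values:

```python
# A hypothetical verbose row (rates per observation).
verbose_row = {"dog": 0.034, "park": 0.0, "cat": 0.012}

# Any nonzero rate means the n-gram was present in that observation.
binary_row = {ngram: 1 if score > 0 else 0 for ngram, score in verbose_row.items()}
print(binary_row)  # {'dog': 1, 'park': 0, 'cat': 1}
```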
Raw Count (i.e., Document Term Matrix) Output
The document term matrix output is similar to the binary and verbose outputs, except that it provides the raw count of each n-gram per observation. This output file can easily be used for something like Latent Dirichlet Allocation via the “topicmodels” package in R.
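A tiny illustration of what a raw-count document term matrix looks like, built from toy documents (this shows the matrix's shape, not MEH's actual file layout):

```python
from collections import Counter

# Two toy documents, already tokenized.
docs = [["dog", "park", "dog"], ["cat", "dog"]]

# One column per unique term, one row per document; cells hold raw counts.
vocab = sorted({w for doc in docs for w in doc})           # ['cat', 'dog', 'park']
dtm = [[Counter(doc)[w] for w in vocab] for doc in docs]
print(dtm)  # [[0, 2, 1], [1, 1, 0]]
```

This row-per-document, count-per-cell structure is exactly what LDA implementations such as R’s “topicmodels” expect as input.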
If you are new to LDA, or you simply need an R script that makes LDA easy to use with MEH’s DTM output, I have written one that you may freely use. It can be downloaded here.