Text Encoding of Input File(s)
This setting should match the encoding of your input files. If you’re not sure about the encoding of your text files, please refer to the ExamineTXT program on the toolbox page.
On this tab, you can choose whether to split incoming texts into separate segments. You can do this either by splitting texts into N equally sized segments, or by specifying a desired segment size. In the latter case, MEH will attempt to split your texts into segments as close as possible to your specified segment length without going over it.
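The two segmentation modes can be sketched as follows. This is an illustrative sketch only (the function names `split_into_n_segments` and `split_by_segment_size` are not part of MEH), and it assumes segments are formed on word boundaries:

```python
def split_into_n_segments(words, n):
    """Split a word list into n roughly equal segments."""
    size = -(-len(words) // n)  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]

def split_by_segment_size(words, max_size):
    """Split a word list into segments as close to max_size as
    possible without going over."""
    return [words[i:i + max_size] for i in range(0, len(words), max_size)]

words = "the quick brown fox jumps over the lazy dog".split()
print(split_into_n_segments(words, 3))   # three segments of three words
print(split_by_segment_size(words, 4))   # segments of up to four words
```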
Conversions take place prior to lemmatization when text is being processed; this feature is intended to augment and assist lemmatization. Additionally, this feature allows the user to customize text replacement. The “Conversions” field may be used to fix common misspellings (e.g., “hieght” to “height”; “teh” to “the”), convert “textisms” (e.g., “bf” to “boyfriend”), and so on. The conversions feature also allows for wildcards (*).
The proper format for conversion is:
This will replace all occurrences of the word “bf” with “boyfriend” before analyzing text. As a note, original and converted forms need not be a single word (e.g., “MEH is awesome” to “This software is adequate”). For more advanced uses and a deeper explanation of using the conversion engine, please refer to the “Advanced Conversions” page.
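The general idea of applying such conversions can be sketched with regular expressions. This is a hypothetical illustration, not MEH's actual conversion engine: the whole-word matching, case-insensitivity, and the treatment of “*” as matching any run of word characters are all assumptions made here for the example.

```python
import re

def apply_conversions(text, conversions):
    """Replace each original form with its converted form on whole-word
    boundaries. A '*' in the original form is treated as a wildcard
    (an assumption for illustration; MEH's engine may differ)."""
    for original, converted in conversions:
        pattern = r"\b" + re.escape(original).replace(r"\*", r"\w*") + r"\b"
        text = re.sub(pattern, converted, text, flags=re.IGNORECASE)
    return text

conversions = [("teh", "the"), ("bf", "boyfriend"), ("lol*", "laugh")]
print(apply_conversions("teh bf said lolol", conversions))
# -> "the boyfriend said laugh"
```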
You may read more about using stop words here. In short, the stop list is used to specify which n-grams you want to omit from your output.
“Dictionary Words” are user-specified n-grams to include, even if they are low base-rate words.
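Taken together, the stop list and dictionary list shape which n-grams survive filtering: stop-listed n-grams are dropped, while dictionary words are kept even when they are rare. A minimal sketch of that selection logic (the function `select_ngrams` and the single `min_freq` cutoff are illustrative assumptions, not MEH's implementation):

```python
def select_ngrams(freqs, stop_list, dictionary_words, min_freq):
    """Filter an n-gram frequency table: drop stop-listed n-grams, and
    keep dictionary words even when they fall below min_freq."""
    stops = {s.lower() for s in stop_list}
    keep = {d.lower() for d in dictionary_words}
    return {g: f for g, f in freqs.items()
            if g.lower() not in stops and (f >= min_freq or g.lower() in keep)}

freqs = {"the": 120, "happy": 10, "data analysis": 2,
         "meaning extraction": 1, "of": 95}
print(select_ngrams(freqs, ["the", "of"], ["meaning extraction"], 5))
# -> {'happy': 10, 'meaning extraction': 1}
```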
Choose Output Types
Choose Pre-Existing Document Word List: This feature lets you pick up from a previous analysis using a DWL that has already been generated. This is particularly useful if you’re analyzing a very large dataset and want to change some downstream options without having to reprocess everything from the beginning.
Prune Frequency List; …after every X Documents: This option helps ensure that the dataset remains manageable when building your frequency list and beyond. In short, this option will remove any n-gram with a frequency of 1 at your specified intervals. The overwhelming majority of n-grams are extremely rare; removal of these items will typically not exert any meaningful impact on your results. The primary benefit of this option is to ensure that you don’t overextend your analysis beyond your system’s memory (by trying to hold too many n-grams / information in RAM at once).
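The pruning idea can be sketched like this, assuming a simple whitespace tokenizer standing in for n-gram extraction (the function name and structure are illustrative, not MEH's code):

```python
from collections import Counter

def build_freq_list(documents, prune_every):
    """Accumulate token counts, pruning singletons (frequency == 1)
    after every `prune_every` documents to bound memory use."""
    counts = Counter()
    for i, doc in enumerate(documents, start=1):
        counts.update(doc.split())
        if i % prune_every == 0:
            # Drop anything seen only once so far.
            counts = Counter({g: c for g, c in counts.items() if c > 1})
    return counts

print(build_freq_list(["a b a", "b c", "a d"], prune_every=2))
```

Note the tradeoff this sketch makes visible: a term pruned as a singleton cannot recover its earlier count if it reappears later, which is why pruning only rarely affects results in practice (truly common n-grams quickly exceed a frequency of 1).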
For the other output file types on this tab, you can find some additional information on the understanding output page of this website.
This tab allows you to set the main features for what information to extract/retain as MEH processes through your texts. You can choose what N you want to use for your N-grams (i.e., unigrams, bi-grams, tri-grams, etc.), and you can choose to ignore texts that fall below a specified word count.
Note that whatever you select as your N in N-grams, MEH will extract [1 to N]-grams as well. In other words, if you choose 3-grams, MEH will also extract 2-grams and 1-grams.
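The [1 to N]-gram extraction described above can be sketched as a simple sliding window over a token list (an illustrative sketch; `extract_ngrams` is not a MEH function):

```python
def extract_ngrams(tokens, max_n):
    """Return all 1-grams through max_n-grams from a token list,
    each joined into a single space-separated string."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Choosing 3-grams also yields all 2-grams and 1-grams:
print(extract_ngrams("meh extracts n grams".split(), 3))
```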
Importantly, you can also set thresholds for which N-grams to retain for your final Document X Term matrix (DTM) output.
- Retain N-grams that appear in >= X% of Documents: This option will ensure that your final output only includes N-grams that appeared at least once in X% of all documents.
- Retain N-grams with a Frequency >= X: This option allows you to retain N-grams that occur at least X times in your dataset.
- Retain the X most Frequent N-grams (by raw frequency): This option will retain ~X N-grams (e.g., 500). The decision for which N-grams to keep is based on the raw frequency of each N-gram.
- … (by % of Documents): Same as the previous option, but instead of using the raw frequency, the threshold is set by the percent of documents that each n-gram occurs in.
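The retention thresholds above can be sketched against a list of tokenized documents. This is a simplified illustration of the options described, not MEH's implementation; in particular, the single-function interface and parameter names are assumptions:

```python
from collections import Counter

def retain_terms(docs, min_doc_pct=None, min_freq=None, top_k=None):
    """Apply one retention threshold to a list of tokenized documents
    (each a list of n-gram strings). Pass exactly one threshold."""
    freq = Counter(g for doc in docs for g in doc)          # raw frequency
    doc_freq = Counter(g for doc in docs for g in set(doc)) # documents containing g
    if min_doc_pct is not None:
        return {g for g in freq if 100 * doc_freq[g] / len(docs) >= min_doc_pct}
    if min_freq is not None:
        return {g for g in freq if freq[g] >= min_freq}
    if top_k is not None:
        return {g for g, _ in freq.most_common(top_k)}
    return set(freq)

docs = [["a", "b", "a"], ["b", "c"], ["b"]]
print(retain_terms(docs, min_doc_pct=100))  # only "b" appears in every document
print(retain_terms(docs, min_freq=2))       # "a" and "b" occur at least twice
```

The “by % of Documents” variant of the top-X option would rank on `doc_freq` rather than `freq` before taking the top X.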