General Use

RIOT Scan analyzes a list of .txt files that you provide, computing the indices that you select. Click the “Select Folder” button, then choose the location of the .txt files that you wish to analyze. Make sure that the files you would like to analyze are checked in the checkbox list, select the processing options and coding schemes that you would like to use, and then click the “Analyze” button. Citations for all of the options and coding schemes that you select will be generated in the same directory as your output. I provide here a brief description of how to use the different processing and segmentation options in RIOT Scan.

General Guidelines – Variation and Word Indices

Remember that, because the software does a recursive inspection for the variation indices (it compares words in the text file to the rest of the file), scanning time rises steeply with individual file size. For example, The History of England in Three Volumes, Vol. II by Tobias Smollett is roughly 800,000 words long. On a relatively new, fast computer, it might take around 15 minutes to completely analyze the language variability of this file. If you have a lot of very large files, the entire analysis process might take a while. You should be able to leave RIOT Scan running in the background without it interfering with other running programs. Medium-sized texts, such as A Discourse Concerning Ridicule and Irony in Writing by Anthony Collins (roughly 30,000 words), can be analyzed fairly quickly, usually in less than 30 seconds. Smaller files (e.g., 10,000 words) can be analyzed for variability almost instantaneously. These estimates do not include content coding schemes. If you decide to run content coding schemes on your text, each file will take longer to process. The amount of extra time will depend on both the number of coding schemes and the specific coding schemes that you decide to run.

Variation Indices Options

Two extra options are available when calculating your indices: “Linebreak as new sentence” and “Ignore punctuation”. Whether you use them should depend on the format of your text files and your operationalization of what constitutes a “sentence”. I have found that these two options are best suited to artistic styles and forms of writing, such as song lyrics and poetry. Let’s take the following example from the poem My Grandmother’s Love Letters by the poet Hart Crane:

Yet I would lead my grandmother by the hand
Through much of what she would not understand;
And so I stumble. And the rain continues on the roof
With such a sound of gently pitying laughter.

Without either of the options selected, RIOT Scan will calculate that you have 2 sentences based upon the location of periods in this text. If you were to check the “Linebreak as new sentence” option, RIOT Scan would count 5 sentences. This is because we have 4 lines of text, one of which will count as 2 sentences because of the period at the end of the phrase “And so I stumble.” If we wanted to consider this entire passage to be 4 sentences (based solely upon linebreaks), we would also check the “Ignore punctuation” option. How you conceptualize your sentences can drastically impact some of the variation indices generated by RIOT Scan; you should carefully plan and be able to justify your use of these options. For most bodies of text, however, these options are entirely unnecessary.
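
The three counts above can be reproduced with a short Python sketch. This is only an illustration of the counting logic as described here, not RIOT Scan's actual code; the function name count_sentences and its two parameters are hypothetical, and treating "Ignore punctuation" on its own as a single sentence is an assumption.

```python
import re

POEM = (
    "Yet I would lead my grandmother by the hand\n"
    "Through much of what she would not understand;\n"
    "And so I stumble. And the rain continues on the roof\n"
    "With such a sound of gently pitying laughter."
)

def count_sentences(text, linebreak_as_sentence=False, ignore_punctuation=False):
    """Count sentences under the two options (hypothetical logic)."""
    boundaries = []
    if not ignore_punctuation:
        boundaries.append(r"[.!?]+")   # sentence-final punctuation
    if linebreak_as_sentence:
        boundaries.append(r"\n+")      # each linebreak ends a sentence
    if not boundaries:
        # Assumption: with no boundary at all, the text is one sentence.
        return 1 if text.strip() else 0
    pieces = re.split("|".join(boundaries), text)
    # Ignore empty pieces, such as the one after the final period.
    return sum(1 for piece in pieces if piece.strip())

print(count_sentences(POEM))                              # 2
print(count_sentences(POEM, linebreak_as_sentence=True))  # 5
print(count_sentences(POEM, linebreak_as_sentence=True,
                      ignore_punctuation=True))           # 4
```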

Stemming / Lemmatisation

For the Porter Stemming Algorithm, you can do some more reading on that topic here:

You should only use stemming or lemmatisation algorithms if you have specific, a priori reasons for doing so. The most likely reason for using either of these options is with regard to the variation indices. Such algorithms can drastically change the final output because they change the text being processed, albeit in an algorithmic, rule-driven fashion. As a result, the text that reaches processing may be in a form that many content coding schemes will match less often, or more often, than they would match the original text. To elaborate, I processed 150 writing samples both with and without application of Porter’s stemming algorithm. Across 285 content coding categories, the average correlation between data furnished with and without the stemming algorithm was r = .85. While this is indeed a strong correlation, the range of correlations (minimum r = .12; maximum r = 1.0) demonstrates that your choice of applying such algorithms may cause radical deviations in your dataset.

As a usage example, you might want to use the stemming or lemmatisation algorithm when you want to look at Hapax Legomena while correcting for variations in verb tense or plurality, or when you want to look at bodies of text in terms of word roots. This would result in an index that may prove considerably more meaningful from a psychological standpoint for specific conceptual reasons (e.g., words like “acceptable”, “accepting”, and “accepted” being treated as the same word) than if suffixes were not stripped. As with the other options, it is important that you carefully plan your use of the stemming or lemmatisation algorithm and be able to justify its use in your work.
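
To make the Hapax Legomena example concrete, here is a minimal Python sketch. The crude_stem function is a toy suffix stripper written only for this illustration; it is not the Porter algorithm and not what RIOT Scan runs. The function names and the sample sentence are hypothetical.

```python
from collections import Counter

def crude_stem(word):
    """Toy suffix stripper for illustration only; NOT the Porter algorithm."""
    for suffix in ("able", "ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def hapax_legomena(words):
    """Return the words that occur exactly once, in sorted order."""
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c == 1)

words = "an acceptable offer was accepted by an accepting crowd".split()

print(hapax_legomena(words))
# ['acceptable', 'accepted', 'accepting', 'by', 'crowd', 'offer', 'was']

print(hapax_legomena([crude_stem(w) for w in words]))
# ['by', 'crowd', 'offer', 'was']
```

Without stemming, the three forms of “accept” each count as a word used only once; after stemming they collapse into a single word used three times and drop out of the Hapax Legomena count.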


Segmentation Options

You may want to treat a body of text as multiple, separate pieces of text; this can be done by clicking on the “Segmentation Options” button. For example, if you wanted to analyze Herman Melville’s classic Moby-Dick; or, The Whale (200,000+ words) and treat it as 100 separate data points, you could set the segmentation options in RIOT Scan to smartly split the file into 100 equally sized samples of text (approximately 2,000 words per sample), or as close to equal as is possible. This option might be useful if you want to observe trends across a body of text (e.g., examining trends of emotion words in Moby-Dick) or if you want to compare two texts at a finer level of detail (e.g., you want to conduct a one-way ANOVA to see if Moby-Dick has significantly greater quantities of “Whirlall words” than does Jonathan Miles’ The Wreck of the Medusa).

Split Files into Equally Sized Segments:

This option allows you to uniformly split all incoming files into the same number of segments. For example, if you want to split all text files into 10 parts, this option will accomplish this. Each segment within each file will be approximately the same size.
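
A minimal Python sketch of this kind of split is shown below. It is one straightforward way to get near-equal segments, not necessarily how RIOT Scan does it; the function name split_into_parts is hypothetical, and giving the leftover words to the earliest segments is an assumption.

```python
def split_into_parts(words, parts):
    """Split a list of words into `parts` contiguous, near-equal segments."""
    base, extra = divmod(len(words), parts)
    segments, start = [], 0
    for i in range(parts):
        # The first `extra` segments each take one leftover word.
        size = base + (1 if i < extra else 0)
        segments.append(words[start:start + size])
        start += size
    return segments

words = ["word"] * 205
sizes = [len(segment) for segment in split_into_parts(words, 10)]
print(sizes)  # [21, 21, 21, 21, 21, 20, 20, 20, 20, 20]
```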

Desired Segment Size:

This field allows you to specify the desired maximum segment size when processing text. Use the default value of zero to refrain from segmenting text. Your text files will be smartly parsed so that segments are as close to your desired target size as is possible.

This option may also be thought of as a word count normalization tool, as well as an upper limit on word count. This feature is of great use when the files that you would like to process have varying word counts. The Meaning Extraction Method works best when word counts across observations are relatively homogeneous. When using this option, no segment will exceed the limit that you place in this field. For example, a target segment size of 150 will parse files as follows:

An 80-word observation remains at 80 words.
A 300-word observation becomes two 150-word segments.
A 500-word observation becomes 4 segments, each containing approximately 125 words.

Note: If an observation is segmented, its segments will never fall below 50% of the limit specified by this option. Observations that already fall below the target segment size specified by the user will remain unsegmented at their original word count.
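
The three examples and the 50% note are consistent with a simple rule: use the smallest number of segments that keeps every segment at or under the limit, then divide the words as evenly as possible. The Python sketch below follows that rule as an assumption; it is not taken from RIOT Scan's source, and the function name target_segment_sizes is hypothetical.

```python
import math

def target_segment_sizes(word_count, target):
    """Return segment word counts for a desired maximum segment size."""
    # A target of zero means "do not segment"; short texts stay whole.
    if target <= 0 or word_count <= target:
        return [word_count]
    # Fewest segments such that none exceeds the target.
    parts = math.ceil(word_count / target)
    base, extra = divmod(word_count, parts)
    return [base + 1 if i < extra else base for i in range(parts)]

print(target_segment_sizes(80, 150))   # [80]
print(target_segment_sizes(300, 150))  # [150, 150]
print(target_segment_sizes(500, 150))  # [125, 125, 125, 125]
```

Under this rule, a segmented text always has more words than the limit, so even the smallest segment stays at or above half of the limit (for example, 151 words with a limit of 150 gives segments of 76 and 75 words).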

Think of engaging in the meaning extraction method as a bit like tuning an oscilloscope. You have two knobs that you are trying to tweak to find the “best” possible theme solution. You will want to tune your “wave amplitude” knob (i.e., the segment size / word count normalization) and then try to find the right “wave frequency” (i.e., various PCA solutions). Turn your “amplitude” knob to a good spot, then try adjusting the frequency. If you are getting a “noisy” signal, then try changing the amplitude, then adjust your frequency some more.

Segment Text with Regular Expression:

This allows you to enter a regular expression that will be used to determine where to segment texts. For example, if you want to split your text files by paragraph, you might use the regular expression \r\n for newline splits. This option is also useful for building semantic network data, be it at the paragraph level (\r\n), sentence level, etc. Every time a match is found for your expression, a split will be placed at that location. This option is also useful for other topic modeling methods, such as LDA, if you are looking for a specific level of analysis.
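
As a rough illustration of regular expression segmentation, the Python sketch below splits a text wherever a pattern matches. It uses Python's re module, whose syntax may differ in small ways from the regular expression engine inside RIOT Scan. The function name segment_by_regex, the sample text, and the sentence-level pattern are hypothetical, and dropping empty segments is an assumption.

```python
import re

def segment_by_regex(text, pattern):
    """Split text at every match of `pattern`, dropping empty segments."""
    return [segment for segment in re.split(pattern, text) if segment.strip()]

text = "First paragraph. Still first.\r\nSecond paragraph.\r\nThird."

# Paragraph level: split on Windows-style line breaks.
print(segment_by_regex(text, r"\r\n"))
# ['First paragraph. Still first.', 'Second paragraph.', 'Third.']

# Sentence level: split on whitespace that follows ., ! or ?
print(segment_by_regex(text, r"(?<=[.!?])\s+"))
# ['First paragraph.', 'Still first.', 'Second paragraph.', 'Third.']
```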