Creating Custom Dictionaries

NOTE: Punctuation in dictionary files is not supported at this time. Adding punctuation to your dictionary words/phrases can lead to inaccurate results.

RIOT Scan possesses one of the most powerful and flexible custom content coding systems available for research. If you would like to use your own dictionary, it must be contained within a .txt file, and it must conform to the following format:

$Name: Test Dictionary
$Author: Ryan Boyd
%%%
Category Listing Here
%%%
Dictionary Contents here

You can download an example dictionary here. RIOT Scan is capable of reading dictionary files in UTF-8 encoding as well, if your language requires this encoding format. Now, for a quick explanation of the header:

$Name: – put the name of your dictionary here. It does not have to be the same as your filename

$Author: – put the author of your dictionary file here

The second section (after the first %%%) is your category listing. Each line holds a category number on the left and a category name on the right, separated by a TAB. The category names in this section are what you will see in your output file. The custom dictionary system in RIOT Scan has been successfully tested with up to 16,000 distinct categories combined with 16,000 different words matching into various categories; this is likely far more than most purposes require.

Finally, the last section of your dictionary file (after the second %%%) should include the words/phrases that you want to code, along with the categories to which each word/phrase belongs. Again, values should be separated by a TAB. Capitalization is not important for any part of the dictionary file.

Note: As of version 1.4.7, you no longer need a $MAXCATS: tag in your dictionary header. Including this tag will not affect the functionality of your dictionary file; RIOT Scan now detects this feature of your dictionary automatically.
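To make the layout above concrete, here is a minimal Python sketch of a reader for this file format. It is not RIOT Scan's own code. The function name parse_dictionary and the sample contents are hypothetical, and the sketch assumes that each line in the last section lists the word/phrase first, followed by one or more category numbers, all separated by TABs (the description above only states that the values are TAB-separated).

```python
def parse_dictionary(text):
    """Split a dictionary file into header, category listing, and entries."""
    header, categories, entries = {}, {}, []
    section = 0  # 0 = header, 1 = category listing, 2 = dictionary contents
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "%%%":
            section += 1
            continue
        if section == 0:
            # Header lines look like "$Name: Test Dictionary".
            if line.startswith("$") and ":" in line:
                key, value = line[1:].split(":", 1)
                header[key.strip().lower()] = value.strip()
        elif section == 1:
            # Category number on the left, category name on the right.
            number, name = line.split("\t", 1)
            categories[number.strip()] = name.strip()
        else:
            # ASSUMPTION: word/phrase first, then its category numbers.
            term, *cats = line.split("\t")
            entries.append((term.strip().lower(),
                            [c.strip() for c in cats if c.strip()]))
    return header, categories, entries


# Hypothetical sample file, written with explicit TAB characters.
SAMPLE = (
    "$Name: Test Dictionary\n"
    "$Author: Ryan Boyd\n"
    "%%%\n"
    "1\tPositive\n"
    "2\tFood\n"
    "%%%\n"
    "happy meal\t1\t2\n"
    "happy\t1\n"
    "eat*\t2\n"
)

header, categories, entries = parse_dictionary(SAMPLE)
```

Lowercasing each entry reflects the statement that capitalization is not important. A real file would be read from disk with UTF-8 encoding before being passed to a function like this one.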

Wildcards and Phrases

As of version 1.6.0, RIOT Scan has a much improved custom dictionary system. In addition to single words, it can code for wildcards, phrases, and phrases with wildcards. Use an asterisk (*) to denote a wildcard. Here are some examples:

eat – this will detect only the word “eat”, and not words such as “eating”, “eats”, and so on.

eat* – this will detect words such as “eat”, “eating”, “eats”, and so on.

Dashing through the snow – this is a plain phrase, and will be detected only if it occurs exactly as is

Dashing through the s* – this is a phrase with a wildcard. This example would detect phrases such as “Dashing through the snow” and “Dashing through the shower”, but would NOT detect “Dashing through the mall”

Waiting * Godot – this is a phrase in which the wildcard stands in for one whole word. This example would detect phrases such as “Waiting for Godot” and “Waiting with Godot”, but NOT “Waiting to see Godot”

Wait* * * Godot – this example would detect phrases such as “Waiting to see Godot” and “Wait for Mr. Godot”, but would NOT detect “Waiting for Godot”

*ing for Godot – this example would detect “Waiting for Godot” and “Pining for Godot”, but not “Waits for Godot”
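The wildcard rules illustrated above can be approximated with regular expressions. The following Python sketch is an illustration only, not RIOT Scan's actual matching code; the function names pattern_to_regex and matches are hypothetical. It assumes that a standalone * stands for exactly one whole word, and that a * attached to a word stands for zero or more additional characters.

```python
import re


def pattern_to_regex(pattern):
    """Translate a dictionary entry with * wildcards into a compiled regex."""
    parts = []
    for token in pattern.split():
        if token == "*":
            # A standalone asterisk stands in for exactly one whole word.
            parts.append(r"\S+")
        else:
            # An attached asterisk matches zero or more extra characters.
            pieces = [re.escape(piece) for piece in token.split("*")]
            parts.append(r"\S*".join(pieces))
    # Words are separated by whitespace; capitalization is ignored.
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def matches(pattern, text):
    """Return True if the whole text is matched by the dictionary entry."""
    return pattern_to_regex(pattern).fullmatch(text) is not None


print(matches("eat*", "eating"))                            # True
print(matches("Waiting * Godot", "Waiting to see Godot"))   # False
print(matches("Wait* * * Godot", "Waiting to see Godot"))   # True
```

Because the sketch matches an entire string, it is best suited to checking individual candidate phrases while you test a dictionary, rather than to scanning a full document.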

IMPORTANT! If your custom dictionary contains multi-word phrases (e.g., bigrams, trigrams), ensure that the highest-order phrases (those with the most words) appear in your dictionary first. For example, “I’m happy” should appear in your dictionary before words like “I’m” and “happy”. RIOT Scan examines your custom dictionary in order. If words are detected that were already picked up as part of a phrase, they are ignored so that the same words in your text are not counted multiple times.
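The effect of entry order can be seen in a small Python sketch. This is a simplified model of the behavior described above, not RIOT Scan's implementation: it handles plain words and phrases only (no wildcards), splits text on whitespace, and uses the hypothetical helper names sort_longest_first and count_matches.

```python
def sort_longest_first(entries):
    """Put entries with the most words first; ties keep their original order."""
    return sorted(entries, key=lambda e: len(e.split()), reverse=True)


def count_matches(entries, text):
    """Count each entry in order, never reusing a word that is already matched."""
    tokens = text.lower().split()
    used = [False] * len(tokens)
    counts = {}
    for entry in entries:
        words = entry.lower().split()
        n = len(words)
        hits = 0
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words and not any(used[i:i + n]):
                used[i:i + n] = [True] * n  # mark these words as consumed
                hits += 1
        counts[entry] = hits
    return counts


text = "I'm happy and happy"
bad_order = ["i'm", "happy", "i'm happy"]
good_order = sort_longest_first(bad_order)

print(count_matches(bad_order, text))   # the phrase is never counted
print(count_matches(good_order, text))  # the phrase is counted first
```

With the single words listed first, they consume the text before the phrase is examined, so the phrase scores zero. Sorting the entries by word count before saving the dictionary avoids this.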

If you are building your own custom dictionary, it is important that you test it in order to ensure that it works as you would expect. If you have any questions about, or are having trouble with, the format of the dictionary file, please send me an e-mail.

“Not” Words

As of version 2.0.0, I have included the ability to ignore words and phrases that would otherwise be detected as positive matches, depending on what follows them. This can be achieved by appending the words/phrases that should block a match, each enclosed in angle brackets, to the desired match. This might sound a little confusing, so I will illustrate with some examples.

Let’s say that you want to detect the word “have”, but only when it is not followed by the word “some”. You would use the following syntax in your dictionary file:

have<some>

Note that there is no space between the word “have” and the opening bracket. This entry will detect the word “have” in the following sentences:

1. I have to go to the store.

2. I have an apple for you.

3. I have Spiderman sheets on my bed.

…but would ignore the word “have” in this example:

4. I have some time later today.

Additionally, you can chain together multiple “Not” scenarios on a single line, as follows:

have<some><a><an>

This example would detect the word “have” in examples 1 and 3, but not 2 and 4. When using this feature of custom dictionaries, note that the word “have” is still counted as a single word when it is found. This feature can also be extended to phrases:

don’t have<a cow>

This will detect the phrase “don’t have” in the following examples:

5. I don’t have a pet

6. I don’t have a clue

…but ignores the phrase “don’t have” in this example:

7. Don’t have a cow, man.
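The “Not” behavior shown in the examples above can be sketched in Python as follows. This is an illustration under stated assumptions rather than RIOT Scan's own logic: the helper names parse_entry and count_entry are hypothetical, wildcards are not handled, text is split on whitespace with trailing punctuation stripped, and straight apostrophes are used throughout.

```python
import re


def parse_entry(entry):
    """Split an entry such as 'have<some><a>' into its match and its blockers."""
    base = entry.split("<", 1)[0]
    blockers = re.findall(r"<([^<>]*)>", entry)
    return (base.lower().split(),
            [b.lower().split() for b in blockers if b.strip()])


def count_entry(entry, text):
    """Count matches of the entry that are not followed by a blocker."""
    base, blockers = parse_entry(entry)
    # ASSUMPTION: simple tokenizer; punctuation at word edges is stripped.
    tokens = [t.strip(".,!?;:") for t in text.lower().split()]
    n = len(base)
    hits = 0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != base:
            continue
        following = tokens[i + n:]
        if any(following[:len(b)] == b for b in blockers):
            continue  # a "Not" word/phrase comes next, so skip this match
        hits += 1
    return hits


print(count_entry("have<some>", "I have to go to the store."))      # 1
print(count_entry("have<some>", "I have some time later today."))   # 0
print(count_entry("don't have<a cow>", "Don't have a cow, man."))   # 0
```

Running your own entries through a check like this is one way to confirm that a chained entry such as have<some><a><an> excludes exactly the cases you intend.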