{"id":73,"date":"2013-10-06T15:09:02","date_gmt":"2013-10-06T20:09:02","guid":{"rendered":"https:\/\/www.ryanboyd.io\/software\/meh\/?page_id=73"},"modified":"2018-11-10T20:06:06","modified_gmt":"2018-11-10T20:06:06","slug":"understanding-output","status":"publish","type":"page","link":"https:\/\/www.ryanboyd.io\/software\/meh\/understanding-output\/","title":{"rendered":"Understanding Output"},"content":{"rendered":"<h2><span style=\"text-decoration: underline;\">Document Word List (DWL)<\/span><\/h2>\n<p>This is a newline-delimited JSON file (<a href=\"http:\/\/ndjson.org\/\" target=\"_blank\" rel=\"noopener\">ndjson<\/a>) that distills each input text into manageable data. Each line of the DWL is a JSON object that holds the filename, segment number, and raw number of times that each n-gram appears in the file.<\/p>\n<h2><span style=\"text-decoration: underline;\">Frequency List<\/span><\/h2>\n<p>By default, MEH will output an n-gram frequency table after performing the initial analysis. 
This file contains a complete list of all n-grams and their corresponding frequencies in your data &#8212; this information is presented in a few different ways:<\/p>\n<ul>\n<li><span style=\"text-decoration: underline;\">Frequency<\/span>: The number of times that each n-gram appears in your dataset.<\/li>\n<li><span style=\"text-decoration: underline;\">Docs_With_Token<\/span>: The number of documents (that meet your Word Count requirements) that contain each n-gram.<\/li>\n<li><span style=\"text-decoration: underline;\">ObservationPct<\/span>: The percentage of total documents (that meet your Word Count requirements) that contain each n-gram.<\/li>\n<li><span style=\"text-decoration: underline;\">IDF<\/span>: The <a href=\"http:\/\/www.tfidf.com\/\" target=\"_blank\" rel=\"noopener\">inverse document frequency<\/a> of each n-gram.<\/li>\n<\/ul>\n<p>Note that, unlike previous versions of MEH, the frequency list is <span style=\"text-decoration: underline;\">not<\/span> sorted when it is saved to your hard drive. This is done primarily to conserve RAM when analyzing extremely large datasets.<\/p>\n<p>Accordingly, if you have an extremely long frequency list (e.g., &gt; ~1 million n-grams), some spreadsheet software such as Excel will not be able to load it all at once (you&#8217;ll receive a warning such as &#8220;File not completely loaded&#8221;), which means that some common n-grams might not be displayed. An easy way around this is to open your frequency list in a program without Excel&#8217;s row limit; for the social scientists out there, SPSS is one option. 
Alternatively, you could pre-sort your frequency list before loading it into Excel, using a script like <a href=\"https:\/\/www.ryanboyd.io\/software\/meh\/Supplemental\/Sort-Freq-Descending.R\">this one<\/a> in a language like <a href=\"https:\/\/cran.r-project.org\/\" target=\"_blank\" rel=\"noopener\">R<\/a>.<\/p>\n<p>As a complete aside, if you want to compare the word frequencies \/ likelihoods from two different corpora, I have written a script that calculates all of the same indices that are found on <a href=\"http:\/\/ucrel.lancs.ac.uk\/llwizard.html\" target=\"_blank\" rel=\"noopener\">Paul Rayson&#8217;s extremely helpful page<\/a>. Essentially, you can get two separate frequency lists from MEH (one for each corpus), then apply <a href=\"https:\/\/www.ryanboyd.io\/software\/meh\/Supplemental\/Corpus%20Comparison%20Script%20for%20MEH.R\">this script<\/a>.<\/p>\n<h2><span style=\"text-decoration: underline;\">Verbose Output<\/span><\/h2>\n<p>The verbose output generated by MEH is similar to the output created by standard content coding software, such as LIWC or RIOT Scan. Observations are numbered and accompanied by filenames, along with the segment numbers of each file (where applicable). Each column represents the number of times that an n-gram appears in each file, divided by that file&#8217;s word count.<\/p>\n<h2><span style=\"text-decoration: underline;\">Binary Output<\/span><\/h2>\n<p>The binary output is identical to the verbose output, except that scores for each n-gram are converted into simple presence\/absence scores. Values of <strong>1<\/strong> and <strong>0<\/strong> signify the corresponding n-gram&#8217;s presence and absence, respectively, for a given observation. As per standard recommendations (e.g., Chung &amp; Pennebaker, 2008), the binary output is often preferred over the verbose output for the meaning extraction method. 
This is often referred to as <a href=\"https:\/\/hackernoon.com\/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f\" target=\"_blank\" rel=\"noopener\">one-hot encoding<\/a>.<\/p>\n<h2><span style=\"text-decoration: underline;\">Raw Count (i.e., Document Term Matrix) Output<\/span><\/h2>\n<p>The document term matrix output is similar to the binary and verbose outputs, except that it provides the raw counts of each n-gram per observation. This output file can be used directly for methods such as <a href=\"http:\/\/machinelearning.wustl.edu\/mlpapers\/paper_files\/BleiNJ03.pdf\" target=\"_blank\" rel=\"noopener\">Latent Dirichlet Allocation<\/a> via the &#8220;<a href=\"http:\/\/cran.r-project.org\/web\/packages\/topicmodels\/topicmodels.pdf\" target=\"_blank\" rel=\"noopener\">topicmodels<\/a>&#8221; package in R.<\/p>\n<p>If you are new to LDA, or you simply need an R script that makes LDA easy to use with MEH&#8217;s DTM output, I have written one that you may freely use. It can be downloaded <a href=\"https:\/\/www.ryanboyd.io\/software\/meh\/Supplemental\/LDA%20with%20MEH%20DTM.R\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Document Word List (DWL) This is a newline-delimited JSON file (ndjson) that distills each input text into manageable data. Each line of the DWL is a JSON object that holds the filename, segment number, and raw number of times that each n-gram appears in the file. 
Frequency List By default, MEH will output an [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":6,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-73","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/pages\/73","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/comments?post=73"}],"version-history":[{"count":8,"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/pages\/73\/revisions"}],"predecessor-version":[{"id":408,"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/pages\/73\/revisions\/408"}],"wp:attachment":[{"href":"https:\/\/www.ryanboyd.io\/software\/meh\/wp-json\/wp\/v2\/media?parent=73"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
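To make the DWL and frequency-list descriptions on this page concrete, here is a minimal Python sketch that parses a few hypothetical DWL lines and derives the four frequency-list columns. The field names (`filename`, `segment`, `counts`) and the IDF formula (natural log of total documents over documents-with-token, with no smoothing) are illustrative assumptions; MEH's actual key names and formula may differ.

```python
import json
import math

# Three hypothetical DWL lines -- the real key names in MEH's ndjson
# output may differ; this only illustrates the newline-delimited layout.
dwl_lines = [
    '{"filename": "doc1.txt", "segment": 1, "counts": {"dog": 2, "cat": 1}}',
    '{"filename": "doc2.txt", "segment": 1, "counts": {"dog": 1}}',
    '{"filename": "doc3.txt", "segment": 1, "counts": {"cat": 3}}',
]

# Aggregate per-document counts into corpus-level statistics.
frequency = {}        # total times each n-gram appears in the dataset
docs_with_token = {}  # number of documents containing each n-gram

for line in dwl_lines:
    record = json.loads(line)  # each DWL line is one standalone JSON object
    for ngram, count in record["counts"].items():
        frequency[ngram] = frequency.get(ngram, 0) + count
        docs_with_token[ngram] = docs_with_token.get(ngram, 0) + 1

n_docs = len(dwl_lines)
observation_pct = {t: 100.0 * d / n_docs for t, d in docs_with_token.items()}
# One common IDF definition: log(N / docs-with-token); treat as illustrative.
idf = {t: math.log(n_docs / d) for t, d in docs_with_token.items()}
```

Because each line is an independent JSON object, this aggregation can be done in a single streaming pass, which is why the ndjson layout scales to very large corpora.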
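Likewise, the relationship among the raw-count (DTM), binary, and verbose outputs described on this page can be sketched as two simple transformations of the same matrix. The data below are made up; the code only mirrors the definitions given above (presence/absence for binary, count divided by word count for verbose).

```python
# A hypothetical raw-count document-term matrix (rows = observations,
# columns = n-grams), mirroring MEH's DTM output.
terms = ["dog", "cat", "ran"]
dtm = [
    [2, 0, 1],   # doc1
    [0, 3, 0],   # doc2
]
word_counts = [10, 12]  # total word count of each observation

# Binary (one-hot) output: 1/0 presence/absence of each n-gram.
binary = [[1 if c > 0 else 0 for c in row] for row in dtm]

# Verbose output: raw count divided by that observation's word count.
verbose = [[c / wc for c in row] for row, wc in zip(dtm, word_counts)]
```

The binary form discards magnitude entirely, which is what makes it robust for the meaning extraction method, while the verbose form preserves relative frequency information.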