The Corpus
The US Army Text (Field Manual) English Corpus was constructed in June 2020 and analyzed to find the key words of each component document. The corpus totals 1,966,986 words.
The source texts were downloaded from the Internet as .pdf files and then converted to .txt files. Copyright Note: all texts are in the public domain, as they were written and released by the US Government. These files were then edited: the title pages, contents pages, etc. were removed, as were the glossary and references pages at the end of each document.
Corpus Analysis
The edited texts were then analyzed using the WordSmith Tools program to create a word list for each field manual. These word lists were compared with the word list from the COCA Extracts (Academic, Blog, Fiction, Magazine, News, Web) reference corpus of 8,271,003 words.
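The comparison of a study word list with a reference word list can be sketched in Python as follows. This is only an illustration, not the procedure WordSmith Tools actually ran: it assumes a log-likelihood keyness statistic (one commonly used measure; the statistic and settings used for this corpus are not stated here), and the function names `word_list`, `log_likelihood` and `key_words`, the simple tokenizer, and the `min_freq` cut-off are all hypothetical.

```python
import math
import re
from collections import Counter


def word_list(text):
    """Build a lowercased word-frequency list from raw text (simple tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a = frequency in the study text, b = frequency in the reference corpus,
    c = total words in the study text, d = total words in the reference corpus.
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def key_words(study, reference, min_freq=2):
    """Return (word, score) pairs for words relatively more frequent in the study list."""
    c = sum(study.values())
    d = sum(reference.values())
    scored = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        # Keep only positive key words: higher relative frequency in the study text.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In use, `study` would be the word list of one field manual and `reference` the word list of the reference corpus; the highest-scoring words are the candidate key words for that manual.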
Each individual key word list (one for each text) was then edited: most proper nouns were removed, plural forms were deleted, and verb forms were reduced to the base form (e.g. exfiltrate). The abbreviations and acronyms were removed for separate study.
The word list for the whole US Army Text (Field Manual) English Corpus was also analyzed, and this produced the key word list for the entire corpus, which can be found here as a free download in .pdf format.
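Part of this editing step could be approximated in code, as in the rough sketch below. It is not the method used for this corpus, which is not described in detail and may well have been done by hand. The function name `edit_key_word_list` is hypothetical, and the rules are assumptions: an all-uppercase item of two or more characters is treated as an abbreviation or acronym, and a plural is dropped only when its singular is also in the list. Removing proper nouns and reducing verb forms need human judgment or a lemmatizer and are not attempted.

```python
def edit_key_word_list(words):
    """Split a key word list into (edited key words, abbreviations and acronyms)."""
    # Assumption: all-uppercase items of length >= 2 are abbreviations or acronyms.
    acronyms = [w for w in words if len(w) >= 2 and w.isupper()]
    acronym_set = set(acronyms)
    rest = [w for w in words if w not in acronym_set]

    present = {w.lower() for w in rest}
    kept = []
    for w in rest:
        lower = w.lower()
        # Drop a plural only if its singular form is also in the list.
        if lower.endswith("s") and lower[:-1] in present:
            continue
        if lower.endswith("es") and lower[:-2] in present:
            continue
        kept.append(w)
    return kept, acronyms
```

The two returned lists correspond to the edited key word list and the abbreviations and acronyms set aside for separate study.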