Corpus Linguistics

In 2004 I developed a one and a half million word corpus of texts related to the work of police officers and other criminal justice system professionals.

The Criminal Justice System Key Word List was developed from an analysis of this corpus to answer the question of which vocabulary to teach senior police officers and other criminal justice system professionals.

Vocabulary Size Research

Research into vocabulary size and coverage has established that a small number of very common words accounts for most of the running words in a text. The table below, cited in Nation (2001), shows that the most common two thousand words cover 81.3% of text. Beyond that point, the learner faces the huge task of learning thousands more words to achieve any significant gain in coverage.

Different words    Percent of word tokens in average text
86,741             100%
Word Coverage, Nation (2001)

Research suggests that guessing word meaning from context is only effective when the reader already knows about 95% of the running words of a text. To achieve this, the learner would have to learn another 10,500 words. Clearly, this is an almost impossible task for most learners.

The alternative is to choose a more restricted set of words from a particular genre of English. This will give high coverage within that specific genre. One attempt to do this is the Academic Word List.

The Academic Word List

The AWL was developed by Averil Coxhead at the School of Linguistics and Applied Language Studies at Victoria University of Wellington. It is a set of words which are found in a wide variety of academic texts. The key factors in their selection were their coverage (as a percentage of text) and their range. The AWL consists of 570 word families. Combined with the first 2000 words, the AWL provides roughly 86% coverage of a wide variety of academic texts.

The CJS Key Word List

The CJS list was created by a process similar to the one used for the AWL: an analysis of the corpus, followed by the elimination of words found in the first 2000 words, proper nouns and words with very limited coverage. The result of this process is a list of 850 word families (2,716 words in total) which provide 10-15% coverage of texts of interest to criminal justice professionals.
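The filtering steps described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to build the CJS list: the function name, the `general_words` set (standing in for the first 2000 words), the "never seen in lowercase" test for proper nouns and the `min_count` threshold (standing in for "very limited coverage") are all assumptions made for the example.

```python
import re
from collections import Counter


def candidate_key_words(corpus_text, general_words, min_count=2):
    """Return {word: count} for words that survive three filters.

    Assumptions for this sketch (not taken from the CJS project):
    - general_words stands in for the first 2000 words of English
    - a word that never occurs in lowercase is treated as a proper noun
    - min_count stands in for the "very limited coverage" cut-off
    """
    # Keep the original case so the proper-noun heuristic can use it.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", corpus_text)
    counts = Counter(token.lower() for token in tokens)
    seen_lowercase = {token for token in tokens if token.islower()}

    kept = {}
    for word, count in counts.items():
        if word in general_words:       # already in the general list
            continue
        if word not in seen_lowercase:  # only ever capitalised: likely a name
            continue
        if count < min_count:           # too rare to be worth teaching
            continue
        kept[word] = count
    return kept


sample = (
    "The officer made an arrest. The arrest followed a warrant. "
    "London police issued the warrant. The custody officer agreed."
)
general = {"the", "made", "an", "a", "followed", "issued", "agreed", "police"}
print(candidate_key_words(sample, general))
```

In a real project the surviving words would then be grouped into word families and checked for range across the different kinds of text in the corpus.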

Word Families

By a word family I mean a headword together with the words which can be derived from it by the addition of prefixes, suffixes and so on. In the CJS list there is this example:


As can be seen from the table, many variant spellings are also included as part of the word family.
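For readers who want to work with word families in their own scripts, one simple representation is a mapping from every family member to its headword. The sketch below assumes this representation; the two families shown are hypothetical illustrations and are not copied from the CJS list.

```python
from collections import Counter

# Hypothetical families for illustration only (not taken from the CJS list).
# Each headword maps to its derived forms, including variant spellings.
FAMILIES = {
    "arrest": ["arrests", "arrested", "arresting"],
    "organise": ["organize", "organised", "organized",
                 "organisation", "organization"],
}


def build_lookup(families):
    """Map every member of a family, and the headword itself, to the headword."""
    lookup = {}
    for headword, members in families.items():
        lookup[headword] = headword
        for member in members:
            lookup[member] = headword
    return lookup


def count_families(tokens, lookup):
    """Count occurrences per family, ignoring tokens that belong to no family."""
    return Counter(lookup[token] for token in tokens if token in lookup)


lookup = build_lookup(FAMILIES)
tokens = ["arrested", "organized", "crime", "arrests", "organisation"]
print(count_families(tokens, lookup))
```

Counting by family rather than by individual word form means that "organised" and "organized" are treated as the same item.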

CJS List Coverage

The table below shows how much of the text in different genres is covered by the first 1000 words, the second 1000 words and the AWL. The same CJS text appears twice: once analysed with the AWL and once with the CJS list in its place.

Levels      Conv.    Fiction   News     Acad     CJS Text   CJS Text
1st 1000    84.3%    82.3%     75.6%    73.5%    72.3%      72.3%
2nd 1000    6%       5.1%      4.7%     4.6%     7.6%       7.6%
Acad        1.9%     1.7%      3.9%     8.5%     9.3%       14.4% (CJS List)
Nation 2001 and Buckmaster 2004
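Adding up each column gives the cumulative coverage for each kind of text. The short Python sketch below does this with the figures from the table. The labels "CJS Text (AWL)" and "CJS Text (CJS List)" are my own names for the two CJS Text columns, on the assumption that the first was analysed with the AWL and the second with the CJS list.

```python
# Coverage figures from the table: [1st 1000, 2nd 1000, third list].
COVERAGE = {
    "Conv.": [84.3, 6.0, 1.9],
    "Fiction": [82.3, 5.1, 1.7],
    "News": [75.6, 4.7, 3.9],
    "Acad": [73.5, 4.6, 8.5],
    "CJS Text (AWL)": [72.3, 7.6, 9.3],
    "CJS Text (CJS List)": [72.3, 7.6, 14.4],
}

# Round to one decimal place to match the precision of the table.
cumulative = {genre: round(sum(parts), 1) for genre, parts in COVERAGE.items()}

for genre, total in cumulative.items():
    print(f"{genre:<22}{total:>6.1f}%")
```

On these figures the CJS text reaches 89.2% coverage with the AWL and 94.3% with the CJS list, which is close to the 95% level needed for guessing from context.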

The Paul Nation Program

Paul Nation of Victoria University has developed a text analysis program which analyses texts using the first and second 1000 words and the AWL. The program is very easy to use, and the AWL can be replaced with the CJS Key Word List.

The program can be downloaded from here.

The CJS list can be downloaded (top left) as a .txt file ready to be used with the Paul Nation program. Make sure you save it as a .txt file.

With this program you can analyse texts and see how much of the text is covered by the first and second thousand words and how much by the AWL and CJS list. The program also tells you which words do not appear on any list. This can help you with such things as estimating the level of difficulty of the text and its usefulness for your students.
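The kind of analysis described above can be approximated in a few lines of Python. This is a minimal sketch for illustration and is not the Paul Nation program or its method: the function name, the simple tokenisation rule and the tiny word lists are all assumptions. In practice each list would be loaded from a .txt file and would contain full word families.

```python
import re


def coverage_profile(text, word_lists):
    """Return (percentages, off_list_words) for a text.

    word_lists maps a list name to a set of lowercase words. Lists are
    checked in the order given, so each token counts towards the first
    list that contains it.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = {name: 0 for name in word_lists}
    off_list = set()

    for token in tokens:
        for name, words in word_lists.items():
            if token in words:
                counts[name] += 1
                break
        else:
            # The token was not found on any list.
            off_list.add(token)

    total = len(tokens)
    percentages = {
        name: (100.0 * count / total if total else 0.0)
        for name, count in counts.items()
    }
    return percentages, sorted(off_list)


# Tiny hypothetical lists, for demonstration only.
lists = {
    "first_1000": {"the", "was", "in"},
    "cjs": {"suspect", "custody", "bail"},
}
text = "The suspect was in custody. The magistrate granted bail today."
percentages, unknown = coverage_profile(text, lists)
print(percentages)
print(unknown)
```

The list of off-list words is often the most useful output for a teacher, since it shows exactly which words in a text the students are unlikely to have met.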

Using the CJS Key Word List

The CJS list can be used with the Paul Nation program, as mentioned above. It can also be used to develop vocabulary learning exercises and tests which focus on the important words that your students really need to learn.

References and Links

AWL Site

Nation, P. (2001). Learning vocabulary in another language. New York: Cambridge University Press.