Analyzing a Tagged Corpus¶
The analyze_tagged_corpus.py
script will show the following statistics about a tagged corpus:
- total number of words
- number of unique words
- number of tags
- the number of times each tag occurs
Example output can be found in Analyzing Tagged Corpora and NLTK Part of Speech Taggers.
- To analyze the treebank corpus:
python analyze_tagged_corpus.py treebank
- To sort the output by tag count from highest to lowest:
python analyze_tagged_corpus.py treebank --sort count --reverse
- To see simplified tags, instead of standard tags:
python analyze_tagged_corpus.py treebank --simplify_tags
- To analyze a custom corpus, whose fileids end in “.pos”, using a TaggedCorpusReader:
python analyze_tagged_corpus.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'
The corpus path can be absolute, or relative to a nltk_data directory. For example, both corpora/treebank/tagged
and /usr/share/nltk_data/corpora/treebank/tagged
will work.
- For a complete list of usage options:
python analyze_tagged_corpus.py --help