Analyzing Tagger Coverage

The analyze_tagger_coverage.py script will run a part-of-speech tagger over a corpus to determine how many times each tag is found. Example output can be found in Analyzing Tagged Corpora and NLTK Part of Speech Taggers.

Here’s an example using the NLTK default tagger on the treebank corpus:
python analyze_tagger_coverage.py treebank
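
Under the hood this amounts to tagging every sentence and counting tags. Here is a minimal sketch of the same idea in Python, assuming NLTK, its default tagger model, and the treebank corpus are installed (it uses nltk.pos_tag directly, which may differ from exactly how the script loads its tagger):

    import nltk
    from nltk.corpus import treebank

    # Count how often the tagger assigns each tag across the corpus.
    tag_counts = nltk.FreqDist()

    for sent in treebank.sents():
        for word, tag in nltk.pos_tag(sent):
            tag_counts[tag] += 1

    for tag, count in tag_counts.most_common():
        print(tag, count)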

To get detailed metrics on each tag, you can use the --metrics option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See NLTK Default Tagger Treebank Tag Coverage and NLTK Default Tagger CoNLL2000 Tag Coverage for examples and statistics.
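
As a rough illustration of what per-tag metrics involve (a sketch, not the script's actual implementation), you can line up the gold tags from treebank.tagged_sents() against the output of nltk.pos_tag and compute per-tag precision and recall with nltk.metrics, assuming the relevant NLTK data packages are installed:

    import collections

    import nltk
    from nltk.corpus import treebank
    from nltk.metrics import precision, recall

    gold_sets = collections.defaultdict(set)
    test_sets = collections.defaultdict(set)

    # Record, per tag, which (sentence, token) positions carry that tag
    # in the gold corpus versus in the tagger's output.
    for i, gold_sent in enumerate(treebank.tagged_sents()):
        words = [word for word, tag in gold_sent]
        for j, (word, tag) in enumerate(gold_sent):
            gold_sets[tag].add((i, j))
        for j, (word, tag) in enumerate(nltk.pos_tag(words)):
            test_sets[tag].add((i, j))

    # precision()/recall() return None for tags the tagger never assigns.
    for tag in sorted(gold_sets):
        print(tag, precision(gold_sets[tag], test_sets[tag]),
              recall(gold_sets[tag], test_sets[tag]))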

If no tagger is specified, NLTK's default tagger is used. To analyze coverage with a different tagger, use the --tagger option with a path to a pickled tagger, as in:
python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle
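
The --tagger value is a path to a pickled tagger object. A small sketch of loading and applying one directly, assuming it was saved with pickle (the path below is the same placeholder as in the command above):

    import pickle

    from nltk.corpus import treebank

    # Placeholder path; any pickled NLTK tagger (TaggerI) should work here.
    with open('/path/to/tagger.pickle', 'rb') as f:
        tagger = pickle.load(f)

    # Tag a few treebank sentences with the loaded tagger.
    for sent in treebank.sents()[:3]:
        print(tagger.tag(sent))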
You can also analyze tagger coverage over a custom corpus. For example, with a corpus whose fileids end in ".pos", you can use a TaggedCorpusReader:
python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'
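
For reference, here is roughly what that reader configuration does, as a sketch assuming the files under /path/to/corpus (a placeholder path) are in the word/tag format that TaggedCorpusReader expects:

    from nltk.corpus.reader.tagged import TaggedCorpusReader

    # Match every file ending in .pos, mirroring the --fileids regex above.
    reader = TaggedCorpusReader('/path/to/corpus', r'.+\.pos')

    print(reader.fileids())
    print(reader.tagged_sents()[:2])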

The corpus path can be absolute, or relative to an nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.
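
A quick way to check that a relative path resolves is NLTK's own resource lookup (which may not be exactly what the script does internally), assuming the treebank corpus is installed under one of your nltk_data directories:

    import nltk.data

    # Resolves against the nltk_data search path, e.g. to
    # /usr/share/nltk_data/corpora/treebank/tagged if installed there.
    print(nltk.data.find('corpora/treebank/tagged'))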

For a complete list of usage options:
python analyze_tagger_coverage.py --help