Training IOB Chunkers
The train_chunker.py script can use any corpus included with NLTK that implements a chunked_sents() method.
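One way to check in advance whether a corpus reader qualifies is to look for a chunked_sents() method on the reader class. The sketch below is only an illustration of that requirement, not how train_chunker.py itself validates its input; the helper name supports_chunking is hypothetical.

```python
from nltk.corpus.reader import (
    ChunkedCorpusReader,
    ConllChunkCorpusReader,
    PlaintextCorpusReader,
)


def supports_chunking(reader_class):
    """Return True if the reader class exposes a callable chunked_sents().

    Hypothetical helper for illustration; it is not part of nltk-trainer.
    """
    return callable(getattr(reader_class, "chunked_sents", None))


# Check the reader classes directly, so no corpus data has to be installed.
results = {
    cls.__name__: supports_chunking(cls)
    for cls in (ChunkedCorpusReader, ConllChunkCorpusReader, PlaintextCorpusReader)
}
print(results)
```

Readers for chunked corpora pass the check, while a plain-text reader does not, so a corpus read with the latter would not be usable for chunker training.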
- To train the default sequential backoff tagger-based chunker on the treebank_chunk corpus:
python train_chunker.py treebank_chunk
- To train a NaiveBayes classifier-based chunker:
python train_chunker.py treebank_chunk --classifier NaiveBayes
- To train on the conll2000 corpus:
python train_chunker.py conll2000
- To train on a custom corpus, whose fileids end in “.pos”, using a ChunkedCorpusReader:
python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'
The corpus path can be absolute, or relative to an nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.
- You can also restrict the files used with the --fileids option:
python train_chunker.py conll2000 --fileids train.txt
- For a complete list of usage options:
python train_chunker.py --help
There are also many usage examples shown in Chapter 5 of Python 3 Text Processing with NLTK 3 Cookbook.
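To make the custom corpus option more concrete, the following sketch reads a small ".pos" file with ChunkedCorpusReader and the same fileids pattern used in the command above. The sample sentence, the file names, and the temporary directory are made up for illustration, and the bracketed word/TAG layout is assumed to be the default format that ChunkedCorpusReader parses.

```python
import os
import tempfile

from nltk.corpus.reader.chunked import ChunkedCorpusReader
from nltk.tree import Tree

# Hypothetical sample data: one sentence per line, chunks in square brackets,
# each token written as word/TAG.
SAMPLE = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ] ./.\n"

with tempfile.TemporaryDirectory() as root:
    # One file that matches the pattern and one that should be ignored.
    with open(os.path.join(root, "sample.pos"), "w", encoding="utf8") as f:
        f.write(SAMPLE)
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf8") as f:
        f.write("not part of the corpus\n")

    # Same reader class and fileids pattern as in the command line example.
    reader = ChunkedCorpusReader(root, r".+\.pos")
    fileids = reader.fileids()
    # Materialize the sentences before the temporary directory is removed.
    sents = list(reader.chunked_sents())

print(fileids)
print(sents[0])
```

If chunked_sents() yields Tree objects like this for your own files, the same path and pattern should be usable with the --reader and --fileids options.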
Using a Trained Chunker
- You can use a trained chunker by loading the pickle file using nltk.data.load:
>>> import nltk.data
>>> chunker = nltk.data.load("chunkers/NAME_OF_CHUNKER.pickle")
- Or, if your chunker pickle file is not in an nltk_data subdirectory, you can load it with pickle.load (open the file in binary mode):
>>> import pickle
>>> chunker = pickle.load(open("/path/to/NAME_OF_CHUNKER.pickle", "rb"))
- Either method will return an object that supports the ChunkParserI interface. Before you can use this chunker, you must have a trained tagger (http://nltk-trainer.readthedocs.org/en/latest/train_tagger.html#using-a-trained-tagger). First use the tagger to tag a sentence, then parse the tagged sentence with the chunker's parse() method:
>>> chunker.parse(tagged_words)
chunker.parse(tagged_words) returns a Tree whose subtrees are the chunks and whose leaves are the original tagged words.
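The tag-then-chunk flow described above can be sketched as follows. Because a trained pickle may not be available, this example makes two substitutions that are assumptions, not part of nltk-trainer: a hand-tagged sentence stands in for the output of a trained tagger, and an NLTK RegexpParser (which also implements ChunkParserI) stands in for the trained chunker. A chunker loaded from a pickle should be usable the same way.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Stand-in for tagger.tag(words): a hand-tagged sentence (assumed tags).
tagged_words = [
    ("the", "DT"), ("little", "JJ"), ("cat", "NN"),
    ("sat", "VBD"), ("on", "IN"),
    ("the", "DT"), ("mat", "NN"),
]

# Stand-in for a trained chunker: a simple noun phrase grammar.
chunker = RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")

# parse() takes the tagged words and returns a Tree.
tree = chunker.parse(tagged_words)

# Each chunk is a subtree; collect its label and the words it covers.
chunks = [
    (subtree.label(), " ".join(word for word, tag in subtree.leaves()))
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]

print(tree)
print(chunks)
```

The leaves of the returned tree are the original (word, tag) pairs, so tree.leaves() can be compared against the tagger output when debugging a chunker.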
All of the chunkers demonstrated at text-processing.com were trained with train_chunker.py.