Training Classifiers

Example usage with the movie_reviews corpus can be found in Training Binary Text Classifiers with NLTK Trainer.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes
Include bigrams as features:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2
Minimum score threshold:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3
Maximum number of features:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000
Use the default Maxent algorithm:
python train_classifier.py movie_reviews --instances paras --classifier Maxent
Use the MEGAM Maxent algorithm:
python train_classifier.py movie_reviews --instances paras --classifier MEGAM
Train on files instead of paragraphs:
python train_classifier.py movie_reviews --instances files --classifier MEGAM
Train on sentences:
python train_classifier.py movie_reviews --instances sents --classifier MEGAM
Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle
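The --fraction option splits the instances at the given cutoff: the first portion trains the classifier and the remainder is held out for the accuracy evaluation. A rough pure-Python sketch of the split (split_instances is a hypothetical helper for illustration, not part of nltk-trainer):

```python
# Sketch of the split behind --fraction 0.75: the first 75% of the
# instances train the classifier, the remaining 25% are held out for
# evaluation. split_instances is a hypothetical helper name.
def split_instances(instances, fraction=0.75):
    cutoff = int(len(instances) * fraction)
    return instances[:cutoff], instances[cutoff:]

train, test = split_instances(list(range(100)), fraction=0.75)
# train holds 75 instances, test holds the remaining 25
```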

The following classifiers are available:

If you also have scikit-learn installed, the following classifiers will also be available, along with sklearn-specific training options. If there is an sklearn classifier or training option you want that is not present, please submit an issue.

For example, here’s how to use the sklearn.LinearSVC classifier with the movie_reviews corpus:
python train_classifier.py movie_reviews --classifier sklearn.LinearSVC
For a complete list of usage options:
python train_classifier.py --help

There are also many usage examples shown in Chapter 7 of Python 3 Text Processing with NLTK 3 Cookbook.

Using a Trained Classifier

You can use a trained classifier by loading the pickle file using nltk.data.load:
>>> import nltk.data
>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")
Or, if your classifier pickle file is not in an nltk_data subdirectory, you can load it with pickle.load (be sure to open the file in binary mode):
>>> import pickle
>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle", "rb"))

Either method will return an object that supports the ClassifierI interface.
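To see the pickle round trip in isolation, here is a minimal sketch that saves and reloads a stand-in object the same way (DummyClassifier is hypothetical; a real pickle file would contain a trained NLTK classifier):

```python
import os
import pickle
import tempfile

# DummyClassifier is a hypothetical stand-in for a trained NLTK
# classifier; only the pickle round trip is being demonstrated here.
class DummyClassifier:
    def classify(self, feats):
        return 'pos' if feats else 'neg'

path = os.path.join(tempfile.mkdtemp(), 'classifier.pickle')
with open(path, 'wb') as f:
    pickle.dump(DummyClassifier(), f)

# Pickle files must be opened in binary mode ('rb') when loading.
with open(path, 'rb') as f:
    classifier = pickle.load(f)
```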

Once you have a classifier object, you can use it to classify a dictionary of word features with the classifier.classify(feats) method, which returns a label:
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = {word: True for word in words}
>>> classifier.classify(feats)
If you used the --ngrams option with values greater than 1, you must also include those ngrams in the dictionary, using nltk.util.ngrams(words, n). Note that ngrams() returns a generator of tuples, so wrap it in list() before concatenating. For bigrams (n=2):
>>> from nltk.util import ngrams
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = {token: True for token in words + list(ngrams(words, 2))}
>>> classifier.classify(feats)
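The unigram-plus-ngram dictionary above can be wrapped in a small reusable helper. This is a pure-Python sketch (bag_of_ngrams is a hypothetical name; the inline ngrams function mirrors the sliding-window behavior of nltk.util.ngrams):

```python
def ngrams(words, n):
    """Return the list of n-gram tuples over words, like nltk.util.ngrams."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_ngrams(words, max_n=2):
    """Build a {feature: True} dict of unigrams through max_n-grams."""
    feats = {word: True for word in words}
    for n in range(2, max_n + 1):
        for gram in ngrams(words, n):
            feats[gram] = True
    return feats

feats = bag_of_ngrams(['some', 'words', 'in', 'a', 'sentence'])
# feats contains both 'some' and the bigram tuple ('some', 'words')
```

The resulting dictionary can be passed directly to classifier.classify(feats).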

The list of words you use to create the feature dictionary should come from tokenizing the appropriate text instances: sentences, paragraphs, or files, depending on the --instances option.
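A minimal sketch of that tokenize-then-featurize step, using a crude regex tokenizer as a stand-in for a real NLTK tokenizer such as nltk.tokenize.word_tokenize:

```python
import re

def tokenize(text):
    """Crude word tokenizer; a stand-in for nltk.tokenize.word_tokenize."""
    return re.findall(r"[a-z0-9']+", text.lower())

def extract_feats(text):
    """Tokenize an instance and build the {word: True} feature dict."""
    return {word: True for word in tokenize(text)}

feats = extract_feats("This movie was great!")
# feats == {'this': True, 'movie': True, 'was': True, 'great': True}
```

For a classifier trained with the default options, the result would then be passed to classifier.classify(feats).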

Most of the sentiment classifiers used by text-processing.com were trained with train_classifier.py.