Training Classifiers

Example usage with the movie_reviews corpus can be found in Training Binary Text Classifiers with NLTK Trainer.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes
Include bigrams as features:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2
Minimum score threshold:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3
Maximum number of features:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000
Use the default Maxent algorithm:
python train_classifier.py movie_reviews --instances paras --classifier Maxent
Use the MEGAM Maxent algorithm:
python train_classifier.py movie_reviews --instances paras --classifier MEGAM
Train on files instead of paragraphs:
python train_classifier.py movie_reviews --instances files --classifier MEGAM
Train on sentences:
python train_classifier.py movie_reviews --instances sents --classifier MEGAM
Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle
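The --fraction option splits the instances at the given cutoff: the first portion trains the classifier and the remainder is held out for the accuracy evaluation. A rough pure-Python sketch of the split (split_instances is a hypothetical helper for illustration, not part of nltk-trainer):

```python
# Sketch of the split behind --fraction 0.75: the first 75% of the
# instances train the classifier, the remaining 25% are held out for
# evaluation. split_instances is a hypothetical helper name.
def split_instances(instances, fraction=0.75):
    cutoff = int(len(instances) * fraction)
    return instances[:cutoff], instances[cutoff:]

train, test = split_instances(list(range(100)), fraction=0.75)
# train holds 75 instances, test holds the remaining 25
```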

The following classifiers are available:

If you also have scikit-learn installed, the following classifiers will also be available, along with sklearn-specific training options. If there is an sklearn classifier or training option you want that is not present, please submit an issue.

For example, here’s how to use the sklearn.LinearSVC classifier with the movie_reviews corpus:
python train_classifier.py movie_reviews --classifier sklearn.LinearSVC
For a complete list of usage options:
python train_classifier.py --help

There are also many usage examples shown in Chapter 7 of Python 3 Text Processing with NLTK 3 Cookbook.

Using a Trained Classifier

You can use a trained classifier by loading the pickle file using nltk.data.load:
>>> import nltk.data
>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")
Or, if your classifier pickle file is not in an nltk_data subdirectory, you can load it with pickle.load (be sure to open the file in binary mode):
>>> import pickle
>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle", "rb"))

Either method will return an object that supports the ClassifierI interface.
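To see the pickle round trip in isolation, here is a minimal sketch that saves and reloads a stand-in object the same way (DummyClassifier is hypothetical; a real pickle file would contain a trained NLTK classifier):

```python
import os
import pickle
import tempfile

# DummyClassifier is a hypothetical stand-in for a trained NLTK
# classifier; only the pickle round trip is being demonstrated here.
class DummyClassifier:
    def classify(self, feats):
        return 'pos' if feats else 'neg'

path = os.path.join(tempfile.mkdtemp(), 'classifier.pickle')
with open(path, 'wb') as f:
    pickle.dump(DummyClassifier(), f)

# Pickle files must be opened in binary mode ('rb') when loading.
with open(path, 'rb') as f:
    classifier = pickle.load(f)
```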

Once you have a classifier object, you can use it to classify a dictionary of word features with the classifier.classify(feats) method, which returns a label:
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = {word: True for word in words}
>>> classifier.classify(feats)
If you used the --ngrams option with values greater than 1, you must also include those ngrams in the dictionary, using nltk.util.ngrams(words, n). Note that ngrams() returns a generator of tuples, so wrap it in list() before concatenating. For bigrams (n=2):
>>> from nltk.util import ngrams
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = {token: True for token in words + list(ngrams(words, 2))}
>>> classifier.classify(feats)
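The unigram-plus-ngram dictionary above can be wrapped in a small reusable helper. This is a pure-Python sketch (bag_of_ngrams is a hypothetical name; the inline ngrams function mirrors the sliding-window behavior of nltk.util.ngrams):

```python
def ngrams(words, n):
    """Return the list of n-gram tuples over words, like nltk.util.ngrams."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_ngrams(words, max_n=2):
    """Build a {feature: True} dict of unigrams through max_n-grams."""
    feats = {word: True for word in words}
    for n in range(2, max_n + 1):
        for gram in ngrams(words, n):
            feats[gram] = True
    return feats

feats = bag_of_ngrams(['some', 'words', 'in', 'a', 'sentence'])
# feats contains both 'some' and the bigram tuple ('some', 'words')
```

The resulting dictionary can be passed directly to classifier.classify(feats).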

The list of words you use to create the feature dictionary should come from tokenizing the appropriate text instances: sentences, paragraphs, or files, depending on the --instances option.
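A minimal sketch of that tokenize-then-featurize step, using a crude regex tokenizer as a stand-in for a real NLTK tokenizer such as nltk.tokenize.word_tokenize:

```python
import re

def tokenize(text):
    """Crude word tokenizer; a stand-in for nltk.tokenize.word_tokenize."""
    return re.findall(r"[a-z0-9']+", text.lower())

def extract_feats(text):
    """Tokenize an instance and build the {word: True} feature dict."""
    return {word: True for word in tokenize(text)}

feats = extract_feats("This movie was great!")
# feats == {'this': True, 'movie': True, 'was': True, 'great': True}
```

For a classifier trained with the default options, the result would then be passed to classifier.classify(feats).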

Most of the sentiment classifiers used by text-processing.com were trained with train_classifier.py.