Training Classifiers¶
Example usage with the movie_reviews corpus can be found in Training Binary Text Classifiers with NLTK Trainer.
- Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes
- Include bigrams as features:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2
- Minimum score threshold:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3
- Maximum number of features:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000
- Use the default Maxent algorithm:
python train_classifier.py movie_reviews --instances paras --classifier Maxent
- Use the MEGAM Maxent algorithm:
python train_classifier.py movie_reviews --instances paras --classifier MEGAM
- Train on files instead of paragraphs:
python train_classifier.py movie_reviews --instances files --classifier MEGAM
- Train on sentences:
python train_classifier.py movie_reviews --instances sents --classifier MEGAM
- Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling:
python train_classifier.py movie_reviews --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle
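In essence, NaiveBayes training comes down to building a feature dict per instance and handing the labeled dicts to NLTK's classifier. Here is a minimal sketch of that flow, using invented toy instances in place of the movie_reviews corpus (the `bag_of_words` helper and the sample data are illustrative, not part of train_classifier.py):

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Toy labeled word lists standing in for movie_reviews paragraphs.
train_texts = [
    (['great', 'wonderful', 'film'], 'pos'),
    (['superb', 'acting', 'great'], 'pos'),
    (['terrible', 'boring', 'plot'], 'neg'),
    (['awful', 'boring', 'waste'], 'neg'),
]

def bag_of_words(words):
    """Unigram bag-of-words features: each word maps to True."""
    return {word: True for word in words}

train_feats = [(bag_of_words(words), label) for words, label in train_texts]
classifier = NaiveBayesClassifier.train(train_feats)

# Evaluation against held-out instances works the same way.
test_feats = [(bag_of_words(['great', 'film']), 'pos'),
              (bag_of_words(['boring', 'awful']), 'neg')]
print(accuracy(classifier, test_feats))
```

The `--fraction 0.75` option performs this same split automatically, training on the first 3/4 of the instances and evaluating on the rest.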
The following classifiers are available:
- NaiveBayes
- DecisionTree
- Maxent (with various algorithms; many of these require numpy and scipy, and MEGAM requires megam)
- Svm (requires svmlight and pysvmlight)
If you also have scikit-learn installed, then the following classifiers will also be available, with sklearn-specific training options. If there is an sklearn classifier or training option you want that is not present, please submit an issue.
- For example, here’s how to use the sklearn.LinearSVC classifier with the movie_reviews corpus:
python train_classifier.py movie_reviews --classifier sklearn.LinearSVC
- For a complete list of usage options:
python train_classifier.py --help
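NLTK bridges scikit-learn estimators through the SklearnClassifier wrapper in nltk.classify.scikitlearn, which is the natural way for a script like this to expose sklearn.LinearSVC. A minimal sketch with invented toy feature dicts (the sample data is illustrative, not taken from movie_reviews):

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# Toy labeled feature dicts; real training data would come from the corpus.
train_feats = [
    ({'great': True, 'film': True}, 'pos'),
    ({'superb': True, 'great': True}, 'pos'),
    ({'boring': True, 'awful': True}, 'neg'),
    ({'terrible': True, 'boring': True}, 'neg'),
]

# SklearnClassifier vectorizes the dicts and trains the wrapped estimator.
classifier = SklearnClassifier(LinearSVC()).train(train_feats)
print(classifier.classify({'great': True}))
```

The wrapper implements the same ClassifierI interface as NLTK's own classifiers, so the resulting object is used exactly like a NaiveBayes classifier.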
There are also many usage examples shown in Chapter 7 of Python 3 Text Processing with NLTK 3 Cookbook.
Using a Trained Classifier¶
- You can use a trained classifier by loading the pickle file using nltk.data.load:
>>> import nltk.data
>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")
- Or if your classifier pickle file is not in a nltk_data subdirectory, you can load it with pickle.load (note the file must be opened in binary mode):
>>> import pickle
>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle", "rb"))
Either method will return an object that supports the ClassifierI interface.
- Once you have a classifier object, you can use it to classify word features with the classifier.classify(feats) method, which returns a label:
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words])
>>> classifier.classify(feats)
- If you used the --ngrams option with values greater than 1, you should include these ngrams in the dictionary using nltk.util.ngrams(words, n) (ngrams returns an iterator of tuples, so convert it to a list before concatenating):
>>> from nltk.util import ngrams
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words + list(ngrams(words, n))])
>>> classifier.classify(feats)
The list of words you use for creating the feature dictionary should be created by tokenizing the appropriate text instances: sentences, paragraphs, or files depending on the --instances
option.
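Putting the two steps together, a combined unigram-plus-bigram feature dict looks like this. A simple str.split stands in for the tokenizer here as an assumption for illustration; in practice you should tokenize the same way the training instances were tokenized:

```python
from nltk.util import ngrams

# Stand-in tokenization; match your training tokenization in real use.
text = "the plot was surprisingly good"
words = text.split()

feats = {word: True for word in words}                # unigrams
feats.update({bg: True for bg in ngrams(words, 2)})   # bigrams, for --ngrams 2

# Bigram keys are tuples, e.g. ('surprisingly', 'good').
print(('surprisingly', 'good') in feats)
```

The resulting dict can be passed directly to classifier.classify(feats).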
Most of the sentiment classifiers used by text-processing.com were trained with train_classifier.py.