Python 3 Text Processing with NLTK 3 Cookbook

Stemming words

Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while improving retrieval, because a query for one form of a word also matches documents that contain its other forms.

One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words, and its usage in NLTK will be covered in the next section.

Note

The resulting stem is not always a valid word. For example, the stem of cookery is cookeri. This is a feature, not a bug.

How to do it...

NLTK comes with an implementation of the Porter stemming algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'
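To make the search-engine use case from the introduction concrete, here is a minimal sketch of an index keyed by stems. The documents dictionary and the search() helper are hypothetical and exist only for illustration; a real search engine would also handle tokenization, punctuation, and ranking.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus: document ID -> text.
documents = {
    1: "cooking pasta at home",
    2: "she cooks every day",
    3: "a cooked meal",
}

# Map each stem to the set of documents that contain any form of the word.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[stemmer.stem(word)].add(doc_id)


def search(query):
    """Return the sorted IDs of documents that share the query's stem."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


print(search("cook"))  # [1, 2, 3]
```

Because cooking, cooks, and cooked all reduce to the stem cook, the index needs only one entry for the three forms, and a query for any one of them finds all three documents.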

How it works...

The PorterStemmer class knows a number of regular word forms and suffixes and uses this knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, which has the same root meaning.
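The idea of transforming a word through a series of steps can be illustrated with a deliberately simplified sketch. This is not the Porter algorithm: the rule lists, the vowel check, and the toy_stem() name are all assumptions made for illustration, and the real algorithm has many more rules and conditions.

```python
import re

# Hypothetical, heavily simplified rule set. Each inner list is one step,
# and at most one rule from each step is applied to the word.
STEPS = [
    [(r"sses$", "ss"), (r"ies$", "i"), (r"(?<!s)s$", "")],  # plurals
    [(r"ing$", ""), (r"ed$", "")],                          # verb endings
    [(r"y$", "i")],                                         # final y
]


def toy_stem(word):
    """Apply each step in order; within a step, the first usable rule wins."""
    for rules in STEPS:
        for pattern, replacement in rules:
            candidate, count = re.subn(pattern, replacement, word)
            # Only accept a rewrite that leaves at least one vowel behind,
            # so that a word such as 'sing' is not reduced to 's'.
            if count and re.search(r"[aeiou]", candidate):
                word = candidate
                break
    return word


print(toy_stem("cooking"))   # cook
print(toy_stem("cookery"))   # cookeri
print(toy_stem("caresses"))  # caress
print(toy_stem("sing"))      # sing
```

Even this toy version shows why a stem such as cookeri is not a valid word: the rules rewrite suffixes mechanically and do not consult a dictionary.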

There's more...

There are other stemming algorithms out there besides the Porter stemming algorithm, such as the Lancaster stemming algorithm, developed at Lancaster University. NLTK includes it as the LancasterStemmer class. At the time of writing this book, there is no definitive research demonstrating the superiority of one algorithm over the other. However, the Porter stemming algorithm is generally the default choice.

All the stemmers covered next inherit from the StemmerI interface, which defines the stem() method. The inheritance hierarchy is as follows:

    StemmerI
     |-- PorterStemmer
     |-- LancasterStemmer
     |-- RegexpStemmer
     `-- SnowballStemmer
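Because every stemmer shares the StemmerI interface, calling code can treat them interchangeably. The following sketch assumes that StemmerI is importable from nltk.stem.api; the stem_all() helper is hypothetical.

```python
from nltk.stem import (
    LancasterStemmer,
    PorterStemmer,
    RegexpStemmer,
    SnowballStemmer,
)
from nltk.stem.api import StemmerI


def stem_all(stemmer, words):
    """Stem every word with any object that implements StemmerI."""
    return [stemmer.stem(word) for word in words]


stemmers = [
    PorterStemmer(),
    LancasterStemmer(),
    RegexpStemmer('ing'),
    SnowballStemmer('english'),
]

for stemmer in stemmers:
    # Each class is a StemmerI, so stem_all() works with all of them.
    print(type(stemmer).__name__, isinstance(stemmer, StemmerI),
          stem_all(stemmer, ['cooking', 'cookery']))
```

Writing code against stem() rather than against a specific class makes it easy to swap one algorithm for another later.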

The LancasterStemmer class

The LancasterStemmer class is used just like the PorterStemmer class, but it can produce slightly different results. It is known to be slightly more aggressive than the PorterStemmer class:

>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
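To get a feel for where the two algorithms differ on your own vocabulary, you can run them side by side. This is only a sketch: the word list is arbitrary, and the output is left for you to inspect rather than listed here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words; replace them with terms from your own corpus.
words = ['cooking', 'cookery', 'maximum', 'presumably', 'provision']

# One row per word: (word, Porter stem, Lancaster stem).
rows = [(word, porter.stem(word), lancaster.stem(word)) for word in words]

print(f"{'word':<12}{'porter':<12}lancaster")
for word, porter_stem, lancaster_stem in rows:
    print(f'{word:<12}{porter_stem:<12}{lancaster_stem}')
```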

The RegexpStemmer class

You can also construct your own stemmer using the RegexpStemmer class. It takes a single regular expression (either compiled or as a string) and removes every part of the word that matches the expression, whether the match is a prefix, a suffix, or in the middle of the word:

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('ingleside')
'leside'
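The ingleside result shows that an unanchored pattern also matches at the start of a word. If the goal is to strip only a trailing ing, one option is to anchor the pattern with $ and, optionally, to pass the min argument so that short words are left alone. This is a minimal sketch; the threshold of 5 is an arbitrary value chosen for illustration.

```python
from nltk.stem import RegexpStemmer

# '$' anchors the pattern to the end of the word, so a leading 'ing' survives.
suffix_only = RegexpStemmer('ing$')
print(suffix_only.stem('cooking'))    # cook
print(suffix_only.stem('ingleside'))  # ingleside

# Words shorter than min characters are returned unchanged.
guarded = RegexpStemmer('ing$', min=5)
print(guarded.stem('sing'))     # sing
print(guarded.stem('cooking'))  # cook
```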

The RegexpStemmer class should only be used in very specific cases that are not covered by the PorterStemmer or the LancasterStemmer class, because it can only handle very specific patterns and is not a general-purpose algorithm.

The SnowballStemmer class

The SnowballStemmer class supports 13 non-English languages. It also provides two English stemmers: the original Porter algorithm as well as the newer English stemming algorithm. To use the SnowballStemmer class, create an instance with the name of the language you are using and then call the stem() method. The following lists all the supported languages and then stems a word with the Spanish SnowballStemmer class:

>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
'hol'
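The two English stemmers can give different results for the same word, and asking for a language that is not in SnowballStemmer.languages raises a ValueError. The sketch below checks the language first; the make_stemmer() helper is hypothetical.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language):
    """Return a SnowballStemmer, or None if the language is not supported."""
    if language not in SnowballStemmer.languages:
        return None
    return SnowballStemmer(language)


english = make_stemmer('english')  # the newer English algorithm
porter = make_stemmer('porter')    # the original Porter algorithm

print(english.stem('generously'))  # generous
print(porter.stem('generously'))   # gener
print(make_stemmer('klingon'))     # None
```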

See also

In the next recipe, we will cover lemmatization, which is quite similar to stemming, but subtly different.