Stemming words
Stemming is a technique for removing affixes from a word, leaving only the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.
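To make the indexing idea concrete, here is a minimal sketch of a stem-keyed index. The build_stem_index function, the sample documents, and the whitespace tokenization are assumptions made for illustration, not part of NLTK or of any real search engine:

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def build_stem_index(documents):
    """Map each stem to the set of document ids containing any form of it."""
    stemmer = PorterStemmer()
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace tokenization, used only to keep the sketch short.
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


# Hypothetical documents, invented for illustration.
docs = {1: "cooking dinner", 2: "she cooks", 3: "he cooked"}
index = build_stem_index(docs)

# The query word is stemmed the same way before the lookup, so one
# index entry serves every form of the word.
matches = index[PorterStemmer().stem("cooking")]
print(sorted(matches))
```

Because all three forms of the word share one stem, the index needs a single entry for them instead of three.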
One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words, and its usage in NLTK will be covered in the next section.
Note
The resulting stem is not always a valid word. For example, the stem of cookery is cookeri. This is a feature, not a bug.
How to do it...
NLTK comes with an implementation of the Porter stemming algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem:
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'
How it works...
The PorterStemmer class knows a number of regular word forms and suffixes and uses this knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, which has the same root meaning.
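The idea of applying suffix rules in a series of steps can be sketched with a toy stemmer. This is not the Porter algorithm: the rule tables, the toy_stem function, and the min_stem_length parameter are all invented for illustration, and the real algorithm uses more steps and inspects the shape of the remaining stem before a rule fires:

```python
# Illustrative rule tables only (assumed for this sketch, not Porter's).
STEPS = [
    # Step 1: plural endings.
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],
    # Step 2: common verb endings.
    [("ing", ""), ("ed", "")],
    # Step 3: longer derivational suffixes.
    [("ational", "ate"), ("iveness", "ive")],
]


def toy_stem(word, min_stem_length=3):
    """Apply at most one suffix rule per step, in order."""
    for rules in STEPS:
        for suffix, replacement in rules:
            remainder = word[: len(word) - len(suffix)]
            if word.endswith(suffix) and len(remainder) >= min_stem_length:
                word = remainder + replacement
                break  # only the first matching rule in a step fires
    return word


for example in ["cooking", "caresses", "ponies", "sing"]:
    print(example, "->", toy_stem(example))
```

Each step sees the output of the previous one, which is why a word such as cookings would lose its s in the first step and its ing in the second.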
There's more...
There are other stemming algorithms besides the Porter stemming algorithm, such as the Lancaster stemming algorithm, developed at Lancaster University. NLTK includes it as the LancasterStemmer class. At the time of writing this book, there is no definitive research demonstrating the superiority of one algorithm over the other. However, the Porter stemming algorithm is generally the default choice.
All the stemmers covered next inherit from the StemmerI interface, which defines the stem() method. The following is an inheritance diagram that explains this:
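The same relationship can be checked directly in code. The sketch below confirms that the stemmers in this recipe are StemmerI instances and then defines a custom subclass. SuffixStemmer is a hypothetical class written for this example, not something NLTK provides:

```python
from nltk.stem import (LancasterStemmer, PorterStemmer, RegexpStemmer,
                       SnowballStemmer)
from nltk.stem.api import StemmerI

stemmers = [PorterStemmer(), LancasterStemmer(),
            RegexpStemmer('ing'), SnowballStemmer('english')]
all_are_stemmers = all(isinstance(s, StemmerI) for s in stemmers)
print(all_are_stemmers)


class SuffixStemmer(StemmerI):
    """Hypothetical stemmer that strips one fixed suffix."""

    def __init__(self, suffix):
        self.suffix = suffix

    def stem(self, token):
        # Leave the token alone if stripping would remove the whole word.
        if (self.suffix and token.endswith(self.suffix)
                and len(token) > len(self.suffix)):
            return token[: -len(self.suffix)]
        return token


custom = SuffixStemmer('ing')
print(custom.stem('cooking'))
```

Any object that implements stem() this way can be swapped in wherever the other stemmers are used.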
The LancasterStemmer class
The LancasterStemmer class is used in the same way as the PorterStemmer class, but can produce slightly different results. It is known to be slightly more aggressive than the PorterStemmer class:
>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
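To see where the two algorithms diverge on your own vocabulary, a side-by-side comparison such as the following sketch can help. The word list is arbitrary and chosen only for illustration. Outputs are not shown here because they can vary between NLTK versions:

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words; replace them with terms from your own corpus.
words = ['cooking', 'cookery', 'maximum', 'presumably']

# Map each word to its (Porter stem, Lancaster stem) pair.
comparison = {w: (porter.stem(w), lancaster.stem(w)) for w in words}

for word, (porter_stem, lancaster_stem) in comparison.items():
    print(f'{word:12} porter={porter_stem:12} lancaster={lancaster_stem}')
```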
The RegexpStemmer class
You can also construct your own stemmer using the RegexpStemmer class. It takes a single regular expression (either compiled or as a string) and removes every part of the word that matches the expression, including prefixes and suffixes:
>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('ingleside')
'leside'
The RegexpStemmer class should only be used in very specific cases that are not covered by the PorterStemmer or the LancasterStemmer class because it can only handle very specific patterns and is not a general-purpose algorithm.
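If you only want to strip a suffix, one option is to anchor the pattern with $ so that a match at the start of a word, as in ingleside, is left alone. The sketch below also uses the min argument of RegexpStemmer, which leaves words shorter than the given length untouched. The pattern and the threshold of 4 are arbitrary choices for illustration:

```python
from nltk.stem import RegexpStemmer

# Only a trailing 'ing' is removed; words under four characters are skipped.
suffix_stemmer = RegexpStemmer('ing$', min=4)

print(suffix_stemmer.stem('cooking'))    # trailing 'ing' is stripped
print(suffix_stemmer.stem('ingleside'))  # leading 'ing' no longer matches
print(suffix_stemmer.stem('ing'))        # too short, returned unchanged
```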
The SnowballStemmer class
The SnowballStemmer class supports 13 non-English languages. It also provides two English stemmers: the original Porter algorithm as well as the new English stemming algorithm. To use the SnowballStemmer class, create an instance with the name of the language you are using and then call the stem() method. Here is a list of all the supported languages and an example using the Spanish SnowballStemmer class:
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
u'hol'
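The two English options are selected by name in the same way. The following sketch creates both and also shows that a name outside SnowballStemmer.languages is rejected. The name 'klingon' is a deliberately unsupported value chosen for illustration:

```python
from nltk.stem import SnowballStemmer

english_stemmer = SnowballStemmer('english')  # the newer English algorithm
porter_stemmer = SnowballStemmer('porter')    # the original Porter algorithm

print(english_stemmer.stem('cooking'))
print(porter_stemmer.stem('cooking'))

# An unsupported language name raises ValueError at construction time.
unsupported_rejected = False
try:
    SnowballStemmer('klingon')
except ValueError as error:
    unsupported_rejected = True
    print(error)
```

Checking a name against SnowballStemmer.languages before constructing the stemmer avoids the exception when the language comes from user input.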
See also
In the next recipe, we will cover lemmatization, which is quite similar to stemming, but subtly different.