Natural Language Processing with Java Cookbook
上QQ阅读APP看书,第一时间看更新

There's more...

If you have specialized lemmatization needs, then you will need to create a training file. The training data file consists of a series of lines. Each line consists of three entries separated by spaces. The first entry contains a word. The second entry is the POS tag for the word. The third entry is the lemma for the word.

For example, in en-lemmatizer.dict, there are several lines for variations of the word bump, as shown in the following code:

bump    NN         bump
bump VB bump
bump VBP bump
bumped VBD bump
bumped VBN bump
bumper JJ bumper
bumper NN bumper

As you can see, a word may be used in different contexts and with different suffixes. Other datasets can be used for training. These include the Penn Treebank (https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu/~treebank/) and the CoNLL 2009 datasets (https://www.ldc.upenn.edu/).

Training parameters other than the default parameters can be specified depending on the needs of the problem.

In the next recipe, Determining the lexical meaning of a word using OpenNLP, we will use the model to develop and determine the lexical meaning of a word.