Natural Language Processing with Java Cookbook
上QQ阅读APP看书,第一时间看更新

How it works...

The example started with the declaration of a sample sentence. The program will return a list of words found in the sentence with the stop words removed, as shown in the following code:

String sentence = 
"The blue goose and a quiet lamb stopped to smell the roses.";

An instance of LingPipe's IndoEuropeanTokenizerFactory is used to provide a means of tokenizing the sentence. It is used as the argument to the EnglishStopTokenizerFactory constructor, which provides a stop word tokenizer, as shown in the following code:

TokenizerFactory tokenizerFactory =
IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory = new EnglishStopTokenizerFactory(tokenizerFactory);

The tokenizer method is invoked against the sentence, where its second and third parameters specify which part of the sentence to tokenize. The Tokenizer class represents the tokens extracted from the sentence:

Tokenizer tokenizer = tokenizerFactory.tokenizer(
sentence.toCharArray(), 0, sentence.length());

The Tokenizer class implements the Iterable<String> interface that we utilized in the following for-each statement to display the tokens:

for (String token : tokenizer) {
System.out.println(token);
}

Note that in the following duplicated output, the first word of the sentence, The, was not removed, nor was there a terminating period. Otherwise, common stop words were removed, as shown in the following code:

The
blue
goose
quiet
lamb
stopped
smell
roses
.