How it works...
The first code sequence split the text into sentences using a simple set of delimiters: a period, a question mark, and an exclamation mark, which are the most common sentence terminators. The delimiters were written as a regular expression character class and passed as the argument to the split method, which was applied to the sample text. The method returned an array of strings, one for each sentence it identified:
String sentenceDelimiters = "[.?!]";
String[] sentences = text.split(sentenceDelimiters);
for (String sentence : sentences) {
    System.out.println(sentence);
}
This sequence did not handle numbers well. The number 56.32 was split into two integer values, which appear on lines 4 and 5 of the output of step 5 in the previous section. In addition, each dot of an ellipsis was treated as the end of a separate sentence. Both results follow directly from using a period as a delimiter. Understanding these limitations lets us decide where the technique is applicable.
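The behavior described above can be reproduced with a small, self-contained sketch. The sample string and the names SplitLimitations and naiveSplit are illustrative assumptions, not the recipe's original input or code; only the delimiter expression is taken from the recipe.

```java
class SplitLimitations {
    // Splits on every period, question mark, or exclamation mark,
    // using the same delimiter expression as the recipe.
    static String[] naiveSplit(String text) {
        return text.split("[.?!]");
    }

    public static void main(String[] args) {
        // Hypothetical sample text chosen to contain a decimal number and an ellipsis.
        String sample = "It cost 56.32 dollars. Wait... Really?";
        for (String piece : naiveSplit(sample)) {
            // Brackets make empty pieces visible in the output.
            System.out.println("[" + piece + "]");
        }
    }
}
```

For this sample, the decimal number is broken into "It cost 56" and "32 dollars", and the ellipsis yields two empty strings because nothing lies between its dots. Note that split discards trailing empty strings, so an ellipsis at the very end of the text would not produce empty entries.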
The next sequence used a simple regular expression to isolate sentences. The expression was passed to the compile method of the Pattern class, which returned a Pattern instance. That instance's matcher method created a Matcher object for the sample text. Each call to the Matcher object's find method advanced to the next match, and its group method returned the matched sentence, which was then displayed:
Pattern sentencePattern = Pattern.compile("\\s+[^.!?]*[.!?]");
Matcher matcher = sentencePattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}
The output from this code sequence also has problems. These are ultimately traceable to the overly simplistic regular expression used in the code: every match must begin with whitespace, and every match ends at the first period, question mark, or exclamation mark encountered, so a decimal number is still broken apart. Other regular expression variations perform better. One better-performing regular expression for sentences is found at https://stackoverflow.com/questions/5553410/regular-expression-match-a-sentence.
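As one illustration of such a variation, the sketch below splits only on whitespace that directly follows a sentence terminator. This is an assumed pattern written for this example, not the expression from the Stack Overflow answer, and the names BoundarySplit and splitSentences are hypothetical.

```java
import java.util.regex.Pattern;

class BoundarySplit {
    // Assumed pattern: a boundary is a run of whitespace preceded by '.', '!' or '?'.
    // The lookbehind is zero-width, so the punctuation stays with its sentence.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[.!?])\\s+");

    static String[] splitSentences(String text) {
        return BOUNDARY.split(text.trim());
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not the recipe's original input.
        String sample = "It cost 56.32 dollars. Wait... Really? Yes!";
        for (String sentence : splitSentences(sample)) {
            System.out.println(sentence);
        }
    }
}
```

Because no whitespace follows the period in 56.32, the number stays intact, and an ellipsis is split only after its last dot. The approach is still limited: an abbreviation such as "Mr." followed by a space would be treated as the end of a sentence.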