Natural Language Processing with Java Cookbook


How to do it...

Let's go through the following steps:

Create a file called training-data.train. Add the following to the file:

The first sentence is terminated by a period<SPLIT>. We will want to be able to identify tokens that are separated by something other than whitespace<SPLIT>. This can include commas<SPLIT>, numbers such as 100.204<SPLIT>, and other punctuation characters including colons:<SPLIT>.
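To make the annotation concrete, the following minimal sketch shows which tokens one annotated line describes. Whitespace already separates most tokens, and <SPLIT> marks a token boundary where the raw text has no whitespace. This is an illustration only, not OpenNLP's own parser; the class SplitMarkerDemo and its methods are hypothetical names, and the sample line is a shortened version of the training text:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: this is not how OpenNLP parses training data internally.
final class SplitMarkerDemo {

    // Returns the tokens that one annotated training line describes.
    static List<String> tokens(String annotatedLine) {
        List<String> result = new ArrayList<>();
        // A boundary is either a run of whitespace or an explicit <SPLIT> marker.
        for (String piece : annotatedLine.trim().split("\\s+|<SPLIT>")) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    // Returns the raw text with the markers removed, as it would appear unannotated.
    static String rawText(String annotatedLine) {
        return annotatedLine.replace("<SPLIT>", "");
    }

    public static void main(String[] args) {
        String line = "numbers such as 100.204<SPLIT>, and more<SPLIT>.";
        System.out.println(rawText(line));
        System.out.println(tokens(line));
    }
}
```

Note that 100.204 stays a single token because no <SPLIT> marker appears inside it, while the comma and the final period become tokens of their own.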

Next, add the following imports to the program:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

Next, add the following code to the project's main method. It creates an InputStreamFactory that supplies the training data:

InputStreamFactory inputStreamFactory = new InputStreamFactory() {
    public InputStream createInputStream() 
            throws FileNotFoundException {
        return new FileInputStream(
            "C:/NLP Cookbook/Code/chapter2a/training-data.train");
    }
};

Add the following try-with-resources block after the factory. It trains the model and saves it:

try (
    ObjectStream<String> stringObjectStream = 
        new PlainTextByLineStream(inputStreamFactory, "UTF-8");
    ObjectStream<TokenSample> tokenSampleStream = 
        new TokenSampleStream(stringObjectStream);) {
  
    TokenizerModel tokenizerModel = TokenizerME.train(
        tokenSampleStream, new TokenizerFactory(
            "en", null, true, null), 
            TrainingParameters.defaultParams());
    // try-with-resources flushes and closes the model file after writing
    try (BufferedOutputStream modelOutputStream = 
            new BufferedOutputStream(new FileOutputStream(
                new File(
                    "C:/NLP Cookbook/Code/chapter2a/mymodel.bin")))) {
        tokenizerModel.serialize(modelOutputStream);
    }
} catch (IOException ex) {
    // Handle exception
}

To test the new model, we will reuse the code found in the Tokenization using OpenNLP recipe. Add the following code after the preceding try block:

String sampleText = "In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
        // Load the model from the same directory it was saved to
        new File("C:/NLP Cookbook/Code/chapter2a", "mymodel.bin"));) {
    TokenizerModel tokenizerModel = 
        new TokenizerModel(modelInputStream);
    Tokenizer tokenizer = new TokenizerME(tokenizerModel);
    String[] tokenList = tokenizer.tokenize(sampleText);
    for (String token : tokenList) {
        System.out.println(token);
    }
} catch (FileNotFoundException e) {
    // Handle exception
} catch (IOException e) {
    // Handle exception
}

When you execute the program, you will get output similar to the following. Some of the training output has been removed to save space:

Indexing events with TwoPass using cutoff of 5

Computing event counts... done. 36 events
 Indexing... done.
Sorting and merging events... done. Reduced 36 events to 12.
Done indexing in 0.21 s.
Incorporating indexed data for training... 
done.
 Number of Event Tokens: 12
 Number of Outcomes: 2
 Number of Predicates: 9
...done.
Computing model parameters ...
Performing 100 iterations.
 1: ... loglikelihood=-24.95329850015802 0.8611111111111112
 2: ... loglikelihood=-14.200654164477221 0.8611111111111112
 3: ... loglikelihood=-11.526745527757855 0.8611111111111112
 4: ... loglikelihood=-9.984657035211438 0.8888888888888888
...
 97: ... loglikelihood=-0.7805227945549726 1.0
 98: ... loglikelihood=-0.7730211829010772 1.0
 99: ... loglikelihood=-0.765664507836384 1.0
100: ... loglikelihood=-0.7584485899716518 1.0
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.
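
Two figures in the training log above can be checked by hand. The log reports 36 events and 2 outcomes. Assuming the trainer starts from a model that treats both outcomes as equally likely (an assumption, not something the recipe states), the log-likelihood at iteration 1 would be 36 × ln(1/2), which is consistent with the reported value of about -24.953. The second number on each line reads as the fraction of the 36 training events classified correctly. The sketch below does the arithmetic; the class and method names are hypothetical:

```java
// Arithmetic check of the training log; not part of the OpenNLP API.
final class TrainingLogCheck {

    // Log-likelihood of `events` events when each of `outcomes` outcomes is equally likely.
    static double uniformLogLikelihood(int events, int outcomes) {
        return events * Math.log(1.0 / outcomes);
    }

    // Number of correctly classified events implied by a reported accuracy.
    static long correctEvents(double accuracy, int events) {
        return Math.round(accuracy * events);
    }

    public static void main(String[] args) {
        // Compare with the loglikelihood reported at iteration 1.
        System.out.println(uniformLogLikelihood(36, 2));
        // Accuracy at iterations 1 to 3, then at iteration 4.
        System.out.println(correctEvents(0.8611111111111112, 36));
        System.out.println(correctEvents(0.8888888888888888, 36));
    }
}
```

Under this reading, the model classifies 31 of the 36 events correctly in the first three iterations and 32 at iteration 4, and an accuracy of 1.0 by iteration 97 means all 36 training events are classified correctly. With a training set this small, that figure says little about how the model will behave on unseen text.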