Process of deriving tokens
Sentences are formed by a stream of words, and from a sentence we need to derive individual meaningful chunks, called tokens; the process of deriving tokens is called tokenization:
- The process of deriving tokens from a stream of text has two stages. If you have many paragraphs, you first need to perform sentence tokenization, then word tokenization, and then derive the meaning of the tokens.
- Tokenization and lemmatization are processes that are helpful for lexical analysis. Using the nltk library, we can perform tokenization and lemmatization.
- Tokenization can be defined as identifying the boundary of sentences or words.
- Lemmatization can be defined as a process that identifies the correct intended POS and meaning of words that are present in sentences.
- Lemmatization also includes POS tagging to disambiguate the meaning of the tokens. In this process, the context window is either phrase level or sentence level.
You can find the code at the GitHub link: https://github.com/jalajthanaki/NLPython/tree/master/ch3
The code snippet is shown in Figure 3.8:
Figure 3.8: Code snippet for tokenization
The output of the code in Figure 3.8 is shown in Figure 3.9:
Figure 3.9: Output of tokenization and lemmatization