How to do it...
- Create a new Python file and import the following packages:

    import numpy as np
    from nltk.corpus import brown
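If the Brown corpus has not been downloaded yet, NLTK raises a LookupError the first time it is accessed. A minimal one-time fix (an addition, not part of the original recipe) is:

    import nltk
    nltk.download('brown')  # downloads the Brown corpus into the local NLTK data directory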
- Define a function that splits the input text into chunks:

    # Split a text into chunks of num_of_words words each
    def splitter(content, num_of_words):
        words = content.split(' ')
        result = []
- Initialize the variables that track the chunk currently being built:

        current_count = 0
        current_words = []
- Iterate over the words, adding each one to the current chunk:

        for word in words:
            current_words.append(word)
            current_count += 1
- Once the chunk contains the required number of words, store it and reset the tracking variables:

            if current_count == num_of_words:
                result.append(' '.join(current_words))
                current_words = []
                current_count = 0
- Append any remaining words as the final chunk and return the output:

        result.append(' '.join(current_words))
        return result
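To see how the function behaves, here is a small usage example with a hypothetical seven-word sentence (the sample text is an assumption, not part of the recipe):

    # Split a 7-word sentence into chunks of 3 words
    sample = 'the quick brown fox jumps over everything'
    print(splitter(sample, 3))
    # ['the quick brown', 'fox jumps over', 'everything']

Note that the last chunk holds the leftover words, so it can be shorter than the others.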
- Load the Brown corpus and take the first 10,000 words:

    if __name__ == '__main__':
        # Read the data from the Brown corpus
        content = ' '.join(brown.words()[:10000])
- Define the number of words in each chunk:

        # Number of words in each chunk
        num_of_words = 1600
- Initialize the chunk list and a counter:

        chunks = []
        counter = 0
- Call the splitter function and print the number of chunks it returns:

        num_text_chunks = splitter(content, num_of_words)
        print("Number of text chunks =", len(num_text_chunks))
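As an optional sanity check (an addition, not part of the original recipe), you can also print the size of each chunk:

        # Optional: print the word count of every chunk
        print([len(chunk.split(' ')) for chunk in num_text_chunks])
        # [1600, 1600, 1600, 1600, 1600, 1600, 400]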
- The result obtained after chunking is shown below:
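The original screenshot is not reproduced here, but the output follows from the numbers above: 10,000 words split into 1,600-word chunks give six full chunks plus one 400-word remainder, so the script prints:

    Number of text chunks = 7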