Replacing words matching regular expressions
Now, we are going to get into the process of replacing words. If stemming and lemmatization are a kind of linguistic compression, then word replacement can be thought of as error correction or text normalization.
In this recipe, we will replace words based on regular expressions, with a focus on expanding contractions. Remember when we were tokenizing words in Chapter 1, Tokenizing Text and WordNet Basics, and it was clear that most tokenizers had trouble with contractions? This recipe aims to fix this by replacing contractions with their expanded forms, for example, by replacing "can't" with "cannot" or "would've" with "would have".
Getting ready
Understanding how this recipe works will require a basic knowledge of regular expressions and the re module. The key things to know are matching patterns and the re.sub() function.
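If you need a quick refresher, here is a minimal sketch of re.sub() in action; the toy patterns below are illustrative only and are not part of the recipe:

>>> import re
>>> re.sub(r'colour', 'color', 'colour and colours')
'color and colors'
>>> # A parenthesized group can be reused in the replacement via \g<1>
>>> re.sub(r'(\w+)\'ve', r'\g<1> have', "should've")
'should have'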
How to do it...
First, we need to define a number of replacement patterns. This will be a list of tuple pairs, where the first element is the pattern to match with and the second element is the replacement.
Next, we will create a RegexpReplacer class that will compile the patterns and provide a replace() method to substitute all the found patterns with their replacements.
The following code can be found in the replacers.py module in the book's code bundle and is meant to be imported, not typed into the console:
import re

# Each pair is (pattern, replacement). Order matters: the specific
# contractions at the top are tried before the generic group patterns.
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would'),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each pattern once so replace() can reuse them.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply every pattern in order, accumulating the substitutions.
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
How it works...
Here is a simple usage example:
>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer()
>>> replacer.replace("can't is a contraction")
'cannot is a contraction'
>>> replacer.replace("I should've done that thing I didn't do")
'I should have done that thing I did not do'
The RegexpReplacer.replace() method works by replacing every instance of a replacement pattern with its corresponding substitution. In replacement_patterns, we have defined tuples such as r'(\w+)\'ve' and r'\g<1> have'. The first element matches one or more word characters followed by 've. By grouping the characters before 've in parentheses, a match group is captured and can be used in the substitution pattern with the \g<1> reference. So, we keep everything before 've, then replace 've with the word have. This is how should've can become should have.
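This is also why the order of replacement_patterns matters: won't is listed before the generic (\w+)n\'t rule, because if the generic rule matched first, the group would capture wo and produce the wrong expansion. A quick illustrative check, not part of replacers.py:

>>> import re
>>> re.sub(r'(\w+)n\'t', r'\g<1> not', "won't")
'wo not'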
There's more...
This replacement technique can work with any kind of regular expression, not just contractions. So, you can replace any occurrence of & with and, or eliminate all occurrences of - by replacing it with an empty string. The RegexpReplacer class can take any list of replacement patterns for whatever purpose.
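For example, here is a sketch that passes a custom pattern list to the constructor; these two patterns are illustrative and are not part of the book's replacers.py:

>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer(patterns=[(r'&', 'and'), (r'-', '')])
>>> replacer.replace('rock & roll e-mail')
'rock and roll email'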
Replacement before tokenization
Let's try using the RegexpReplacer class as a preliminary step before tokenization:
>>> from nltk.tokenize import word_tokenize
>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer()
>>> word_tokenize("can't is a contraction")
['ca', "n't", 'is', 'a', 'contraction']
>>> word_tokenize(replacer.replace("can't is a contraction"))
['can', 'not', 'is', 'a', 'contraction']
Much better! By expanding the contractions first, the tokenizer produces cleaner results. Cleaning up the text before processing is a common pattern in natural language processing.
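If you use this pattern often, the two steps can be wrapped in a small helper. This is a sketch of my own, not from the book's code bundle, and the name replace_and_tokenize is hypothetical:

from nltk.tokenize import word_tokenize
from replacers import RegexpReplacer

replacer = RegexpReplacer()

def replace_and_tokenize(text):
    # Expand contractions first, then tokenize the normalized text.
    return word_tokenize(replacer.replace(text))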
See also
For more information on tokenization, see the first three recipes in Chapter 1, Tokenizing Text and WordNet Basics. For more replacement techniques, continue reading the rest of this chapter.