Rewriting an immutable string
How can we rewrite an immutable string? We can't change individual characters inside a string:
>>> title = "Recipe 5: Rewriting, and the Immutable String"
>>> title[8] = ''
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
Since this doesn't work, how do we make a change to a string?
Getting ready
Let's assume we have a string like this:
>>> title = "Recipe 5: Rewriting, and the Immutable String"
We'd like to do two transformations:
- Remove the part up to the :
- Replace the punctuation with _, and make all the characters lowercase
Since we can't replace characters in a string object, we have to work out some alternatives. There are several common things we can do, shown as follows:
- A combination of slicing and concatenating a string to create a new string.
- When shortening, we often use the partition() method.
- We can replace a character or a substring with the replace() method.
- We can expand the string into a list of characters, then join the string back into a single string again. This is the subject of a separate recipe, Building complex strings with a list of characters.
How to do it...
Since we can't update a string in place, we have to replace the string variable's object with each modified result. We'll use an assignment statement that looks something like this:
some_string = some_string.method()
Or we could even use an assignment like this:
some_string = some_string[:chop_here]
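As a minimal sketch of this rebinding pattern (the variable names and the sample value here are illustrative assumptions, not part of the recipe), the following shows that each assignment attaches the variable to a new string while the original object stays untouched:

```python
# Hypothetical sample data, used only to illustrate rebinding.
some_string = "Immutable String"
original = some_string            # a second name for the same string object

# A string method returns a new string; we rebind the variable to it.
some_string = some_string.upper()

# A slice also produces a new string; chop_here is an arbitrary position.
chop_here = 9
some_string = some_string[:chop_here]

print(some_string)   # the shortened, uppercased string
print(original)      # the original object was never changed
```

Because `original` still refers to the first string, we can confirm that neither the method call nor the slice altered it.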
We'll look at a few specific variations of this general theme. We'll slice a piece of a string, we'll replace individual characters within a string, and we'll apply blanket transformations such as making the string lowercase. We'll also look at ways to remove extra _ that show up in our final string.
Slicing a piece of a string
Here's how we can shorten a string via slicing:
- Find the boundary:
>>> colon_position = title.index(':')
The index() method locates a particular substring and returns the position where that substring can be found. If the substring doesn't exist, it raises a ValueError exception. After this statement, the following expression is always true: title[colon_position] == ':'.
- Pick the substring:
>>> discard, post_colon = title[:colon_position], title[colon_position+1:]
>>> discard
'Recipe 5'
>>> post_colon
' Rewriting, and the Immutable String'
We've used the slicing notation to show the start:end of the characters to pick. We also used multiple assignment to assign two variables, discard and post_colon, from the two expressions.
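The boundary search can fail when the substring is absent. The sketch below, which assumes a '#' character that never appears in our title, contrasts index() with the related find() method, which returns -1 instead of raising an exception:

```python
title = "Recipe 5: Rewriting, and the Immutable String"

# index() returns the position of the first match.
colon_position = title.index(':')
assert title[colon_position] == ':'

# index() raises ValueError when the substring is missing.
try:
    title.index('#')
    index_failed = False
except ValueError:
    index_failed = True

# find() reports a missing substring with -1 instead of an exception.
find_result = title.find('#')

print(colon_position, index_failed, find_result)
```

Note that -1 is also a valid index (the last character), so a find() result should be checked before it is used in a slice.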
We can use partition() instead of manual slicing. Find the boundary and partition:
>>> pre_colon_text, _, post_colon_text = title.partition(':')
>>> pre_colon_text
'Recipe 5'
>>> post_colon_text
' Rewriting, and the Immutable String'
The partition() method returns three things: the part before the target, the target, and the part after the target. We used multiple assignment to assign each object to a different variable. We assigned the target to a variable named _ because we're going to ignore that part of the result. This is a common idiom for places where we must provide a variable, but we don't care about using the object.
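Unlike index(), partition() does not raise an exception when the target is missing. A short sketch, using a made-up string without a colon, shows what comes back in that case:

```python
# Hypothetical input with no ':' in it.
text = "No colon here"

before, separator, after = text.partition(':')

# With no match, the whole string lands in the first part and the
# other two parts are empty strings.
print(repr(before), repr(separator), repr(after))

# An empty separator is a convenient test for "target not found".
found = bool(separator)
```

Checking the middle item is one way to decide whether the split actually happened.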
Updating a string with a replacement
We can use a string's replace() method to create a new string with punctuation marks replaced. When using replace() to switch punctuation marks, save the result back into the original variable, in this case post_colon_text:
>>> post_colon_text = post_colon_text.replace(' ', '_')
>>> post_colon_text = post_colon_text.replace(',', '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'
This has replaced the two kinds of punctuation with the desired _ characters. We can generalize this to work with all punctuation. This leverages the for statement, which we'll look at in Chapter 2, Statements and Syntax.
We can iterate through all punctuation characters:
>>> from string import whitespace, punctuation
>>> for character in whitespace + punctuation:
... post_colon_text = post_colon_text.replace(character, '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'
As each kind of punctuation character is replaced, we assign the latest and greatest version of the string to the post_colon_text variable.
We can also use a string's translate() method for this. This relies on creating a dictionary object that maps each source character's Unicode code point, as returned by ord(), to a resulting character:
>>> from string import whitespace, punctuation
>>> title = "Recipe 5: Rewriting an Immutable String"
>>> title.translate({ord(c): '_' for c in whitespace+punctuation})
'Recipe_5__Rewriting_an_Immutable_String'
We've created a mapping with {ord(c): '_' for c in whitespace+punctuation} to translate any character, c, in the whitespace+punctuation sequence of characters to the '_' character. This may have better performance than a sequence of individual character replacements.
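As an alternative sketch (not the approach shown above, but a standard-library equivalent), the str.maketrans() static method can build the translation table for us. It accepts single-character keys and converts them to code points, so we don't have to call ord() ourselves:

```python
from string import whitespace, punctuation

title = "Recipe 5: Rewriting an Immutable String"

# str.maketrans() turns single-character keys into the integer
# code points that translate() expects.
table = str.maketrans({c: '_' for c in whitespace + punctuation})

# The table can be built once and reused for many strings.
result = title.translate(table)
print(result)
```

Building the table once is helpful when the same cleanup has to be applied to many titles.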
Removing extra punctuation marks
In many cases, there are some additional steps we might follow. We often want to remove leading and trailing _ characters. We can use strip() for this:
>>> post_colon_text = post_colon_text.strip('_')
In some cases, we'll have multiple _ characters because we had multiple punctuation marks. The final step would be something like this to clean up multiple _ characters:
>>> while '__' in post_colon_text:
... post_colon_text = post_colon_text.replace('__', '_')
This is yet another example of the same pattern we've been using: we replace a variable's string with a modified copy, which has the same effect as modifying the string in place. This depends on the while statement, which we'll look at in Chapter 2, Statements and Syntax.
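To see why a loop is needed rather than a single replace(), consider this small sketch. The input is a made-up string with a run of three _ characters; the passes counter is added only for illustration:

```python
# Hypothetical intermediate result with leading, trailing, and
# repeated underscores.
text = "__Rewriting__and___the_Immutable_String_"

# strip('_') removes underscores from both ends only.
text = text.strip('_')

# One replace() pass turns '___' into '__', so we repeat until
# no doubled underscore remains.
passes = 0
while '__' in text:
    text = text.replace('__', '_')
    passes += 1

print(text, passes)
```

A run of three underscores survives the first pass as two, which is why the condition is tested again before stopping.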
How it works...
We can't—technically—modify a string in place. The data structure for a string is immutable. However, we can assign a new string back to the original variable. This technique behaves the same as modifying a string in place.
When a variable's value is replaced, the previous value may no longer have any references; if so, it becomes eligible for garbage collection. We can see the replacement by using the id() function to track each individual string object:
>>> id(post_colon_text)
>>> post_colon_text = post_colon_text.replace('_','-')
>>> id(post_colon_text)
Your actual ID numbers may be different. What's important is that the original string object assigned to post_colon_text had one ID. The new string object assigned to post_colon_text has a different ID. It's a new string object.
When the old string has no more references, it is removed from memory automatically.
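The following sketch repeats the id() experiment in script form. It deliberately keeps a second reference, here called old_object (a name introduced only for this illustration), so the original string stays alive and the two IDs can be compared safely:

```python
post_colon_text = "Rewriting_and_the_Immutable_String"

# Keep a second reference so the original object is not reclaimed
# while we compare identities.
old_object = post_colon_text
old_id = id(post_colon_text)

# replace() builds a new string; the variable is rebound to it.
post_colon_text = post_colon_text.replace('_', '-')
new_id = id(post_colon_text)

print(old_id != new_id)   # the variable now names a different object
print(old_object)         # the original string is unchanged
```

Without the extra reference, the old object could be freed and its ID reused, which is why IDs of dead objects should not be compared.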
We made use of slice notation to decompose a string. A slice has two parts: [start:end]. A slice always includes the starting index. String indices always start with zero as the first item. A slice never includes the ending index.
The items in a slice have an index from start to end-1. This is sometimes called a half-open interval.
Think of a slice like this: all characters where the index i is in the range start ≤ i < end.
We noted briefly that we can omit the start or end indices. We can actually omit both. Here are the various options available:
- title[colon_position]: A single item, that is, the : we found using title.index(':').
- title[:colon_position]: A slice with the start omitted. It begins at the first position, index of zero.
- title[colon_position+1:]: A slice with the end omitted. It ends at the end of the string, as if we said len(title).
- title[:]: Since both start and end are omitted, this is the entire string. For a mutable sequence such as a list, this idiom creates a copy. Because strings are immutable, no copy is needed, and CPython typically returns the same string object.
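The slicing rules above can be checked with a few assertions. This is a small sketch; the start and end values are chosen by hand to pick out one word of the title:

```python
title = "Recipe 5: Rewriting, and the Immutable String"
colon_position = title.index(':')

# Half-open interval: index 10 is included, index 19 is not.
start, end = 10, 19
word = title[start:end]
assert len(word) == end - start

# An omitted start means 0; an omitted end means len(title).
assert title[:colon_position] == title[0:colon_position]
assert title[colon_position+1:] == title[colon_position+1:len(title)]

# Omitting both yields a string equal to the whole title.
assert title[:] == title

print(word)
```

One consequence of the half-open rule is that the length of a slice is simply end minus start, as long as both positions fall inside the string.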
There's more...
There are more features for indexing in Python collections like a string. The normal indices start with 0 on the left. We have an alternate set of indices that use negative numbers that work from the right end of a string:
- title[-1] is the last character in the title, 'g'
- title[-2] is the next-to-last character, 'n'
- title[-6:] is the last six characters, 'String'
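As a brief sketch of how the two sets of indices relate, a negative index i refers to the same character as len(title) + i:

```python
title = "Recipe 5: Rewriting, and the Immutable String"

# -1 is the same position as len(title) - 1.
last = title[-1]
assert last == title[len(title) - 1]

# Negative values work in slices, too.
tail = title[-6:]
head = title[:-6]

# The two slices split the string at the same boundary.
assert head + tail == title

print(last, tail)
```

Splitting at the same boundary with [:-6] and [-6:] always gives two pieces that concatenate back to the original.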
We have a lot of ways to pick pieces and parts out of a string.
Python offers dozens of methods for working with a string. The Text Sequence Type — str section of the Python Standard Library describes the different kinds of transformations that are available to us. There are three broad categories of string methods: we can ask about the string, we can parse the string, and we can transform the string to create a new one. Methods such as isnumeric() tell us whether every character in a string is numeric.
Here's an example:
>>> 'some word'.isnumeric()
False
>>> '1298'.isnumeric()
True
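A few edge cases are worth knowing before relying on isnumeric() to validate input. The sample strings in this sketch are illustrative assumptions:

```python
# Signs and decimal points are not numeric characters.
negative = '-5'.isnumeric()
decimal_point = '3.14'.isnumeric()

# An empty string has no characters, so the test fails.
empty = ''.isnumeric()

# isnumeric() accepts Unicode numeric characters such as the
# vulgar fraction one half, which isdigit() rejects.
half_numeric = '½'.isnumeric()
half_digit = '½'.isdigit()

print(negative, decimal_point, empty, half_numeric, half_digit)
```

For checking that text can be converted with int(), attempting the conversion and handling ValueError is often a more direct test than any of these methods.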
Before doing comparisons, it can help to change a string so that it has a uniform case. It's frequently helpful to use the lower() method and assign the result back to the original variable:
>>> post_colon_text = post_colon_text.lower()
We've looked at parsing with the partition() method. We've also looked at transforming with the lower() method, as well as the replace() and translate() methods.
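Putting the steps of this recipe together, here is a minimal sketch of the whole transformation as one function. The name clean_title and the fallback for a missing colon are assumptions made for this illustration; the recipe itself works step by step at the prompt:

```python
from string import whitespace, punctuation

def clean_title(title: str) -> str:
    """Drop the text up to ':', replace punctuation, lowercase."""
    # Assumption: if there is no colon, keep the whole title.
    head, separator, tail = title.partition(':')
    text = tail if separator else head

    # Replace every whitespace or punctuation character with '_'.
    for character in whitespace + punctuation:
        text = text.replace(character, '_')

    # Trim the ends, then collapse repeated underscores.
    text = text.strip('_')
    while '__' in text:
        text = text.replace('__', '_')

    return text.lower()

cleaned = clean_title("Recipe 5: Rewriting, and the Immutable String")
print(cleaned)
```

Every statement inside the function follows the same pattern: compute a new string and assign it back to the text variable.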
See also
- We'll look at the string as list technique for modifying a string in the Building complex strings from lists of characters recipe.
- Sometimes, we have data that's only a stream of bytes. In order to make sense of it, we need to convert it into characters. That's the subject of the Decoding bytes – how to get proper characters from some bytes recipe.