Python's built-in types
Python provides a great set of datatypes. This is true for both numeric types and collections. Regarding the numeric types, there is nothing special about their syntax. There are, of course, some differences in defining literals of each type and some (maybe) not well-known details regarding operators, but there aren't many choices left for developers. Things change when it comes to collections and strings. Despite the "there should be only one way to do something" mantra, the Python developer is really left with plenty of choices. Some of the code patterns that seem intuitive and simple to beginners are often considered non-Pythonic by experienced programmers because they are either inefficient or simply too verbose.
Such Pythonic patterns for solving common problems (called idioms by many programmers) may often seem like mere aesthetics. Nothing could be further from the truth. Most idioms are driven by how Python is implemented internally and by how built-in structures and modules work. Knowing more of these details is essential for a good understanding of the language. Also, the community itself is not free from myths and stereotypes about how things in Python work. Only by digging deeper yourself will you be able to tell which of the popular statements about Python are really true.
Strings and bytes
The topic of strings may cause some confusion for programmers who are used to programming only in Python 2. In Python 3, there is only one datatype capable of storing textual information. It is str or, simply, string. It is an immutable sequence that stores Unicode code points. This is the major difference from Python 2, where str represents byte strings, something that is now handled by bytes objects (but not in exactly the same way).
Strings in Python are sequences. This single fact should be enough to include them in the section covering other container types, but they differ from other container types in one important detail. Strings have a very specific limitation on the type of data they can store, and that is Unicode text.
bytes and its mutable alternative (bytearray) differ from str by allowing only bytes as sequence values, that is, integers in the range 0 <= x < 256. This may be confusing at the beginning, because, when printed, they may look very similar to strings:
>>> print(bytes([102, 111, 111]))
b'foo'
The true nature of bytes and bytearray is revealed when they are converted to another sequence type, such as list or tuple:
>>> list(b'foo bar')
[102, 111, 111, 32, 98, 97, 114]
>>> tuple(b'foo bar')
(102, 111, 111, 32, 98, 97, 114)
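The same integer nature shows up when indexing. A minimal sketch of the difference between indexing and slicing a bytes object:

```python
data = b'foo'

# Indexing a bytes object returns an integer (the byte's value),
# while slicing returns a new bytes object.
print(data[0])    # 102, the value of the byte for 'f'
print(data[0:1])  # b'f'
```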
A lot of Python 3 controversy was about breaking backwards compatibility for string literals and how Unicode is dealt with. Starting from Python 3.0, every un-prefixed string literal is Unicode. So, literals enclosed by single quotes ('), double quotes ("), or groups of three quotes (single or double) without any prefix represent the str datatype:
>>> type("some string")
<class 'str'>
In Python 2, Unicode literals required the u prefix (as in u"some string"). This prefix is still allowed for backward compatibility (starting from Python 3.3), but it does not hold any syntactic meaning in Python 3.
Bytes literals were already presented in some of the previous examples, but let's explicitly present their syntax for the sake of consistency. Bytes literals are also enclosed by single quotes, double quotes, or triple quotes, but must be preceded by a b or B prefix:
>>> type(b"some bytes")
<class 'bytes'>
Note that there are no bytearray literals in the Python syntax.
Last but not least, Unicode strings contain "abstract" text that is independent of its byte representation. This makes them unable to be saved on disk or sent over the network without encoding to binary data. There are two ways to encode string objects into byte sequences:
- Using the str.encode(encoding, errors) method, which encodes the string using a registered codec. The codec is specified using the encoding argument, and, by default, it is 'utf-8'. The second argument, errors, specifies the error handling scheme. It can be 'strict' (the default), 'ignore', 'replace', 'xmlcharrefreplace', or any other registered handler (refer to the built-in codecs module documentation).
- Using the bytes(source, encoding, errors) constructor, which creates a new bytes sequence. When the source is of the str type, the encoding argument is obligatory and does not have a default value. The usage of the encoding and errors arguments is the same as for the str.encode() method.
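As an illustration, here is a small sketch of both ways of encoding, including one non-strict error handler (the sample string is just an example assumption):

```python
text = "naïve"

# str.encode() with the default 'utf-8' codec
encoded = text.encode()
print(encoded)  # b'na\xc3\xafve'

# the ASCII codec cannot represent 'ï'; the 'replace' handler
# substitutes '?' instead of raising UnicodeEncodeError
print(text.encode('ascii', 'replace'))  # b'na?ve'

# the bytes() constructor requires an explicit encoding for str sources
print(bytes(text, 'utf-8'))  # b'na\xc3\xafve'
```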
Binary data represented by bytes can be converted to a string in analogous ways:
- Using the bytes.decode(encoding, errors) method, which decodes the bytes using the codec registered for the encoding. The arguments of this method have the same meaning and defaults as the arguments of str.encode().
- Using the str(source, encoding, errors) constructor, which creates a new string instance. Similar to the bytes() constructor, the encoding argument in the str() call has no default value and must be provided if a bytes sequence is used as the source.
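A quick round-trip sketch showing both decoding styles:

```python
raw = b'na\xc3\xafve'

# bytes.decode() defaults to 'utf-8'
print(raw.decode())       # naïve

# str() requires the encoding argument when given a bytes source;
# without it, str() returns the repr of the bytes object instead
print(str(raw, 'utf-8'))  # naïve
print(str(raw))           # b'na\xc3\xafve' (probably not what you want)
```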
Tip
Naming – bytes versus byte string
Due to changes made in Python 3, some people tend to refer to bytes instances as byte strings. This is mostly due to historical reasons: bytes in Python 3 is the sequence type that is closest to the str type from Python 2 (but not the same). Still, a bytes instance is a sequence of bytes and does not need to represent textual data. So, in order to avoid any confusion, it is advisable to always refer to them as either bytes or a byte sequence despite their similarities to strings. The concept of strings is reserved for textual data in Python 3, and that is now always str.
Python strings are immutable. This is also true of byte sequences. This is an important fact because it has both advantages and disadvantages. It also affects how strings should be handled in Python efficiently. Thanks to immutability, strings can be used as dictionary keys or set collection elements because, once initialized, they will never change their value. On the other hand, whenever a modified string is required (even with only a tiny modification), a completely new instance needs to be created. Fortunately, bytearray, as a mutable version of bytes, does not introduce such an issue. Byte arrays can be modified in-place (without the need to create new objects) through item assignments and can be dynamically resized exactly like lists, using appends, pops, inserts, and so on.
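For instance, a bytearray can be mutated without creating new objects (a small sketch):

```python
buffer = bytearray(b'foo')

# item assignment works in place; values must be integers in range(256)
buffer[0] = ord('b')
print(buffer)          # bytearray(b'boo')

# list-like dynamic resizing: append, extend, pop, insert, and so on
buffer.extend(b'st')
print(bytes(buffer))   # b'boost'
```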
The fact that Python strings are immutable poses some problems when multiple string instances need to be joined together. As stated before, concatenating immutable sequences results in the creation of a new sequence object. Consider a new string built by the repeated concatenation of multiple strings, as follows:
s = ""
for substring in substrings:
    s += substring
This results in a quadratic runtime cost in the total string length. In other words, it is highly inefficient. For handling such situations, there is the str.join() method. It accepts an iterable of strings as its argument and returns a joined string. Because it is a method, the actual idiom uses an empty string literal as the source of the method:
s = "".join(substrings)
The string providing this method will be used as a separator between joined substrings; consider the following example:
>>> ','.join(['some', 'comma', 'separated', 'values'])
'some,comma,separated,values'
It is worth remembering that just because it is faster (especially for large lists), it does not mean that the join() method should be used in every situation where two strings need to be concatenated. Despite being a widely recognized idiom, it does not improve code readability, and readability counts! There are also some situations where join() may not perform as well as ordinary concatenation through addition. Here are some examples:
- If the number of substrings is small and they are not already contained in some iterable, then in some cases the overhead of creating a new sequence just to perform the concatenation can overshadow the gain of using join().
- When concatenating short literals, thanks to constant folding in CPython, some complex literals (not only strings) such as 'a' + 'b' + 'c' can be translated into a shorter form ('abc') at compile time. Of course, this is enabled only for constants (literals) that are relatively short.
Ultimately, if the number of strings is known beforehand, the best readability of string concatenation is ensured by proper string formatting, using either the str.format() method or the % operator. In code sections where performance is not critical, or where the gain from optimizing string concatenation would be very small, string formatting is recommended as the best alternative.
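Both formatting styles mentioned above can be sketched as follows:

```python
name, count = "spam", 3

# the str.format() method
message = "{} appears {} times".format(name, count)
print(message)  # spam appears 3 times

# the older % operator produces the same result
print("%s appears %d times" % (name, count))
```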
Tip
Constant folding and peephole optimizer
CPython uses the peephole optimizer on compiled source code in order to improve performance. This optimizer implements a number of common optimizations directly on Python's bytecode. As mentioned, constant folding is one such feature. The resulting constants are limited in length by a hardcoded value. In Python 3.5, it is still invariably equal to 20. Anyway, this particular detail is rather a curiosity than something that can be relied on in day-to-day programming. Information about other interesting optimizations performed by the peephole optimizer can be found in the Python/peephole.c file of Python's source code.
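Constant folding can be observed by inspecting the constants of a compiled code object. Note that this is a CPython implementation detail, not something to rely on in portable code:

```python
# compile a constant-only expression and inspect its constants;
# in CPython the peephole optimizer folds 'a' + 'b' + 'c' into
# the single constant 'abc' at compile time
code = compile("'a' + 'b' + 'c'", "<example>", "eval")
print(code.co_consts)  # the folded 'abc' appears here in CPython
```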
Collections
Python provides a good selection of built-in data collections that allow you to efficiently solve many problems, if you choose wisely. Types that you probably already know are those that have dedicated literals:
- Lists
- Tuples
- Dictionaries
- Sets
Python is of course not limited to these four and it extends the list of possible choices through its standard library. In many cases, the solution to a problem may be as simple as making a good choice for data structure. This part of the book aims to ease such a decision by providing deeper insight into the possible options.
The two most basic collection types in Python are lists and tuples, and they both represent sequences of objects. The basic difference between them should be obvious to anyone who has spent more than a few hours with Python: lists are dynamic and can change their size, while tuples are immutable (they cannot be modified after they are created).
Tuples, despite having many optimizations that make allocation/deallocation of small objects fast, are the recommended datatype for structures where the position of an element is information by itself. For example, a tuple may be a good choice for storing a pair of (x, y) coordinates. Anyway, details regarding tuples are rather uninteresting. The only important thing about them in the scope of this chapter is that tuple is immutable and thus hashable. What this means will be covered later, in the Dictionaries section. More interesting than tuple is its dynamic counterpart, list, how it really works, and how to deal with it efficiently.
Many programmers easily confuse Python's list type with the concept of linked lists, found often in the standard libraries of other languages such as C, C++, or Java. In fact, CPython lists are not linked lists at all. In CPython, lists are implemented as variable-length arrays. This should also be true for other implementations, such as Jython and IronPython, although such implementation details are often not documented in those projects. The reasons for such confusion are clear. This datatype is named list and also has an interface that could be expected from any linked list implementation.
Why is it important and what does it mean? Lists are one of the most popular data structures and the way they are used greatly affects every application's performance. Also, CPython is the most popular and used implementation, so knowing its internal implementation details is crucial.
In detail, a list in Python is a contiguous array of references to other objects. The pointer to this array and its length are stored in the list's head structure. This means that every time an item is added or removed, the array of references needs to be resized (reallocated). Fortunately, in Python, these arrays are created with exponential over-allocation, so not every operation requires a resize. This is how the amortized cost of appending and popping elements can be low in terms of complexity. Unfortunately, some other operations that are considered "cheap" in ordinary linked lists have relatively high computational complexity in Python:
- Inserting an item at an arbitrary place using the list.insert method: complexity O(n)
- Deleting an item using list.remove or the del statement: complexity O(n)
Here, n is the length of the list. At least retrieving or setting an element using an index is an operation whose cost is independent of the list's size. Here is a full table of average time complexities for most of the list operations:

Operation       Average complexity
Copy            O(n)
Append          O(1)
Insert          O(n)
Get item        O(1)
Set item        O(1)
Delete item     O(n)
Iteration       O(n)
Get slice       O(k)
Set slice       O(k+n)
Extend          O(k)
Sort            O(n log n)
Get length      O(1)
For situations where a real linked list is needed (or, simply, a data structure with O(1) appends and pops at each end), Python provides deque in the collections built-in module. It is a generalization of stacks and queues and should work fine anywhere a doubly linked list is required.
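A short sketch of deque usage:

```python
from collections import deque

queue = deque([1, 2, 3])

# appends and pops at both ends have O(1) amortized cost
queue.appendleft(0)
queue.append(4)
print(queue)            # deque([0, 1, 2, 3, 4])
print(queue.popleft())  # 0
print(queue.pop())      # 4
```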
As you probably know, writing a piece of code such as this is painful:
>>> evens = []
>>> for i in range(10):
...     if i % 2 == 0:
...         evens.append(i)
...
>>> evens
[0, 2, 4, 6, 8]
This may work for C, but it actually makes things slower for Python because:
- It makes the interpreter work on each loop iteration to determine which part of the sequence has to be changed
- It makes you keep a counter to track which element has to be treated
- It requires an additional function lookup at every iteration, because append() is a list's method
A list comprehension is the correct answer to this pattern. It uses language features that automate parts of the previous syntax:
>>> [i for i in range(10) if i % 2 == 0]
[0, 2, 4, 6, 8]
Besides being more efficient, this form is shorter and involves fewer elements. In a bigger program, this means fewer bugs and code that is easier to read and understand.
Tip
List comprehensions and internal array resize
There is a myth among some Python programmers that the list comprehensions can be a workaround for the fact that the internal array representing the list object must be resized with every few additions. Some say that the array will be allocated once in just the right size. Unfortunately, this isn't true.
The interpreter, during evaluation of the comprehension, can't know how big the resulting container will be, and it can't preallocate the final size of the array for it. Due to this, the internal array is reallocated in the same pattern as it would be in the for loop. Still, in many cases, list creation using comprehensions is both cleaner and faster than using ordinary loops.
Another typical example of a Python idiom is the use of enumerate. This built-in function provides a convenient way to get an index while a sequence is iterated in a loop. Consider the following piece of code as an example:
>>> i = 0
>>> for element in ['one', 'two', 'three']:
...     print(i, element)
...     i += 1
...
0 one
1 two
2 three
This can be replaced by the following code, which is shorter:
>>> for i, element in enumerate(['one', 'two', 'three']):
...     print(i, element)
...
0 one
1 two
2 three
When the elements of multiple lists (or any iterables) need to be aggregated in a one-by-one fashion, the built-in zip() function may be used. This is a very common pattern for uniform iteration over two same-sized iterables:
>>> for item in zip([1, 2, 3], [4, 5, 6]):
...     print(item)
...
(1, 4)
(2, 5)
(3, 6)
Note that the result of zip() can be reversed by another zip() call:
>>> for item in zip(*zip([1, 2, 3], [4, 5, 6])):
...     print(item)
...
(1, 2, 3)
(4, 5, 6)
Another popular syntax element is sequence unpacking. It is not limited to lists and tuples and will work with any sequence type (even strings and byte sequences). It allows you to unpack a sequence of elements into another set of variables as long as there are as many variables on the left-hand side of the assignment operator as the number of elements in the sequence:
>>> first, second, third = "foo", "bar", 100
>>> first
'foo'
>>> second
'bar'
>>> third
100
Unpacking also allows you to capture multiple elements in a single variable using starred expressions as long as it can be interpreted unambiguously. Unpacking can also be performed on nested sequences. This can come in handy especially when iterating on some complex data structures built of sequences. Here are some examples of more complex unpacking:
>>> # starred expression to capture rest of the sequence
>>> first, second, *rest = 0, 1, 2, 3
>>> first
0
>>> second
1
>>> rest
[2, 3]
>>> # starred expression to capture middle of the sequence
>>> first, *inner, last = 0, 1, 2, 3
>>> first
0
>>> inner
[1, 2]
>>> last
3
>>> # nested unpacking
>>> (a, b), (c, d) = (1, 2), (3, 4)
>>> a, b, c, d
(1, 2, 3, 4)
Dictionaries are one of the most versatile data structures in Python. dict allows mapping a set of unique keys to values, as follows:
{
    1: 'one',
    2: 'two',
    3: 'three',
}
Dictionary literals are a very basic thing and you should already know them. Anyway, Python allows programmers to also create a new dictionary using comprehensions similar to the list comprehensions mentioned earlier. Here is a very simple example:
squares = {number: number**2 for number in range(100)}
What is important is that the same benefits of using list comprehensions apply to dictionary comprehensions. So, in many cases, they are more efficient, shorter, and cleaner. For more complex code, when many if statements or function calls are required to create a dictionary, a simple for loop may be a better choice, especially if it improves readability.
For programmers new to Python 3, there is one important note about iterating over dictionary elements. The dictionary methods keys(), values(), and items() no longer have lists as their return types. Also, their counterparts iterkeys(), itervalues(), and iteritems(), which returned iterators instead, are missing in Python 3. Now, what keys(), values(), and items() return are view objects:
- keys(): This returns the dict_keys object that provides a view on all the keys of a dictionary
- values(): This returns the dict_values object that provides a view on all the values of a dictionary
- items(): This returns the dict_items object that provides a view on all the (key, value) two-tuples of a dictionary
View objects provide a view on the dictionary content in a dynamic way, so every time the dictionary changes, the views will reflect these changes, as shown in this example:
>>> words = {'foo': 'bar', 'fizz': 'bazz'}
>>> items = words.items()
>>> words['spam'] = 'eggs'
>>> items
dict_items([('spam', 'eggs'), ('fizz', 'bazz'), ('foo', 'bar')])
View objects combine the behavior of the lists returned by the old methods with the iterators returned by their "iter" counterparts. Views do not need to redundantly store all values in memory (like lists do), but still allow getting their length (using len()) and testing membership (using the in keyword). Views are, of course, iterable.
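The following sketch shows these properties of views:

```python
words = {'foo': 'bar', 'fizz': 'bazz'}
keys = words.keys()

# views support len() and membership testing without copying
print(len(keys))      # 2
print('foo' in keys)  # True

# and they reflect later changes to the dictionary
words['spam'] = 'eggs'
print(len(keys))      # 3
```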
The last important thing is that the views returned by the keys() and values() methods ensure the same order of keys and values. In Python 2, you could not modify the dictionary content between these two calls if you wanted to ensure the same order of retrieved keys and values. dict_keys and dict_values are now dynamic, so even if the content of a dictionary changes between the keys() and values() calls, the order of iteration is consistent between these two views.
CPython uses hash tables with pseudo-random probing as an underlying data structure for dictionaries. It seems like a very deep implementation detail, but it is very unlikely to change in the near future, so it is also a very interesting fact for the programmer.
Due to this implementation detail, only objects that are hashable can be used as dictionary keys. An object is hashable if it has a hash value that never changes during its lifetime and can be compared to other objects. Every built-in Python type that is immutable is also hashable. Mutable types, such as lists, dictionaries, and sets, are not hashable, and so they cannot be used as dictionary keys. The protocol that defines whether a type is hashable consists of two methods:
- __hash__: This provides the hash value (as an integer) that is needed by the internal dict implementation. For objects that are instances of user-defined classes, it is derived from their id().
- __eq__: This compares whether two objects have the same value. All objects that are instances of user-defined classes compare unequal by default, except to themselves.
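As an illustration, here is a minimal user-defined class that implements both methods so that its instances can be used as dictionary keys (the class and attribute names are made up for the example):

```python
class Point:
    """A 2D point, immutable by convention, usable as a dict key."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # derive the hash from the value, not from id()
        return hash((self.x, self.y))

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)


labels = {Point(0, 0): "origin"}
# equal values hash equally, so an equal instance finds the same entry
print(labels[Point(0, 0)])  # origin
```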
Two objects that compare equal must have the same hash value. The reverse does not need to be true. This means that hash collisions are possible: two objects with the same hash may not be equal. This is allowed, and every Python implementation must be able to resolve hash collisions. CPython uses open addressing to resolve such collisions (https://en.wikipedia.org/wiki/Open_addressing). Still, the probability of collisions greatly affects performance, and if it is high, the dictionary will not benefit from its internal optimizations.
While the three basic operations (adding, getting, and deleting an item) have an average time complexity of O(1), their amortized worst-case complexities are a lot higher: O(n), where n is the current dictionary size. Additionally, if user-defined class objects are used as dictionary keys and they are hashed improperly (with a high risk of collisions), this will have a huge negative impact on the dictionary's performance. The full table of CPython's time complexities for dictionaries is as follows:

Operation       Average complexity    Amortized worst-case complexity
Get item        O(1)                  O(n)
Set item        O(1)                  O(n)
Delete item     O(1)                  O(n)
Copy            O(n)                  O(n)
Iteration       O(n)                  O(n)
It is also important to know that the n number in worst-case complexities for copying and iterating the dictionary is the maximum size that the dictionary ever achieved, rather than the current item count. In other words, iterating over the dictionary that once was huge but has greatly shrunk in time may take a surprisingly long time. So, in some cases, it may be better to create a new dictionary object if it has to be iterated often instead of just removing elements from the previous one.
One of the common pitfalls of using dictionaries is that they do not preserve the order in which new keys were added. In some scenarios, when the dictionary keys are consecutive values whose hashes are also consecutive (for example, when using integers), the resulting order might be the same due to the internal implementation of dictionaries:
>>> {number: None for number in range(5)}.keys()
dict_keys([0, 1, 2, 3, 4])
Still, using other datatypes that hash differently shows that the order is not preserved. Here is an example in CPython:
>>> {str(number): None for number in range(5)}.keys()
dict_keys(['1', '2', '4', '0', '3'])
>>> {str(number): None for number in reversed(range(5))}.keys()
dict_keys(['2', '3', '1', '4', '0'])
As shown in the preceding code, the resulting order depends both on the hashing of the objects and on the order in which the elements were added. This is not something that can be relied on, because it can vary with different Python implementations.
Still, in some cases, the developer might need dictionaries that preserve the order of additions. Fortunately, the Python standard library provides an ordered dictionary called OrderedDict in the collections module. It optionally accepts an iterable as the initialization argument:
>>> from collections import OrderedDict
>>> OrderedDict((str(number), None) for number in range(5)).keys()
odict_keys(['0', '1', '2', '3', '4'])
It also has some additional features, such as popping items from both ends using the popitem() method, or moving a specified element to one of the ends using the move_to_end() method. A full reference on that collection is available in the Python documentation (refer to https://docs.python.org/3/library/collections.html).
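A short sketch of these OrderedDict features:

```python
from collections import OrderedDict

ordered = OrderedDict([('a', 1), ('b', 2), ('c', 3)])

# move_to_end() relocates a key to the end (or the front with last=False)
ordered.move_to_end('a')
print(list(ordered))                # ['b', 'c', 'a']

# popitem() pops from the end by default, or from the front with last=False
print(ordered.popitem(last=False))  # ('b', 2)
```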
The other important note is that in very old code bases, dict may be used as a primitive set implementation that ensures the uniqueness of elements. While this will give proper results, it should be avoided unless Python versions lower than 2.3 are targeted. Using dictionaries this way is wasteful in terms of resources. Python has a built-in set type that serves this purpose. In fact, it has a very similar internal implementation to dictionaries in CPython, but also offers some additional features as well as specific set-related optimizations.
Sets are a very robust data structure that is useful mostly in situations where the order of elements is not as important as their uniqueness and the efficiency of testing whether an element is contained in the collection. They are very similar to the analogous mathematical concept. Sets are provided as built-in types in two flavors:
- set(): This is a mutable, non-ordered, finite collection of unique, immutable (hashable) objects
- frozenset(): This is an immutable, hashable, non-ordered collection of unique, immutable (hashable) objects
The immutability of frozenset() makes it possible to use it as a dictionary key and also as an element of other set() and frozenset() collections. A plain mutable set() cannot be used within another set or frozenset, as this will raise TypeError:
>>> set([set([1,2,3]), set([2,3,4])])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
The following set initializations are completely correct:
>>> set([frozenset([1,2,3]), frozenset([2,3,4])])
{frozenset({1, 2, 3}), frozenset({2, 3, 4})}
>>> frozenset([frozenset([1,2,3]), frozenset([2,3,4])])
frozenset({frozenset({1, 2, 3}), frozenset({2, 3, 4})})
Mutable sets can be created in three ways:
- Using a set() call that accepts an optional iterable as the initialization argument, such as set([0, 1, 2])
- Using a set comprehension, such as {element for element in range(3)}
- Using set literals, such as {1, 2, 3}
Note that using literals and comprehensions for sets requires extra caution, because they are very similar in form to dictionary literals and comprehensions. Also, there is no literal for empty set objects: empty curly brackets {} are reserved for empty dictionary literals.
Sets in CPython are very similar to dictionaries. As a matter of fact, they are implemented like dictionaries with dummy values, where only keys are actual collection elements. Also, sets exploit this lack of values in mapping for additional optimizations.
Thanks to this, sets allow very fast additions, deletions, and checking for element existence with the average time complexity equal to O(1). Still, since the implementation of sets in CPython relies on a similar hash table structure, the worst-case complexity for these operations is O(n), where n is the current size of a set.
Other implementation details also apply. The item to be included in a set must be hashable, and if instances of user-defined classes in a set are hashed poorly, this will have a negative impact on the performance.
Every data structure has its shortcomings. There is no single collection that can suit every problem, and the four basic types (tuple, list, set, and dictionary) are still not a wide range of choices. These are the most basic and important collections that have a dedicated literal syntax. Fortunately, Python provides many more options in its standard library through the collections built-in module. One of them (deque) was already mentioned. Here are the most important collections provided by this module:
- namedtuple(): This is a factory function for creating tuple subclasses whose indexes can be accessed as named attributes
- deque: This is a double-ended queue, a list-like generalization of stacks and queues with fast appends and pops on both ends
- ChainMap: This is a dictionary-like class used to create a single view of multiple mappings
- Counter: This is a dictionary subclass for counting hashable objects
- OrderedDict: This is a dictionary subclass that preserves the order in which entries were added
- defaultdict: This is a dictionary subclass that can supply missing values using a provided default
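A few of these can be sketched in use as follows:

```python
from collections import namedtuple, Counter, defaultdict

# namedtuple: a tuple subclass with named attribute access
Coords = namedtuple('Coords', ['x', 'y'])
p = Coords(1, 2)
print(p.x, p.y)  # 1 2

# Counter: counting hashable objects
print(Counter('abracadabra').most_common(1))  # [('a', 5)]

# defaultdict: missing keys get a value from the supplied factory
groups = defaultdict(list)
groups['vowels'].append('a')
print(groups['vowels'])  # ['a']
```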
Note
More details on selected collections from the collections module and some advice on where it is worth using them are provided in Chapter 12, Optimization – Some Powerful Techniques.