Using indexes in an intelligent way
So far, applying an index feels like the Holy Grail, which always improves performance magically. However, this is not the case. Indexes can also be pretty pointless in some cases.
Before digging into things more deeply, here is the data structure we are using for this example. Remember that there are only two distinct names and unique IDs:
test=# \d t_test
Table "public.t_test"
Column | Type | Modifiers
--------+---------+------------------------------------
id | integer | not null default nextval('t_test_id_seq'::regclass)
name | text |
Indexes:
"idx_id" btree (id)
At this point, one index has been defined, which covers the id column. In the next step, the name column will be queried. Before doing that, an index on name will be created:
test=# CREATE INDEX idx_name ON t_test (name);
CREATE INDEX
Now it is time to see if the index is used correctly:
test=# EXPLAIN SELECT * FROM t_test WHERE name = 'hans2';
QUERY PLAN
-------------------------------------------------------------
Index Scan using idx_name on t_test (cost=0.43..4.45 rows=1 width=9)
Index Cond: (name = 'hans2'::text)
(2 rows)
As expected, PostgreSQL will decide on using the index. Most users would expect that. But note that my query says hans2. Remember: hans2 does not exist in the table and the query plan perfectly reflects this. rows=1 indicates that the planner only expects a very small subset of data being returned by the query.
Let us see what happens if we look for more data:
test=# EXPLAIN SELECT * FROM t_test WHERE name = 'hans' OR name = 'paul';
QUERY PLAN
--------------------------------------------------------------
Seq Scan on t_test (cost=0.00..81622.00 rows=3000011 width=9)
Filter: ((name = 'hans'::text) OR (name = 'paul'::text))
(2 rows)
In this case, PostgreSQL will go for a straight sequential scan. Why is that? Why is the system ignoring all indexes? The reason is simple: hans and paul make up the entire dataset because there are no other values. Therefore, PostgreSQL figures that the entire table has to be read anyway. There is no reason to read all of the index and the full table if reading just the table is sufficient.
In other words, PostgreSQL will not use an index just because there is one. PostgreSQL will use indexes when they make sense. If the number of rows is smaller, PostgreSQL will again consider bitmap scans and normal index scans:
test=# EXPLAIN SELECT * FROM t_test WHERE name = 'hans2' OR name = 'paul2';
QUERY PLAN
----------------------------------------------------------
Bitmap Heap Scan on t_test (cost=8.88..12.89 rows=1 width=9)
Recheck Cond: ((name = 'hans2'::text) OR (name = 'paul2'::text))
-> BitmapOr (cost=8.88..8.88 rows=1 width=0)
-> Bitmap Index Scan on idx_name (cost=0.00..4.44 rows=1 width=0)
Index Cond: (name = 'hans2'::text)
-> Bitmap Index Scan on idx_name (cost=0.00..4.44 rows=1 width=0)
Index Cond: (name = 'paul2'::text)
The most important point to learn here is that execution plans depend on input values. They are not static and not independent of the data inside the table. This is a very important observation, which has to be kept in mind all the time. In real-world examples, the fact that plans change can often be the reason for unpredictable runtimes.