Indexes in relational databases are a very imporatant feature, that reduce the cost of our lookup queries. In the last post on the basics of indexes in PostgreSQL, we covered the fundamentals and saw how we can create an index on a table and measure it’s impact on our queries. In this post, we will take a dive into the inner workings and some implmentation details of the most used index type in PostgreSQL  the BTree index.
What is BTree?
From Wikipedia’s page on BTree:
In computer science, a Btree is a selfbalancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
Awesome, right? Basically, it’s a data structure that sorts itself. That’s why it’s selfbalancing  it chooses it’s shape on it’s own.
The Btree is a generalization of a binary search tree in that a node can have more than two children. Unlike selfbalancing binary search trees, the B tree is optimized for systems that read and write large blocks of data.
Unlike regular binary trees, the BTree can have multiple leaves, which it balances on it’s own. Also, it’s implementations are I/O optimized, which makes them suitable for database indexes.
Btrees are a good example of a data structure for external memory. It is commonly used in databases and filesystems.
Now, there’s plenty to know about BTrees. Knowing how they work is pretty interesting, so let’s check it out.
Functionality
As a disclaimer before we start with the BTree as a selfbalancing data structure, I would like to inform you that there is a lot of CS theory about the BTree, which we will not cover in this section. For example, you might want to look first at binary trees, 23trees and 234trees before you dive into BTrees. Nevertheless, what we will cover here will be sufficient for you to understand how the BTree index works.
Having that out of the way, think of a binary tree. In binary trees, each node can have a maximum of two children, hence the name  binary.
A Binary Tree
Well, a BTree is a tree where each node can have multiple children, or better said, a BTree can have N children. While in binary search trees each node can have one value, BTrees have the concept of keys. Keys are like a list of values, that each of the nodes will hold.
BTrees also have the concept of order, where for BTrees an order of 3 means that each nonleaf node of the tree can have a maximum of 3 children. Having that in mind, this means that each node can have two (31) keys.
Confusing? Well, think about this: on a nonleaf node with keys 5, 10
, you can
add three nodes:
 one node, with values smaller than 5
 one node, with values between 5 and 10
 and one node, with values larger than 10
Let’s draw it out:
The most important thing about BTrees is their balaning aspect. The concept revolves on the fact that each node has keys, like in the example above. The way BTrees balance themselves is really interesting, and the keys are the most important aspect of this this functionality.
Basically, whenever a new item (or, in our case, a number) is added, the BTree finds the appropriate place (or, node) for the item to go in. For example, if we want to add the number 6, the BTree will “ask the root node”, in what node should it push the number in. “Asking” is nothing more than comparing the new number with the keys of the node. Since the number 6 is larger then 5, but smaller then the number 10 (which are the root node keys), it will create a new node just below the root node:
With this mechanism, the BTree is always ordered and looking up a value in it is rather cheap. There are multiple implementations of BTrees. For this post, it’s nice to know that PostgreSQL uses the BTree implementation of the “Lehman and Yao’s highconcurrency Btree management algorithm”. You can read the actual paper here{:target="_blank"}.
But, how is this relevant to the BTree indexes in PostgreSQL?
BTrees and PostgreSQL
Indexes in PostgreSQL, simply put, are a replica of the data on the column(s) that is/are indexed. The only difference is in the data order  the replica of the data is sorted, which allows PostgreSQL to quickly find and retrieve the data. For example, when you search for a record in a table, where the column by which you are searching is indexed, the index decreases the cost of the query because PostgreSQL looks up in the index and can easily find the location of the data on disk.
The BTree data structure falls really nice into place, when you recall that the index is ordered. Under the hood, indexes are BTrees, but really big ones. Due to the nature of the BTree data structure, whenever a new record is added on the indexed table it knows how to rebalance and keep the order of the records in it.
Limitations
Almost in all use cases, the power of indexes is noticable on large amounts of data. This means that the indexes will have to be as big as the actual data tables. Or does it?
Imagine if we are dealing with billions of records. This means that the index
tables will have the billions of records stored in an ordered fashion. Okay,
PostgreSQL could handle that. But, can you imagine how long would an INSERT
command take? Adding the record in the data table will take really long,
because the index will have to add the new record in the correct place, to keep
the order of the indexes. Due to this limitation, the implementation of the
BTree index keeps page files, which simply put, are nodes on a big BTree
data structure.
Although each index is a whole, this paging mechanism adds a separation of the data in the index, while still keeping the order. In that case, instead of dumping the whole index into memory just to add a single record, PostgreSQL finds the page where the new record should be added and writes the indexed value into it.
My explanation is quite abstract, but the aim of this article was to introduce the BTree data structure and how it falls into place with PostgreSQL. If you would like to read more details on the implementation, check Discovering the Computer Science Behind Postgres Indexes from Pat Shaughnessy. Actually, if I knew of the existence of his article before I started writing this article, I might have not even written it.
Using a BTree index
Having the workings of BTree index aside, let’s see how to use it. The command to add an index to a column is:


Or, since the btree
index is the default one, we can omit the USING
part of
the command:


This will create a BTree index on the name
column in the table
table.
Outro
In this article we took an overview on one of the most popular indexes in PostgreSQL. We saw what is the difference between Binary Search Trees and BTrees, and how their behaviour translates into PostgreSQL. Take note, there is a ton of detail that I had to omit due to the length of the article and the target audience.
Links
If you would like to dig in deeper into BTree, whether the index or the data structure, here are some useful links:
 Efficient Locking for Concurrent Operations on BTrees
 BTrees
 Anatomy of an SQL index
 BTree on Wikipedia