Postgres - index types and usage scenarios

2024 Oct 13  |  7 min read  |  tags: postgres (2) dbms (2)

Types

Index creation template:

CREATE INDEX <index_name> ON <table_name> USING <index_type>(column_name);

Index types:

  • none - defaults to b-tree
  • hash
  • gist
  • spgist
  • gin
  • brin

B-Tree

In-depth explanation in the postgres source code - /src/backend/access/nbtree/README

The postgres b-tree index uses the balanced tree data structure. So, naturally the fetching time will be uniform for all values, as all the elements will be almost equidistant from the root of the tree.

Properties

  • Large index size.
  • Expensive to compute - tree has to be created, balanced

When to use

  • Default choice of index. Good for general purpose indexing. Works well for a wide range of data types.

When not to use

  • When the table is extremely large. B-tree are expensive. Are cpu and storage intensive to create and maintain.
  • When data is high velocity. The constant b-tree rebuild due to dml statements (update, delete) will use up a lot of cpu, and slow down the database.
  • Do not use when the data is highly skewed. Since b-tree uses a balanced tree, skewed data will trigger a lot of tree re-balancing. This is computationally expensive, and will also cause a lot of i/o as storage pages will be split and moved around.
CREATE INDEX btree_index_name ON your_table(column_name);

Supported operators

  • Search, insert, delete - single values or range of values
<   <=   =   >=   >

Hash

In-depth explanation in the postgres source code - /src/backend/access/hash/README

  • Generates a 32 bit hash code from the value of the indexed column.
  • Uses hash function to map key and values to locations. Perfect fit for fast exact-match/equality lookups. Time complexity to find a row is O(1).
  • Not good for range queries, sorting

When to use

  • You are performing exact-match lookups (=)
  • Column has high cardinality.

When not to use

  • The column has low cardinality
  • Your are performing operations other than exact-match lookups (other than =)

Notes on performance

Hash index doesn't like too many hash values in a single bucket. This is exactly like a normal hash map, with a linked list for each bucket. Hash index performs best when the data is uniformly distributed. In other words - the ratio of cardinality to number_of_rows is low.

When the number of rows is high, but column cardinality is low, the hash index performance will suffer massively.

Reason - This is because the bucket in the hashmap will end up with too many values. Postgres stores the buckets in pages (postgres stores everything in pages). With too many values in a bucket, the bucket will be spread over many pages. This will quickly ruin performance at scale.

CREATE INDEX hash_index_name ON your_table USING hash (column_name);

Supported operators

=

GIST - generalized search tree

In-depth explanation in the postgres source code - /src/backend/access/gist/README

Properties

  • Uses tree structure, but unlike B-Tree, can be customized for specific data types
  • Use for indexing multi-dimensional data.
  • Flexible index types. Useful for geometric and text data types

When to use

  • Use for spatial data types, text search, custom data types
  • Use when performing full text search on the column's data
CREATE INDEX gist_index_name ON your_table USING gist (column_name);

Supported operators

<<   &<   &>   >>   <<|   &<|   |&>   |>>   @>   <@   ~=   &&

SP-GIST - Space Partitioned Generalized Search Tree

In-depth explanation in the postgres source code - /src/backend/access/spgist/README

Properties

  • Extension of GIST. Divides index space into non-overlapping regions for efficient storage and retrieval

When to use

  • Column has space partitioned data like geographical regions, ranges
  • When gist index isn't fast enough, sp-gist can radically speed up queries
CREATE INDEX spgist_index_name ON your_table USING spgist (column_name);

Supported operators

<<   >>   ~=   <@   <<|   |>>

GIN - Generalized Inverted Index

In-depth explanation in the postgres source code - /src/backend/access/gin/README

More details by the maintainers - Gin for PostgreSQL

  • Stores inverted lists of keys and positions, enabling efficient searches for elements with these data structures

Properties

  • Computationally expensive to generate.
  • Can generate large index sizes. (as compared to b-tree indexes).

When to use

  • When the column is of JSONB type. Gin is the the preferred index for JSONB datatype.

When not to use

  • Table has a lot of updates. Re-calculating gin index is extremely expensive.
CREATE INDEX gin_index_name ON your_table USING gin (column_name);

Supported operators

<@   @>   =   &&

BRIN - Block Range Index

In-depth explanation in the postgres source code - /src/backend/access/brin/README

Brin index groups storage pages into blocks, and stores summary information about each block. The default page block size is 128.

When postgres is searching for a value using brin index, it can skip over entire blocks of pages if the index's min/max stats say that the value isn't in that particular page block.

Properties

  • Very space efficient. Generates extremely small indexes.
  • Very CPU efficient. Don't use up too much CPU to build/rebuild.

When to use

  • Use to optimize storage and query performance on very large tables where the data is ordered.
  • When data is sorted. Eg - time series, sequential data.
  • When the table is huge, and b-tree index is too expensive to use (especially if a lot of inserts and updates are occurring. re-computing b-tree is extremely expensive).
  • When the data has very high velocity (index re-calculation is cheap, so can keep up with data insert speeds)
CREATE INDEX brin_index_name ON your_table USING brin (column_name);

-- create brin index with custom page block size
CREATE index brin_index_name ON your_table USING brin(column_name) WITH (pages_per_range = 32);

Supported operators

<   <=   =   >=   >

Notes on indexing performance

Choosing single or multi-column index

A query may not necessarily use multiple indexes in a query.

This is why a single multicolumn index vs 2 single column indexes can give vastly different performance.

Use ANALYZE to actually see what the query is using.

Some queries with OR, AND may use multiple indexes for the query. These combined indexes are usually single column indexes - postgres will do a bitwise & and/or * operation over these indexes, and use the result to choose rows.

https://www.postgresql.org/docs/9.1/textsearch-indexes.html