UUID - versions, details, use cases

2024 Sep 13  |  15 min read  |  tags: distributed-systems (1)

work in progress
work in progress
work in progress
work in progress

this note has no flow, and is a dump of information at the moment

Unique identifiers are at the core of distributed systems. The uuid family of identifier generators are the most wisely used ones. They have been implemented by standard libraries of almost all languages.

But, each uuid works differently and serves a specific purpose. The choice of uuid greatly affects the performance of the systems you architect and build. Some uuids are fit for one scenario, and completely unfit for others.

To actually understand when to use which uuid, you have to understand the underlying uuid itself. The authoritative sources to understand uuids are the rfcs that specify them. This note is the result of a study of those rfcs.

All uuids are 128bit in size (32 digits). First 2 or 4 bits indicate the uuid type. The rest of the bits are the uuid's actual value. UUIDs aren't guaranteed to be unique. The assumption when using UUID is that the randomness and the space is large enough that collisions won't be possible.

Uuid collisions are mathematically possible

So, UUIDs are virtually globally unique. But without guarantees. We still rely on them.

If you truly need a system which mathematically won't have unique id collisions, you cannot use uuids generated by individual nodes in your distributed system. Instead, you'll have to use a centralized system that generates and distributes unique identifiers. A widely used solution is Snowflake Id (introduced by twitter).

Uuid versions

There are 7 current version of uuid. They can be calssified as follows:

  • time based
    • v1
    • v6
  • randomness based
    • v4
  • time and randomness based
    • v7
  • input based

    • v3
    • v5
  • v1 - monotonically increasing - 48bit MAC address + 60bit timestamp (100ns resolution)

  • v2 - ??? unknown
  • v3 - 128 bit md5 hash calculated from a string you provide
  • v4 - 6bits fixed (to indicate version and variant)+ 122bits random sequence. Built to be universally unique across the universe. Can have collisions, but the probability is vanishingly small.
    • most commonly used and implemented uuid
    • there are more variants within uuid4 itself
  • v5 - deterministic, calculated as a sha1 hash of 2 fields that you provide
    • use this in favour of v3, because sha1 hashing is superior (md5 collisions are possible). sha1 also gets calculated faster, as it has more parallelism. (sha1 is more compute intensive than md5, but sha1 is also parallelizable. So, despite needing more compute power, sha1 is calculated much faster when you measure the actual time taken to generate the hash)
  • v6 - monotonically increasing counter, mac+timestamp based
  • v7 - timestamp+random data
  • v8 - ??? idk

The default choice is v4. But it is not suitable for all scenarios.

Time based uuid

v1

Composition - 48bit MAC address, 60 bit timestamp (100ns resolution)

Properties - Monotonically increasing

When will there be a collision - When two uuids are generated within the same 100ns window.

v6

  • Improvement over uuid v1. Bitwise, it is compatible with v1.
  • Time is encoded from high to low bytes. So, uuids when sorted, will be sorted by time.

Bit Composition (in order) - 1. 32 bits - most significant 32 bits of the 60bit timestamp 2. 16 bits - middle 16 bits of the 60bit timestamp 3. 4 bits - uuid v6 version (0110) 4. 12 bits - least significant 12 bits of the 60bit timestamp 5. 2 bits - uuid variant (10) 6. 6 bits - high portion of clock sequence 7. 8 bits - clock sequence low 8. 48 bits - spatially unique identifier (MAC address)

Input based uuid

v3 and v5

Namespace based generator. These uuids can be deterministic if you want them to be, as these UUIDs are dependant on input. If you give the same input, the uuid function will generate the same output. There is no inherent randomness or time dependence in the uuid generation algorithm.

These namespace based uuids take two arguments: - namespace - this itself is a uuid object. Use one of the 4 standard namespaces, or use a uuid that you generated. - name - a string that you provide.

When will there be a collision - When two uuids are generated using the exact same arguments - namespace and name.

There are 4 standard namespacess defined in the uuid rfc:

  1. NameSpace_DNS - has the value 6ba7b810-9dad-11d1-80b4-00c04fd430c8
  2. NameSpace_URL - has the vlaue 6ba7b811-9dad-11d1-80b4-00c04fd430c8
  3. NameSpace_OID - has the value 6ba7b812-9dad-11d1-80b4-00c04fd430c8
  4. NameSpace_X500 - has the value 6ba7b814-9dad-11d1-80b4-00c04fd430c8

As you can see, the namespaces aren't random. They have a fixed value, and the same for everyone. These values have been specified in rfc4122

https://datatracker.ietf.org/doc/html/rfc4122.html

Appendix C

This appendix lists the name space IDs for some potentially
interesting name spaces, as initialized C structures and in the
string representation defined above.

/* Name string is a fully-qualified domain name */
uuid_t NameSpace_DNS = { /* 6ba7b810-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b810,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};

/* Name string is a URL */
uuid_t NameSpace_URL = { /* 6ba7b811-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b811,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};

/* Name string is an ISO OID */
uuid_t NameSpace_OID = { /* 6ba7b812-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b812,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};

/* Name string is an X.500 DN (in DER or a text output format) */
uuid_t NameSpace_X500 = { /* 6ba7b814-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b814,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};
v3 versus v5

The only difference is that v3 uses md5 hashing, and v5 uses sha1 hashing.

If you know sha1, then you know that it generates 160bits of data. But uuid is only 128bit long. The rfc specifies how to condense these 160bits of sha1, and fit them into 128bits of uuid.

Following is the implementation of uuid v3 and v5 in python's source code:

# /python/uuid.py

def uuid3(namespace, name):
    """Generate a UUID from the MD5 hash of a namespace UUID and a name."""
    from hashlib import md5
    digest = md5(
        namespace.bytes + bytes(name, "utf-8"),
        usedforsecurity=False
    ).digest()
    return UUID(bytes=digest[:16], version=3)

def uuid5(namespace, name):
    """Generate a UUID from the SHA-1 hash of a namespace UUID and a name."""
    from hashlib import sha1
    hash = sha1(namespace.bytes + bytes(name, "utf-8")).digest()
    return UUID(bytes=hash[:16], version=5)

# The following standard UUIDs are for use with uuid3() or uuid5().

NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_URL = UUID('6ba7b811-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_OID = UUID('6ba7b812-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_X500 = UUID('6ba7b814-9dad-11d1-80b4-00c04fd430c8')

Randomness based uuid

v4 - most commonly used

Bit Composition - uuid v4 has 6 fixed bits to indicate version and variant, and 122 bits of random value.

v4 is built to be universally unique across the universe. Note that collisions aren't mathematically impossible. There can be collisions in uuid v4. But the probability of it is so small, that we assume that the uuid will be universally unique.

When will there be a collision - When two machines generating the uuids generate the exact same random numbers.

If your application demands absolutely unique identifiers, and even uuid v4 is not a good fit, then switch over to using the snowflake id system. In it, the centralized system will provide unique identifiers to your system. Those identifiers are guaranteed to be unique within your system. Snowflake IDs are 64bits in length, so they also fit natively into your databases, into the bigint datatype.

  • v4 - 6bits fixed (to indicate version and variant)+ 122bits random sequence. Built to be universally unique across the universe. Can have collisions, but the probability is vanishingly small.
    • most commonly used and implemented uuid
    • there are more variants within uuid4 itself

Time and randomness based uuid

v7

This version is preferred over v6 and v1

Time ordered value field is derived from the unix epoch timestamp (milliseconds since midnight 1 Jan 1970 UTC, without leap seconds).

Bit composition (in order) - 1. 48 bits - unsigned unix epoch timestamp in big-endian format (big-endian = most significant value first) 2. 4 bits - UUIDv7 version 3. 12 bits - pseudo random data for uniqueness 4. 2 bit - variant 5. 62 bits - pseudo random data for uniqueness

Nil and Max uuid

Treat these two as reserved uuid. They aren't used for anything. Just that the specifications mention them.

  • Nil uuid has all 128 bits set to zero - 00000000-0000-0000-0000-000000000000
  • Max uuid has all 128 bits set to 1 - FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF

Choosing uuid based on requirements

Requirement - Anonymity

Need complete anonymity? The generated uuid shouldn't be able to be traced back to the generating node.

uuid provides anonymity? reason
v1 no uuid includes mac address
v2 - -
v3 depends 2 fields provided for generation should not be bound to the generating node
v4 yes completely randomly generated
v5 depends 2 fields provided for generation should not be bound to the generating node
v6 no uuiud includes mac address
v7 yes Based on timestamp+random data. As long as all the nodes have their clocks synced up, the uuid cannot be traced back to the generating node.
v8 - -

Unique keys for database tables

Depends on the use case. If you're using something like dynamodb, where you want to spread out your reads and writes (across shards), then randomness based uuid4 is a good fit. But these category of databases (dynamodb, cassandra etc) already hash the primary key (consistent hashing) to spread out the load, so the randomness of your primary key doesn't matter.

If you're using something like postgres, where you need to focus your reads and writes to a limited set of database pages, then uuid4 is a horrible idea.

If you're using a dbms, then using one of the monotonically increasing timestamp-based uuid is a great idea (prefer to use v7).

The UUIDs in order of preference are:

  1. uuid v7
  2. uuid v6

Just avoid v1. It sorts horribly because of the way its bits are ordered.

uuid is preferred? reason
v1 yes timestamp based, so linear & sortable
v2 - -
v3 no md5 hash, so will be randomly spread out
v4 no random, so will be spread out
v5 no sha1hash, so it is random. So it is spread out
v6 yes (strong yes) monotonically increasing, so linear & sortable
v7 yes (strong yes) unix timestamp based, so linear & sortable
v8 - -

Need uuid for using as a key for database (event id, transaction id, user id etc). So, the IDs generated have to be sortable, so that they a aren't randomly spread out throughout the database pages. This reduces the spread of pages that have to be read from the database for a query, and greatly improves performance. (b-tree etc).

uuid v4 is horrible for database performance. The keys will end up being spread out through the database. This will increase the number of pages that have to be accessed/modified. If the uuid4 column is indexed, it will also make index updates much more inefficient because changes will have to be made all over the btree/b+tree. There will be too many cache misses, and performance will be comparatively horrible.

Solutions - use timestamp orderable uuids, that are sortable. This will ensure a degree of sequence in the stored data, and will make queries and indexes much faster. Also, because the uuids are linear, you can now also run range queries (eg - find all events that happened between two uuids).

When a new row in inserted against the key, it will cause updates in random pages. This random nature will reduce throughput. Sortable uuids also end up making joins faster with some db join algorithms.

Known issues/limitations

  • UUIDs are very large in size and consume a lot of space at scale.
  • If storage becomes a major concern, switching over to 64bit snowflake id is an option. But it drastically changes your system from an entirely distributed system, to becoming dependent on the dedicated snowflake id servers for unique identifiers. Make sure this is a cost worth taking up.
    • Snowflake id play well with the databases - they are 64 bits, so can be stored as bigint.
  • Postgres and most databases only generate UUIDv4. So it is terrible for performance for many scenarios. It performance is going to be a bottleneck, your only option is to make the application layer generate the uuid to store it in postgres as a binary datatype.

There are no solutions, only tradeoffs

There is no true solution. There is no distributed unique identifier generator without collisions. There are only tradeoffs.

The true solution of unique identifiers is something like the snowflake system - a centralized system that generates and provides unique identifiers to all downstream systems. This unique identifier system is the source of truth. And if it goes down, your system will come to a stop.

And if you keep the system de-centralized, then you use one of the uuid versions, and open up your system to the possibility of collisions.

Like much of problem solving - there are no true solutions. Only tradeoffs.