Niraj Zade | Website is a work in progress.

UUID - versions, details, use cases

2024 Sep 13  |  16 min read  |  tags: distributed-systems 1 software-engineering 3


work in progress
work in progress
work in progress
work in progress

this note has no flow, and is a dump of information at the moment

Unique identifiers are at the core of distributed systems. The uuid family of identifier generators are the most wisely used ones. They have been implemented by standard libraries of almost all languages.

But, each uuid works differently and serves a specific purpose. The choice of uuid greatly affects the performance of the systems you architect and build. Some uuids are fit for one scenario, and completely unfit for others.

To actually understand when to use which uuid, you have to understand the underlying uuid itself. The authoritative sources to understand uuids are the rfcs that specify them. This note is the result of a study of those rfcs.

All uuids are 128bit in size (32 digits). UUIDs aren't guaranteed to be collision resistant. The assumption when using UUID is that there are so many possible UUIDs, that even if we randomly choose one, the possibility of collision will practically be zero.

For an absolutely collision resistant system, setup and use a Snowflake Id cluster. But it has operational challenges of its own, and becomes a centralized point of failure for your distributed system.

Uuid collisions are mathematically possible

So, UUIDs are virtually globally unique. But without guarantees. We still rely on them.

If you truly need a system which mathematically won't have unique id collisions, you cannot use uuids generated by individual nodes in your distributed system. Instead, you'll have to use a centralized system that generates and distributes unique identifiers. A widely used solution is Snowflake Id (introduced by twitter).

Uuid versions

There are 7 current version of uuid. They can be calssified as follows:

The default choice is v4. But it is not suitable for all scenarios.

Time based

v1

Composition - 48bit MAC address, 60 bit timestamp (100ns resolution)

Properties - Monotonically increasing

When will there be a collision - When two uuids are generated within the same 100ns window.

v6

Bit Composition (in order) - 1. 32 bits - most significant 32 bits of the 60bit timestamp 2. 16 bits - middle 16 bits of the 60bit timestamp 3. 4 bits - uuid v6 version (0110) 4. 12 bits - least significant 12 bits of the 60bit timestamp 5. 2 bits - uuid variant (10) 6. 6 bits - high portion of clock sequence 7. 8 bits - clock sequence low 8. 48 bits - spatially unique identifier (MAC address)

Input based

v3 and v5

Namespace based generator. These uuids can be deterministic if you want them to be, as these UUIDs are dependant on input. If you give the same input, the uuid function will generate the same output. There is no inherent randomness or time dependence in the uuid generation algorithm.

These namespace based uuids take two arguments: - namespace - this itself is a uuid object. Use one of the 4 standard namespaces, or use a uuid that you generated. - name - a string that you provide.

When will there be a collision - When two uuids are generated using the exact same arguments - namespace and name.

There are 4 standard namespacess defined in the uuid rfc:

  1. NameSpace_DNS - has the value 6ba7b810-9dad-11d1-80b4-00c04fd430c8
  2. NameSpace_URL - has the vlaue 6ba7b811-9dad-11d1-80b4-00c04fd430c8
  3. NameSpace_OID - has the value 6ba7b812-9dad-11d1-80b4-00c04fd430c8
  4. NameSpace_X500 - has the value 6ba7b814-9dad-11d1-80b4-00c04fd430c8

As you can see, the namespaces aren't random. They have a fixed value, and the same for everyone. These values have been specified in rfc4122

https://datatracker.ietf.org/doc/html/rfc4122.html

Appendix C

This appendix lists the name space IDs for some potentially
interesting name spaces, as initialized C structures and in the
string representation defined above.

/* Name string is a fully-qualified domain name */
uuid_t NameSpace_DNS = { /* 6ba7b810-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b810,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};

/* Name string is a URL */
uuid_t NameSpace_URL = { /* 6ba7b811-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b811,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};

/* Name string is an ISO OID */
uuid_t NameSpace_OID = { /* 6ba7b812-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b812,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};

/* Name string is an X.500 DN (in DER or a text output format) */
uuid_t NameSpace_X500 = { /* 6ba7b814-9dad-11d1-80b4-00c04fd430c8 */
   0x6ba7b814,
   0x9dad,
   0x11d1,
   0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};
v3 versus v5

The only difference is that v3 uses md5 hashing, and v5 uses sha1 hashing.

If you know sha1, then you know that it generates 160bits of data. But uuid is only 128bit long. The rfc specifies how to condense these 160bits of sha1, and fit them into 128bits of uuid.

Following is the implementation of uuid v3 and v5 in python's source code:

# /python/uuid.py

def uuid3(namespace, name):
    """Generate a UUID from the MD5 hash of a namespace UUID and a name."""
    from hashlib import md5
    digest = md5(
        namespace.bytes + bytes(name, "utf-8"),
        usedforsecurity=False
    ).digest()
    return UUID(bytes=digest[:16], version=3)

def uuid5(namespace, name):
    """Generate a UUID from the SHA-1 hash of a namespace UUID and a name."""
    from hashlib import sha1
    hash = sha1(namespace.bytes + bytes(name, "utf-8")).digest()
    return UUID(bytes=hash[:16], version=5)

# The following standard UUIDs are for use with uuid3() or uuid5().

NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_URL = UUID('6ba7b811-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_OID = UUID('6ba7b812-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_X500 = UUID('6ba7b814-9dad-11d1-80b4-00c04fd430c8')

Randomness based

v4

This is also the most commonly used uuid. When people generally think of uuid, what they think is UUID v4.

Bit Composition - uuid v4 has 6 fixed bits to indicate version and variant, and 122 bits of random value.

v4 is built to be universally unique across the universe. Note that collisions aren't mathematically impossible. There can be collisions in uuid v4. But the probability of it is so small, that we assume that the uuid will be universally unique.

When will there be a collision - When two machines generating the uuids generate the exact same random numbers.

If your application demands absolutely unique identifiers, and even uuid v4 is not a good fit, then switch over to using the snowflake id system. In it, the centralized system will provide unique identifiers to your system. Those identifiers are guaranteed to be unique within your system. Snowflake IDs are 64bits in length, so they also fit natively into your databases, into the bigint datatype.

Time and randomness based

v7

This version is preferred over v6 and v1

Time ordered value field is derived from the unix epoch timestamp (milliseconds since midnight 1 Jan 1970 UTC, without leap seconds).

Bit composition (in order) - 1. 48 bits - unsigned unix epoch timestamp in big-endian format (big-endian = most significant value first) 2. 4 bits - UUIDv7 version 3. 12 bits - pseudo random data for uniqueness 4. 2 bit - variant 5. 62 bits - pseudo random data for uniqueness

Nil and Max uuid

Treat these two as reserved uuid. They aren't used for anything. Just that the specifications mention them.

Choosing uuid based on requirements

Requirement - Anonymity

In this case, the generated uuid should have no relation to the generator node. So : - any UUID version with MAC address in it can't be used. - uuid v3 and v5 can't be used if a generator-node related string is being given as an argument to the uuid function

uuid provides anonymity? reason
v1 no uuid includes mac address
v2 - -
v3 depends 2 fields provided for generation should not be bound to the generating node
v4 yes completely randomly generated
v5 depends 2 fields provided for generation should not be bound to the generating node
v6 no uuiud includes mac address
v7 yes Based on timestamp+random data. As long as all the nodes have their clocks synced up, the uuid cannot be traced back to the generating node.
v8 - -

Unique keys for database tables

uuid is preferred? reason
v1 no timestamp based, but not sortable
v2 - -
v3 no md5 hash, so will be randomly spread out
v4 no random, so will be spread out
v5 no sha1hash, so it is random. So it is spread out
v6 yes (strong yes) monotonically increasing, so linear & sortable
v7 yes (strong yes) unix timestamp based, so linear & sortable
v8 - -

Depends on the use case. If you're using something like dynamodb, where you want to spread out your reads and writes (across shards), then randomness based uuid4 is a good fit. But these category of databases (dynamodb, cassandra etc) already hash the primary key (consistent hashing) to spread out the load, so the randomness of your primary key doesn't matter.

If you're using something like postgres, where you need to focus your reads and writes to a limited set of database pages, then spread-out unsortable UUID versions (like v, v4etc) is a horrible idea.

If you're using a dbms, then using one of the monotonically increasing timestamp-based uuid is a great idea (prefer to use v7).

The UUIDs in order of preference are:

  1. uuid v7
  2. uuid v6

However, the specifications of v6 and v7 are in draft phase (as of 2024, in RFC9562). However, implementations of v6 are available for most languages. If needed, you can write one for yourself. It is doable.

Avoid using UUIDv1. Even though it is timestamp based, It does not sort well because of the way its bits are ordered. The problem is that the starting bits are random, and the ending bits are stable. We want it to be the other way around, to avoid fragmentation in out database storage. UUIDv6 solved this issue by re-arranging the bits.

Example of UUIDv1:

848039a9-7979-11ef-8052-f48c526f1a9e  
875f7903-7979-11ef-bda6-f48c526f1a9e  
880cdb52-7979-11ef-9123-f48c526f1a9e

Need uuid for using as a key for database (event id, transaction id, user id etc). So, the IDs generated have to be sortable, so that they a aren't randomly spread out throughout the database pages. This reduces the spread of pages that have to be read from the database for a query, and greatly improves performance. (b-tree etc).

uuid v4 is horrible for database performance. The keys will end up being spread out through the database. This will increase the number of pages that have to be accessed/modified. If the uuid4 column is indexed, it will also make index updates much more inefficient because changes will have to be made all over the btree/b+tree. There will be too many cache misses, and performance will be comparatively horrible.

Solutions - use timestamp orderable uuids, that are sortable. This will ensure a degree of sequence in the stored data, and will make queries and indexes much faster. Also, because the uuids are linear, you can now also run range queries (eg - find all events that happened between two uuids).

When a new row in inserted against the key, it will cause updates in random pages. This random nature will reduce throughput. Sortable uuids also end up making joins faster with some db join algorithms.

Known issues/limitations

There are no solutions, only tradeoffs

There is no true solution. There is no distributed unique identifier generator without collisions. There are only tradeoffs.

The true solution of unique identifiers is something like the snowflake system - a centralized system that generates and provides unique identifiers to all downstream systems. This unique identifier system is the source of truth. And if it goes down, your system will come to a stop.

And if you keep the system de-centralized, then you use one of the uuid versions, and open up your system to the possibility of collisions.

Like much of problem solving - there are no true solutions. Only tradeoffs.

Table Of Contents:

All Articles

Blog

Resources