work in progress
work in progress
work in progress
work in progress
this note has no flow, and is a dump of information at the moment
Unique identifiers are at the core of distributed systems. The uuid family of identifier generators are the most wisely used ones. They have been implemented by standard libraries of almost all languages.
But, each uuid works differently and serves a specific purpose. The choice of uuid greatly affects the performance of the systems you architect and build. Some uuids are fit for one scenario, and completely unfit for others.
To actually understand when to use which uuid, you have to understand the underlying uuid itself. The authoritative sources to understand uuids are the rfcs that specify them. This note is the result of a study of those rfcs.
All uuids are 128bit in size (32 digits). UUIDs aren't guaranteed to be collision resistant. The assumption when using UUID is that there are so many possible UUIDs, that even if we randomly choose one, the possibility of collision will practically be zero.
For an absolutely collision resistant system, setup and use a Snowflake Id cluster. But it has operational challenges of its own, and becomes a centralized point of failure for your distributed system.
Uuid collisions are mathematically possible
So, UUIDs are virtually globally unique. But without guarantees. We still rely on them.
If you truly need a system which mathematically won't have unique id collisions, you cannot use uuids generated by individual nodes in your distributed system. Instead, you'll have to use a centralized system that generates and distributes unique identifiers. A widely used solution is Snowflake Id (introduced by twitter).
Uuid versions
There are 7 current version of uuid. They can be calssified as follows:
- time based
- v1
- v6
- randomness based
- v4
- time and randomness based
- v7
input based
- v3
- v5
v1 - monotonically increasing - 48bit MAC address + 60bit timestamp (100ns resolution)
- v2 - ??? unknown
- v3 - 128 bit md5 hash calculated from a string you provide
- v4 - 6bits fixed (to indicate version and variant)+ 122bits random sequence. Built to be universally unique across the universe. Can have collisions, but the probability is vanishingly small.
- most commonly used and implemented uuid
- there are more variants within uuid4 itself
- v5 - deterministic, calculated as a sha1 hash of 2 fields that you provide
- use this in favour of v3, because sha1 hashing is superior (md5 collisions are possible). sha1 also gets calculated faster, as it has more parallelism. (sha1 is more compute intensive than md5, but sha1 is also parallelizable. So, despite needing more compute power, sha1 is calculated much faster when you measure the actual time taken to generate the hash)
- v6 - monotonically increasing counter, mac+timestamp based
- v7 - timestamp+random data
- v8 - ??? idk
The default choice is v4. But it is not suitable for all scenarios.
Time based
v1
Composition - 48bit MAC address, 60 bit timestamp (100ns resolution)
Properties - Monotonically increasing
When will there be a collision - When two uuids are generated within the same 100ns window.
v6
- Improvement over uuid v1. Bitwise, it is compatible with v1.
- Time is encoded from high to low bytes. So, uuids when sorted, will be sorted by time.
Bit Composition (in order) - 1. 32 bits - most significant 32 bits of the 60bit timestamp 2. 16 bits - middle 16 bits of the 60bit timestamp 3. 4 bits - uuid v6 version (0110) 4. 12 bits - least significant 12 bits of the 60bit timestamp 5. 2 bits - uuid variant (10) 6. 6 bits - high portion of clock sequence 7. 8 bits - clock sequence low 8. 48 bits - spatially unique identifier (MAC address)
Input based
v3 and v5
Namespace based generator. These uuids can be deterministic if you want them to be, as these UUIDs are dependant on input. If you give the same input, the uuid function will generate the same output. There is no inherent randomness or time dependence in the uuid generation algorithm.
These namespace based uuids take two arguments:
- namespace
- this itself is a uuid object. Use one of the 4 standard namespaces, or use a uuid that you generated.
- name
- a string that you provide.
When will there be a collision - When two uuids are generated using the exact same arguments - namespace and name.
There are 4 standard namespacess defined in the uuid rfc:
- NameSpace_DNS - has the value
6ba7b810-9dad-11d1-80b4-00c04fd430c8
- NameSpace_URL - has the vlaue
6ba7b811-9dad-11d1-80b4-00c04fd430c8
- NameSpace_OID - has the value
6ba7b812-9dad-11d1-80b4-00c04fd430c8
- NameSpace_X500 - has the value
6ba7b814-9dad-11d1-80b4-00c04fd430c8
As you can see, the namespaces aren't random. They have a fixed value, and the same for everyone. These values have been specified in rfc4122
https://datatracker.ietf.org/doc/html/rfc4122.html
Appendix C
This appendix lists the name space IDs for some potentially
interesting name spaces, as initialized C structures and in the
string representation defined above.
/* Name string is a fully-qualified domain name */
uuid_t NameSpace_DNS = { /* 6ba7b810-9dad-11d1-80b4-00c04fd430c8 */
0x6ba7b810,
0x9dad,
0x11d1,
0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};
/* Name string is a URL */
uuid_t NameSpace_URL = { /* 6ba7b811-9dad-11d1-80b4-00c04fd430c8 */
0x6ba7b811,
0x9dad,
0x11d1,
0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};
/* Name string is an ISO OID */
uuid_t NameSpace_OID = { /* 6ba7b812-9dad-11d1-80b4-00c04fd430c8 */
0x6ba7b812,
0x9dad,
0x11d1,
0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};
/* Name string is an X.500 DN (in DER or a text output format) */
uuid_t NameSpace_X500 = { /* 6ba7b814-9dad-11d1-80b4-00c04fd430c8 */
0x6ba7b814,
0x9dad,
0x11d1,
0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
};
v3 versus v5
The only difference is that v3 uses md5 hashing, and v5 uses sha1 hashing.
If you know sha1, then you know that it generates 160bits of data. But uuid is only 128bit long. The rfc specifies how to condense these 160bits of sha1, and fit them into 128bits of uuid.
Following is the implementation of uuid v3 and v5 in python's source code:
# /python/uuid.py
def uuid3(namespace, name):
"""Generate a UUID from the MD5 hash of a namespace UUID and a name."""
from hashlib import md5
digest = md5(
namespace.bytes + bytes(name, "utf-8"),
usedforsecurity=False
).digest()
return UUID(bytes=digest[:16], version=3)
def uuid5(namespace, name):
"""Generate a UUID from the SHA-1 hash of a namespace UUID and a name."""
from hashlib import sha1
hash = sha1(namespace.bytes + bytes(name, "utf-8")).digest()
return UUID(bytes=hash[:16], version=5)
# The following standard UUIDs are for use with uuid3() or uuid5().
NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_URL = UUID('6ba7b811-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_OID = UUID('6ba7b812-9dad-11d1-80b4-00c04fd430c8')
NAMESPACE_X500 = UUID('6ba7b814-9dad-11d1-80b4-00c04fd430c8')
Randomness based
v4
This is also the most commonly used uuid. When people generally think of uuid, what they think is UUID v4.
Bit Composition - uuid v4 has 6 fixed bits to indicate version and variant, and 122 bits of random value.
v4 is built to be universally unique across the universe. Note that collisions aren't mathematically impossible. There can be collisions in uuid v4. But the probability of it is so small, that we assume that the uuid will be universally unique.
When will there be a collision - When two machines generating the uuids generate the exact same random numbers.
If your application demands absolutely unique identifiers, and even uuid v4 is not a good fit, then switch over to using the snowflake id system. In it, the centralized system will provide unique identifiers to your system. Those identifiers are guaranteed to be unique within your system. Snowflake IDs are 64bits in length, so they also fit natively into your databases, into the bigint datatype.
- v4 - 6bits fixed (to indicate version and variant)+ 122bits random sequence. Built to be universally unique across the universe. Can have collisions, but the probability is vanishingly small.
- most commonly used and implemented uuid
- there are more variants within uuid4 itself
Time and randomness based
v7
This version is preferred over v6 and v1
Time ordered value field is derived from the unix epoch timestamp (milliseconds since midnight 1 Jan 1970 UTC, without leap seconds).
Bit composition (in order) - 1. 48 bits - unsigned unix epoch timestamp in big-endian format (big-endian = most significant value first) 2. 4 bits - UUIDv7 version 3. 12 bits - pseudo random data for uniqueness 4. 2 bit - variant 5. 62 bits - pseudo random data for uniqueness
Nil and Max uuid
Treat these two as reserved uuid. They aren't used for anything. Just that the specifications mention them.
- Nil uuid has all 128 bits set to zero -
00000000-0000-0000-0000-000000000000
- Max uuid has all 128 bits set to 1 -
FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF
Choosing uuid based on requirements
Requirement - Anonymity
In this case, the generated uuid should have no relation to the generator node. So : - any UUID version with MAC address in it can't be used. - uuid v3 and v5 can't be used if a generator-node related string is being given as an argument to the uuid function
uuid | provides anonymity? | reason |
---|---|---|
v1 | no | uuid includes mac address |
v2 | - | - |
v3 | depends | 2 fields provided for generation should not be bound to the generating node |
v4 | yes | completely randomly generated |
v5 | depends | 2 fields provided for generation should not be bound to the generating node |
v6 | no | uuiud includes mac address |
v7 | yes | Based on timestamp+random data. As long as all the nodes have their clocks synced up, the uuid cannot be traced back to the generating node. |
v8 | - | - |
Unique keys for database tables
uuid | is preferred? | reason |
---|---|---|
v1 | no | timestamp based, but not sortable |
v2 | - | - |
v3 | no | md5 hash, so will be randomly spread out |
v4 | no | random, so will be spread out |
v5 | no | sha1hash, so it is random. So it is spread out |
v6 | yes (strong yes) | monotonically increasing, so linear & sortable |
v7 | yes (strong yes) | unix timestamp based, so linear & sortable |
v8 | - | - |
Depends on the use case. If you're using something like dynamodb, where you want to spread out your reads and writes (across shards), then randomness based uuid4 is a good fit. But these category of databases (dynamodb, cassandra etc) already hash the primary key (consistent hashing) to spread out the load, so the randomness of your primary key doesn't matter.
If you're using something like postgres, where you need to focus your reads and writes to a limited set of database pages, then spread-out unsortable UUID versions (like v, v4etc) is a horrible idea.
If you're using a dbms, then using one of the monotonically increasing timestamp-based uuid is a great idea (prefer to use v7).
The UUIDs in order of preference are:
- uuid v7
- uuid v6
However, the specifications of v6 and v7 are in draft phase (as of 2024, in RFC9562). However, implementations of v6 are available for most languages. If needed, you can write one for yourself. It is doable.
Avoid using UUIDv1. Even though it is timestamp based, It does not sort well because of the way its bits are ordered. The problem is that the starting bits are random, and the ending bits are stable. We want it to be the other way around, to avoid fragmentation in out database storage. UUIDv6 solved this issue by re-arranging the bits.
Example of UUIDv1:
848039a9-7979-11ef-8052-f48c526f1a9e
875f7903-7979-11ef-bda6-f48c526f1a9e
880cdb52-7979-11ef-9123-f48c526f1a9e
Need uuid for using as a key for database (event id, transaction id, user id etc). So, the IDs generated have to be sortable, so that they a aren't randomly spread out throughout the database pages. This reduces the spread of pages that have to be read from the database for a query, and greatly improves performance. (b-tree etc).
uuid v4 is horrible for database performance. The keys will end up being spread out through the database. This will increase the number of pages that have to be accessed/modified. If the uuid4 column is indexed, it will also make index updates much more inefficient because changes will have to be made all over the btree/b+tree. There will be too many cache misses, and performance will be comparatively horrible.
Solutions - use timestamp orderable uuids, that are sortable. This will ensure a degree of sequence in the stored data, and will make queries and indexes much faster. Also, because the uuids are linear, you can now also run range queries (eg - find all events that happened between two uuids).
When a new row in inserted against the key, it will cause updates in random pages. This random nature will reduce throughput. Sortable uuids also end up making joins faster with some db join algorithms.
Known issues/limitations
- UUIDs are very large in size and consume a lot of space at scale.
- If storage becomes a major concern, switching over to 64bit snowflake id is an option. But it drastically changes your system from an entirely distributed system, to becoming dependent on the dedicated snowflake id servers for unique identifiers. Make sure this is a cost worth taking up.
- Snowflake id play well with the databases - they are 64 bits, so can be stored as bigint.
- Postgres and most databases only generate UUIDv4. So it is terrible for performance for many scenarios. It performance is going to be a bottleneck, your only option is to make the application layer generate the uuid to store it in postgres as a binary datatype.
There are no solutions, only tradeoffs
There is no true solution. There is no distributed unique identifier generator without collisions. There are only tradeoffs.
The true solution of unique identifiers is something like the snowflake system - a centralized system that generates and provides unique identifiers to all downstream systems. This unique identifier system is the source of truth. And if it goes down, your system will come to a stop.
And if you keep the system de-centralized, then you use one of the uuid versions, and open up your system to the possibility of collisions.
Like much of problem solving - there are no true solutions. Only tradeoffs.