Niraj Zade

UUID - versions, details, use cases

2024 Sep 13  |  17 min read  |  tags: distributed-systems 1 software-engineering 3

Table Of Contents:

  • Quick note on UUID and database types
  • work in progress
    work in progress
    work in progress
    work in progress
    
    this note has no flow, and is a dump of information at the moment
    

    Unique identifiers are at the core of distributed systems. The uuid family of identifier generators are the most wisely used ones. They have been implemented by standard libraries of almost all languages.

    But, each uuid works differently and serves a specific purpose. The choice of uuid greatly affects the performance of the systems you architect and build. Some uuids are fit for one scenario, and completely unfit for others.

    To actually understand when to use which uuid, you have to understand the underlying uuid itself. The authoritative sources to understand uuids are the rfcs that specify them. This note is the result of a study of those rfcs.

    All uuids are 128bit in size (32 digits). UUIDs aren't guaranteed to be collision resistant. The assumption when using UUID is that there are so many possible UUIDs, that even if we randomly choose one, the possibility of collision will practically be zero.

    For an absolutely collision resistant system, setup and use a Snowflake Id cluster. But it has operational challenges of its own, and becomes a centralized point of failure for your distributed system.

    Uuid collisions are mathematically possible

    So, UUIDs are virtually globally unique. But without guarantees. We still rely on them.

    If you truly need a system which mathematically won't have unique id collisions, you cannot use uuids generated by individual nodes in your distributed system. Instead, you'll have to use a centralized system that generates and distributes unique identifiers. A widely used solution is Snowflake Id (introduced by twitter).

    Uuid versions

    There are 7 current version of uuid. They can be calssified as follows:

    The default choice is v4. But it is not suitable for all scenarios.

    Time based

    v1

    Composition - 48bit MAC address, 60 bit timestamp (100ns resolution)

    Properties - Monotonically increasing

    When will there be a collision - When two uuids are generated within the same 100ns window.

    v6

    Bit Composition (in order) - 1. 32 bits - most significant 32 bits of the 60bit timestamp 2. 16 bits - middle 16 bits of the 60bit timestamp 3. 4 bits - uuid v6 version (0110) 4. 12 bits - least significant 12 bits of the 60bit timestamp 5. 2 bits - uuid variant (10) 6. 6 bits - high portion of clock sequence 7. 8 bits - clock sequence low 8. 48 bits - spatially unique identifier (MAC address)

    Input based

    v3 and v5

    Namespace based generator. These uuids can be deterministic if you want them to be, as these UUIDs are dependant on input. If you give the same input, the uuid function will generate the same output. There is no inherent randomness or time dependence in the uuid generation algorithm.

    These namespace based uuids take two arguments: - namespace - this itself is a uuid object. Use one of the 4 standard namespaces, or use a uuid that you generated. - name - a string that you provide.

    When will there be a collision - When two uuids are generated using the exact same arguments - namespace and name.

    There are 4 standard namespacess defined in the uuid rfc:

    1. NameSpace_DNS - has the value 6ba7b810-9dad-11d1-80b4-00c04fd430c8
    2. NameSpace_URL - has the vlaue 6ba7b811-9dad-11d1-80b4-00c04fd430c8
    3. NameSpace_OID - has the value 6ba7b812-9dad-11d1-80b4-00c04fd430c8
    4. NameSpace_X500 - has the value 6ba7b814-9dad-11d1-80b4-00c04fd430c8

    As you can see, the namespaces aren't random. They have a fixed value, and the same for everyone. These values have been specified in rfc4122

    https://datatracker.ietf.org/doc/html/rfc4122.html
    
    Appendix C
    
    This appendix lists the name space IDs for some potentially
    interesting name spaces, as initialized C structures and in the
    string representation defined above.
    
    /* Name string is a fully-qualified domain name */
    uuid_t NameSpace_DNS = { /* 6ba7b810-9dad-11d1-80b4-00c04fd430c8 */
       0x6ba7b810,
       0x9dad,
       0x11d1,
       0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
    };
    
    /* Name string is a URL */
    uuid_t NameSpace_URL = { /* 6ba7b811-9dad-11d1-80b4-00c04fd430c8 */
       0x6ba7b811,
       0x9dad,
       0x11d1,
       0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
    };
    
    /* Name string is an ISO OID */
    uuid_t NameSpace_OID = { /* 6ba7b812-9dad-11d1-80b4-00c04fd430c8 */
       0x6ba7b812,
       0x9dad,
       0x11d1,
       0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
    };
    
    /* Name string is an X.500 DN (in DER or a text output format) */
    uuid_t NameSpace_X500 = { /* 6ba7b814-9dad-11d1-80b4-00c04fd430c8 */
       0x6ba7b814,
       0x9dad,
       0x11d1,
       0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8
    };
    
    v3 versus v5

    The only difference is that v3 uses md5 hashing, and v5 uses sha1 hashing.

    If you know sha1, then you know that it generates 160bits of data. But uuid is only 128bit long. The rfc specifies how to condense these 160bits of sha1, and fit them into 128bits of uuid.

    Following is the implementation of uuid v3 and v5 in python's source code:

    # /python/uuid.py
    
    def uuid3(namespace, name):
        """Generate a UUID from the MD5 hash of a namespace UUID and a name."""
        from hashlib import md5
        digest = md5(
            namespace.bytes + bytes(name, "utf-8"),
            usedforsecurity=False
        ).digest()
        return UUID(bytes=digest[:16], version=3)
    
    def uuid5(namespace, name):
        """Generate a UUID from the SHA-1 hash of a namespace UUID and a name."""
        from hashlib import sha1
        hash = sha1(namespace.bytes + bytes(name, "utf-8")).digest()
        return UUID(bytes=hash[:16], version=5)
    
    # The following standard UUIDs are for use with uuid3() or uuid5().
    
    NAMESPACE_DNS = UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')
    NAMESPACE_URL = UUID('6ba7b811-9dad-11d1-80b4-00c04fd430c8')
    NAMESPACE_OID = UUID('6ba7b812-9dad-11d1-80b4-00c04fd430c8')
    NAMESPACE_X500 = UUID('6ba7b814-9dad-11d1-80b4-00c04fd430c8')
    

    Randomness based

    v4

    This is also the most commonly used uuid. When people generally think of uuid, what they think is UUID v4.

    Bit Composition - uuid v4 has 6 fixed bits to indicate version and variant, and 122 bits of random value.

    v4 is built to be universally unique across the universe. Note that collisions aren't mathematically impossible. There can be collisions in uuid v4. But the probability of it is so small, that we assume that the uuid will be universally unique.

    When will there be a collision - When two machines generating the uuids generate the exact same random numbers.

    If your application demands absolutely unique identifiers, and even uuid v4 is not a good fit, then switch over to using the snowflake id system. In it, the centralized system will provide unique identifiers to your system. Those identifiers are guaranteed to be unique within your system. Snowflake IDs are 64bits in length, so they also fit natively into your databases, into the bigint datatype.

    Time and randomness based

    v7

    This version is preferred over v6 and v1

    Time ordered value field is derived from the unix epoch timestamp (milliseconds since midnight 1 Jan 1970 UTC, without leap seconds).

    Bit composition (in order) - 1. 48 bits - unsigned unix epoch timestamp in big-endian format (big-endian = most significant value first) 2. 4 bits - UUIDv7 version 3. 12 bits - pseudo random data for uniqueness 4. 2 bit - variant 5. 62 bits - pseudo random data for uniqueness

    Nil and Max uuid

    Treat these two as reserved uuid. They aren't used for anything. Just that the specifications mention them.

    Choosing uuid based on requirements

    Requirement - Anonymity

    In this case, the generated uuid should have no relation to the generator node. So : - any UUID version with MAC address in it can't be used. - uuid v3 and v5 can't be used if a generator-node related string is being given as an argument to the uuid function

    uuid provides anonymity? reason
    v1 no uuid includes mac address
    v2 - -
    v3 depends 2 fields provided for generation should not be bound to the generating node
    v4 yes completely randomly generated
    v5 depends 2 fields provided for generation should not be bound to the generating node
    v6 no uuiud includes mac address
    v7 yes Based on timestamp+random data. As long as all the nodes have their clocks synced up, the uuid cannot be traced back to the generating node.
    v8 - -

    Unique keys for database tables

    uuid is preferred? reason
    v1 no timestamp based, but not sortable
    v2 - -
    v3 no md5 hash, so will be randomly spread out
    v4 no random, so will be spread out
    v5 no sha1hash, so it is random. So it is spread out
    v6 yes (strong yes) monotonically increasing, so linear & sortable
    v7 yes (strong yes) unix timestamp based, so linear & sortable
    v8 - -

    Depends on the use case. If you're using something like dynamodb, where you want to spread out your reads and writes (across shards), then randomness based uuid4 is a good fit. But these category of databases (dynamodb, cassandra etc) already hash the primary key (consistent hashing) to spread out the load, so the randomness of your primary key doesn't matter.

    If you're using something like postgres, where you need to focus your reads and writes to a limited set of database pages, then spread-out unsortable UUID versions (like v, v4etc) is a horrible idea.

    If you're using a dbms, then using one of the monotonically increasing timestamp-based uuid is a great idea (prefer to use v7).

    The UUIDs in order of preference are:

    1. uuid v7
    2. uuid v6

    However, the specifications of v6 and v7 are in draft phase (as of 2024, in RFC9562). However, implementations of v6 are available for most languages. If needed, you can write one for yourself. It is doable.

    Avoid using UUIDv1. Even though it is timestamp based, It does not sort well because of the way its bits are ordered. The problem is that the starting bits are random, and the ending bits are stable. We want it to be the other way around, to avoid fragmentation in out database storage. UUIDv6 solved this issue by re-arranging the bits.

    Example of UUIDv1:

    848039a9-7979-11ef-8052-f48c526f1a9e  
    875f7903-7979-11ef-bda6-f48c526f1a9e  
    880cdb52-7979-11ef-9123-f48c526f1a9e
    

    Need uuid for using as a key for database (event id, transaction id, user id etc). So, the IDs generated have to be sortable, so that they a aren't randomly spread out throughout the database pages. This reduces the spread of pages that have to be read from the database for a query, and greatly improves performance. (b-tree etc).

    uuid v4 is horrible for database performance. The keys will end up being spread out through the database. This will increase the number of pages that have to be accessed/modified. If the uuid4 column is indexed, it will also make index updates much more inefficient because changes will have to be made all over the btree/b+tree. There will be too many cache misses, and performance will be comparatively horrible.

    Solutions - use timestamp orderable uuids, that are sortable. This will ensure a degree of sequence in the stored data, and will make queries and indexes much faster. Also, because the uuids are linear, you can now also run range queries (eg - find all events that happened between two uuids).

    When a new row in inserted against the key, it will cause updates in random pages. This random nature will reduce throughput. Sortable uuids also end up making joins faster with some db join algorithms.

    Known issues/limitations

    There are no solutions, only tradeoffs

    There is no true solution. There is no distributed unique identifier generator without collisions. There are only tradeoffs.

    The true solution of unique identifiers is something like the snowflake system - a centralized system that generates and provides unique identifiers to all downstream systems. This unique identifier system is the source of truth. And if it goes down, your system will come to a stop.

    And if you keep the system de-centralized, then you use one of the uuid versions, and open up your system to the possibility of collisions.

    Like much of problem solving - there are no true solutions. Only tradeoffs.

    Quick note on UUID and database types

    If you're using UUID as a primary key for databases, then the type of UUID you use depends on the kind of database you're using.

    Random UUIDs will spread writes & reads across the database. This behaviour will affect your DB's performance

    Resources:

    All Articles

    Blog

    Resources