If you’ve worked in software for a little while, you’ve probably had to come up with a unique way to identify some entity. Usually this is as a primary key for a record sitting in a database but it can really be anything. For simple systems, a monotonically increasing integer works (
id = 1, 2, 3 and so on). When dealing with distributed systems however, we generally end up using UUIDs.
UUIDs are a fairly old concept that allows you to generate a unique identifier without some central authority. The reason this is important is because you can have multiple nodes generating multiple records without them having to coordinate with some central server that’s providing a global lock on an id (which you’d have to do if you wanted integers like 1, 2, 3 and so on). But what makes UUIDs safe and unique? Could we generate our own UUIDs?
Luckily, there’s a specification for UUIDs, RFC 4122. In it, we can see that there are multiple ways to generate a UUID. I’m going to focus on versions number 1 and 4 since those are the most popular (especially version 4) but the other versions are also sound.
In order to create a Version 1 UUID, we need to rely on the system time (which according to the spec should be accurate to within 100 nanoseconds). Most of the UUID is derived from this accurate timestamp with various parts being used in various places. The final bits of the UUID are derived from the MAC address.
I won’t go into detail covering which parts get used where (the spec does that), but you can see at a glance why this is safe. Suppose I have a lot of different machines spitting out these UUIDs. We have two cases:
- The MAC address is unique among all nodes - I will always get a unique UUID because the MAC address uniquely identifies the node
- The MAC address is shared among some or all nodes
The MAC address being shared presents some difficulty. What this means is that I can never generate a UUID within the same 100 nanoseconds as another node in order to guarantee uniqueness. Moreover on a single node I can never generate two or more UUIDs within the same 100 nanoseconds. The specification supports this deduction. It’s also potentially problematic if I have clock drift among nodes in my cluster and they share the same MAC address.
So Version 1 is safe as long as everything in my system has a different MAC address or my clocks are synchronized and I can limit the generation of UUIDs based on my clock resolution. Lets take a look at Version 4.
Version 4 is arguably the safest way to generate a UUID. In this version, UUIDs are just a set of 128-bit pseudorandom bits. No matter which node one of these UUIDs is generated on, the chances of collision with any existing UUID is
1/2^128 which is an astoundingly small number. This SO post is illustrative. While Version 1 guarantees uniqueness if the constraints are met, Version 4 virtually guarantees uniqueness under almost all conditions unless you are incapable of generating psuedorandom numbers.
Making our own UUID
Both Version 1 and Version 4 offer good illustrations of how we might derive our own UUID generator. Version 4 is arguably the simplest, just choose a sufficiently long psuedorandom string of bits and call it a day. Are the chances of collision too high for you with 128 bits? Choose 256 or even 1024. The amount of storage space and the time to generate such UUIDs does increase but they will almost never collide.
On the other hand, you can choose a monotonically increasing value on the system such as time and couple it with a semi or guaranteed unique identifier for the node such as
hostname. Notably this value should be something you either have control over or something that is roughly guaranteed to be unique to that node. You should not choose something like the exact version of the operating system for instance. You could even choose a psuedorandom bit string to identify the host (version 4 style) and then pair that with your time value if you want to mix and match.