Summary
UUIDs are 128-bit numbers that look like this:
c5cba9cb-7109-40f8-96d2-346dc83f3a1f
They can be generated by most programming languages and database systems. E.g., in Python:
>>> import uuid
>>> str(uuid.uuid4())
'56b20fff-b22e-4117-9288-b9da686d299b'
They have very strong uniqueness guarantees. You could generate billions of them for years without finding a duplicate. Being unique is a property of the way they are created; it requires no coordination, so you can generate them from different sources all over the world.
They are appealing as unique identifiers, such as database primary keys. While they are bigger and slower than plain integers, they do save a lot of headaches, and are a very nice tool to have at your disposal.
Apparently, it's important
With UUID7 being added to modern Python and Postgres, there has been a debate about exposing internal IDs to the outside world and the consequences for security.
Some people advise using a public ID that is different from the internal unique ID of your DB to reduce the attack surface. Others point out that the timestamp part of UUID7 is metadata that can be used to infer some information (e.g., imagine you get a medical test for a disease: anyone with the ID knows when the procedure happened).
When I read such a debate, I always remember that most systems out there are still using some incremental primary key, or use a natural key. And that a lot of teams don't even think about what type of ID they are using.
Let's explain what this stuff is, so you can make those decisions yourself.
The need for uniqueness
If I talk about the movie Dracula, which one am I talking about? There are so many!
You may start adding information about it to differentiate them, like the director. The one by George Melford or the one by John Badham? But then you hit the problem that Coppola's movie, probably the most famous one, is not named Dracula, but in fact "Bram Stoker's Dracula". Except on the French market, where it is.
You may use the year then, but in 1931, two Dracula movies came out.
So what about using the exact release date?
But in which country?
If you want to manipulate information about some kind of entity, you need a unique way to identify it, so that you can retrieve this one, exactly, not anything else. In relational databases, we use table primary keys for that, Python objects have internal IDs, StackOverflow URLs have unique numbers (https://stackoverflow.com/questions/1155008/how-unique-is-uuid and https://stackoverflow.com/questions/1155008 point to the same question)...
Using some properties of the object we want to talk about to define it uniquely is very hard and error-prone. This is what we tried to do with the Dracula movie. This is why the Internet Movie Database has unique IDs for each movie that are completely arbitrary. E.g., 0112896 is the unique ID of "Dracula: Dead and Loving It".
You can be tempted to use someone's name, but names have duplicates. Someone's phone number, but phone numbers can be reassigned. And so on, and so on.
So it's good practice to attribute a meaningless internal identifier to entities you are manipulating. Making that identifier unique across your system, though, is a challenge in itself.
Some systems use a number that is based on the object creation date, with a very precise (to the nanosecond) resolution. You'll get duplicates if you create too many objects at the same time, though.
Some use an automatically incremented number with a lock, so that your new object always gets the highest previously recorded number plus one. But then you need a centralized system to hold the lock, and you can't create unique values from outside of it.
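To make the lock-based approach concrete, here is a minimal in-process sketch. The class name IncrementalIdGenerator is made up for this illustration; a real database does the equivalent with its own sequence machinery. It shows why every caller has to go through the one object holding the lock.

```python
import threading


class IncrementalIdGenerator:
    """Hypothetical example: hands out 1, 2, 3... safely across threads."""

    def __init__(self, start=1):
        self._lock = threading.Lock()
        self._next_id = start

    def new_id(self):
        # Only one thread at a time may read and bump the counter,
        # otherwise two callers could receive the same number.
        with self._lock:
            value = self._next_id
            self._next_id += 1
            return value


generator = IncrementalIdGenerator()
ids = [generator.new_id() for _ in range(5)]
print(ids)
```

Anything that cannot reach this single generator (another process, another machine) has no safe way to pick the next number, which is the limitation described above.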
Another solution to this problem is to use a universally unique identifier, or UUID, a number that is generated in a way that has strong guarantees of being unique by mathematical magic.
Let's see what they are, and the pros and cons of using them.
An interesting idea
A typical UUID looks like this:
c5cba9cb-7109-40f8-96d2-346dc83f3a1f
This is just a notation, mind you. The dashes and hexadecimal base are for readability. You could represent the very same UUID with the number in base 10 and no delimiter:
262915395277433997757907625067598002719
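If you want to check this yourself, the notations can be converted into each other with Python's standard uuid module. This is only a small sketch; the variable names are arbitrary.

```python
import uuid

# Parse the usual dashed, hexadecimal notation.
u = uuid.UUID("c5cba9cb-7109-40f8-96d2-346dc83f3a1f")

as_int = u.int  # the same value as a plain base 10 integer
as_hex = u.hex  # 32 hexadecimal digits, no dashes

# Rebuild the UUID from the integer alone: nothing is lost on the way.
same_uuid = uuid.UUID(int=as_int)

print(as_int)
print(as_hex)
print(same_uuid == u)
```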
UUIDs are, indeed, just numbers generated with standard algorithms that have the following characteristics:
They are always the same length: 32 digits in hexadecimal (128 bits).
They have a version that tells you how they were generated.
They can be divided into 5 standardized sections, some of them containing information, such as the version: "c5cba9cb-7109-40f8-96d2-346dc83f3a1f" is version 4 because 40f8 starts with 4.
So far we have 8 versions, each of them generating the number in a different manner:
Version 1: Uses timestamp + MAC address, making them unique across machines. The timestamp bits are not stored in chronological order, though, so they don't sort well by creation time.
Version 2: Like v1, but embeds POSIX UID/GID instead of some timestamp bits.
Version 3: Deterministically generated from a namespace UUID and a name, using MD5 hashing.
Version 4: Fully random or pseudo-random 122-bit value, maximizing unpredictability. The most used version in the world.
Version 5: Like v3 but uses SHA-1 for stronger hashing and reduced collision risk.
Version 6: Like v1 with timestamp-first in the layout, for better database index locality.
Version 7: Time-ordered UUID using Unix time (milliseconds) + random bits for easy chronological sorting. This is the new hotness.
Version 8: Reserved format allowing custom bit layouts while still being UUID-compliant. Basically, make your own UUID.
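Here is a short sketch of how versions show up in practice, using only functions from Python's standard uuid module. The name "example.com" is an arbitrary placeholder chosen for the illustration.

```python
import uuid

# The version is stored inside the UUID itself and can be read back.
parsed = uuid.UUID("c5cba9cb-7109-40f8-96d2-346dc83f3a1f")
print(parsed.version)  # 4

# Version 4: random, so two calls give two different values.
random_id = uuid.uuid4()

# Versions 3 and 5: derived from a namespace and a name,
# so the same inputs always give the same UUID.
md5_based = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")
sha1_based = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

for value in (random_id, md5_based, sha1_based):
    print(value, "-> version", value.version)
```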
They are everywhere. Most rich systems can generate a UUID.
Python:
>>> import uuid
>>> str(uuid.uuid4())
'56b20fff-b22e-4117-9288-b9da686d299b'
PowerShell:
PS C:\> [guid]::NewGuid()
Guid
----
be3a122f-9490-4fed-8122-9fe793ac7ddf
Bash:
$ uuidgen
9dbb3e02-7743-4d99-8840-d37cfa741c44
And any good RDBMS has support for both generating them and dedicated field types to store them efficiently nowadays.
The most important part is, of course, that UUID algorithms are very good at avoiding duplicates.
E.g., with UUID4, which is the one with the most randomness, you would statistically need to generate 1 billion UUIDs per second for about 86 years to reach a 50% chance of having a single collision among all of them.
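That figure can be reproduced with the usual birthday-problem approximation, assuming the 122 random bits of a version 4 UUID are uniformly distributed. The function name below is made up for this sketch.

```python
import math

RANDOM_BITS = 122                 # random bits in a version 4 UUID
RATE_PER_SECOND = 1_000_000_000   # 1 billion UUIDs generated per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def uuids_needed_for_collision(probability, bits=RANDOM_BITS):
    """Approximate how many random values are needed before the chance
    of at least one duplicate reaches `probability` (birthday bound)."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - probability)))


count = uuids_needed_for_collision(0.5)
years = count / RATE_PER_SECOND / SECONDS_PER_YEAR
print(f"{count:.3e} UUIDs, about {years:.0f} years at 1 billion per second")
```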
And finally, all UUID versions have the same compatible format, meaning that if all you need is an opaque unique handle (you don't need the metadata), all versions can be used interchangeably.
Which one should I use, and for what?
UUID4 is the one that is used the most for a reason: it's the one that requires the least information to be generated, and exposes, therefore, the least metadata to the rest of the world.
It's also very standard and very well supported.
It's useful if you want to generate a unique name for a temp directory, a primary key in a database, the name of a profile in a config file (Firefox does that)... Pretty much everything where you need something to be unique with no meaning.
By its nature, it allows you to generate UUID4 from all over the planet with no additional information and no coordination, yet they will be unlikely to collide. No need for a central lock. No need to share data between sources of IDs.
In a database with auto-incremented primary keys, a user, a product, and a permission row will end up with the same ID, and only the table they live in makes them distinct. But with UUIDs as primary keys, if you see an ID, only one object has it, no matter its table.
In tests or exports, there is less risk of conflicts as well, since multiple runs will not end up with overlapping IDs.
Of course, UUIDs have drawbacks. They eat up more space. They take longer to compare. They really suck when you have to dictate one over the phone.
But the worst problem of using UUID version 4 in the particular case of DB primary keys is that it is random, so the values have no meaningful order and each new key lands at an arbitrary place in the index. This makes the index less efficient than the alternatives.
Indeed, with an incremented ID, row 97898 is always written right after row 97897, at the end of the index. With something random, the next row could go anywhere.
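A rough way to see the difference is to keep a sorted list as a stand-in for a database index and count how often a new key simply goes at the end. This is a toy model, not how any particular database is implemented, and count_appends is a name invented for the sketch.

```python
import bisect
import uuid


def count_appends(keys):
    """Insert keys one by one into a sorted list (a toy 'index') and
    count how many of them ended up being appended at the very end."""
    index = []
    appended_at_end = 0
    for key in keys:
        position = bisect.bisect_left(index, key)
        if position == len(index):
            appended_at_end += 1
        index.insert(position, key)
    return appended_at_end


sequential_keys = list(range(1000))
random_keys = [uuid.uuid4().int for _ in range(1000)]

print("incremental IDs appended at the end:", count_appends(sequential_keys))
print("UUID4 keys appended at the end:", count_appends(random_keys))
```

Incremental keys always go at the end, while random keys almost never do, so the structure keeps being modified in the middle.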
For this reason, some systems have been using less standard alternatives, such as ULID, to mitigate this problem.
This is why the upcoming UUID 7 is exciting: it contains both a lot of randomness and a timestamp, but nothing else (unlike UUID 1 and derivatives). That makes it easy to sort. And since all UUID versions have the same format, UUID 7 will be a drop-in replacement for UUID4.
Of course, it does leak the date and time of creation of the database entry to the outside world if you use it publicly (which most systems do), and now you understand why there is such debate taking place.
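Both properties, sortability and the leaked timestamp, follow from the layout. Below is a hand-rolled sketch of that layout as I understand it from the specification: 48 bits of Unix time in milliseconds, then the version, then random bits. The functions uuid7_sketch and creation_time_ms are hypothetical names for this illustration; where your language or database ships a built-in generator, prefer it.

```python
import os
import time
import uuid


def uuid7_sketch(timestamp_ms=None):
    """Illustrative only: build a version 7 style UUID by hand."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    random_bits = int.from_bytes(os.urandom(10), "big")
    rand_a = (random_bits >> 62) & 0xFFF      # 12 random bits
    rand_b = random_bits & ((1 << 62) - 1)    # 62 random bits

    value = (timestamp_ms & ((1 << 48) - 1)) << 80  # timestamp comes first
    value |= 0x7 << 76                               # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                              # standard variant
    value |= rand_b
    return uuid.UUID(int=value)


def creation_time_ms(u):
    """Anyone holding the ID can read the timestamp back."""
    return u.int >> 80


older = uuid7_sketch(1_700_000_000_000)
newer = uuid7_sketch(1_700_000_000_001)
print(older, newer)
print(older < newer)            # sorts by creation time
print(creation_time_ms(older))  # the embedded timestamp
```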
Personally, I've never been in a situation where using a UUID 4 as a unique identifier had more cons than pros. I eagerly wait for UUID 7 to remove one more annoyance, but it's a nice-to-have, not a requirement.
Yet I will ask myself if the creation date is information I wish to hide or not. In 20 years, I haven't worked on a project where that would have been the case. But you never know.
And you can see all V4 ones (and even search) there: https://everyuuid.com/