I just read a post on ways to scale your database that hit home with me — the author suggests the use of UUIDs (similar to GUIDs) as the primary key (PK) of database tables.
Reasons UUIDs are Good
There are several reasons using a UUID as a PK would be great compared to auto-incrementing integers:
- At scale, when you have multiple databases containing a segment (shard) of your data, for example a set of customers, using a UUID means that one ID is unique across all databases, not just the one you’re in now. This makes moving data across databases safe. Or in my case where all of our database shards are merged onto our Hadoop cluster as one, no key conflicts.
- You can know your PK before insertion, which avoids a round trip DB hit, and simplifies transactional logic in which you need to know the PK before inserting child records using that key as it’s foreign key (FK)
- UUIDs do not reveal information about your data, so would be safer to use in a URL, for example. If I am customer 12345678, it’s easy to guess that there are customers 12345677 and 1234569, and this makes for an attack vector. (But see below for a better alternative).
Reasons UUIDs May Not be Good
Don’t be naive
A naive use of a UUID, which might look like
70E2E8DE-500E-4630-B3CB-166131D35C21, would be to treat as a string, e.g.
varchar(36) — don’t do that!!
“Oh, pshaw”, you say, “no one would ever do such a thing.”
Think twice — in two cases of very large databases I have inherited at relatively large companies, this was exactly the implementation. Aside from the 9x cost in size, strings don’t sort as fast as numbers because they rely on collation rules.
Things got really bad in one company where they had decided to use Latin-1 character set. When we converted to UTF-8 several of the compound-key indexes were not big enough to contain the larger strings. Doh!
UUIDs are a pain
Don’t underestimate how annoying it is to have to deal with values that are too big to remember or verbalize.
Planning for real scaling
If our goal is to scale, and I mean really scale let’s first acknowledge that an
int is not big enough in many cases, maxing out at around 2 billion, which needs 4 bytes. We have way more than 2 billion transactions in each of several databases.
bigint is needed in some cases and that uses 8 bytes. Meanwhile, using one of several strategies, databases like PostgreSQL and SQL Server have a native type that is stored in 16 bytes.
So who cares if it’s twice as large as
bigint or four times bigger than
int? It’s just a few bytes, right?
Primary keys get around in normalized databases
If you have a well normalized database, as we do at my current company, each use of the key as an FK starts adding up.
Not just on disk but during joins and sorts these keys need to live in memory. Memory is getting cheaper, but whether disk or RAM, it’s limited. And neither is free.
Our database has plenty of intermediate tables that are mainly containers for the foreign keys of others, especially in 1-to-many relations. Accounts have multiple card numbers, addresses, phone numbers, usernames, and all that. For each of these columns in a set of table with billions of accounts, the extra size of foreign keys adds up fast.
It’s really hard to sort random numbers
Another problem is fragmentation — because UUIDs are random, they have no natural ordering so cannot be used for clustering. This is why SQL Server has implemented a
newsequentialid() function that is suitable for use in clustered indexes, and is probably the right implementation for all UUID PKs. It is probable that there are similar solutions for other databases, certainly PostgreSQL, MySQL and likely the rest.
Primary keys should never be exposed, even UUIDs
A primary key is, by definition unique within its scope. It is, therefore, an obvious thing to use as a customer number, or in a URL to identify a unique page or row.
I would argue that using a PK in any public context is a bad idea.
The original issue with simple auto-incrementing values is that they are easily guessable as I noted above. Botnets will just keep guessing until they find one.
Arguably it would be a fool’s errand to try to guess a UUID, however Microsoft warns against using
newsequentialid() because by mitigating the clustering issue, it makes the key more guessable.
My keys will never change (until they do)
But there’s a far more compelling reason not to use any kind of PK in a public context: if you ever need to change keys, all your external references are broken.
When would you need to change keys? As it happens, we’re doing a data migration this week, because who knew in 2003 when the company started that we would now have 13 massive SQL Server databases and growing fast?
Never say never. I have been there and done that, and it happens. It’s easy to manage up front. It’s way harder to fix when you’re counting things in the trillions.
Indeed, our context is a perfect example of why UUIDs are needed, and why they are costly, and why exposing primary keys is an issue.
My internal system is external
I manage the Hadoop infrastructure that receives data nightly from all of our databases. The Hadoop system is linked, which is fine — we’re in the same company.
So, to disambiguate non-unique sequence keys from our multiple databases, we generate a pseudo-primary-key by concatenating two values, the id (PK) of the customer which is unique across databases, plus the sequence id of the table rows themselves.
In so doing we have created a tight, and effectively permanent binding between years of historical customer data. If those primary keys in the RDBMS change, ours will need to also, or we’ll have some horrifying before-and-after scenario.
Best of Both: Integers Internal, UUIDs External
A solution used in several different contexts that has worked for me is, in short, to use both.
Internally, let the database manage data relationships with small, efficient, numeric sequential keys, whether
Then add a column populated with a UUID (perhaps as a trigger on insert). Within the scope of the database itself, relationships can be managed using the real PKs and FKs.
But when a reference to the data needs to be exposed to the outside world, even when “outside” means another internal system, they must rely only on the UUID.
This way, if you ever do have to change your internal primary keys, you can be sure it’s scoped only to one database.
We used this strategy at a different company for customer data, just to avoid the “guessable” problem.
In another case, we would generate a “slug” of text (e.g. in blog posts like this one) that would make the URL a little more human friendly. If we had a duplicate, we would just append a counter.
Even as a “secondary primary key”, using a naive use of UUIDs in string form is wrong: use the built-in database mechanisms.
Use integers because they are efficient. Use the database implementation of UUIDs in addition for any external reference.
References and many thanks
Thanks to Ruby Weekly (which I still read, wistfully although Scala is growing on me), Starr Horne’s great blog from Honeybadger.io on this topic, the always funny and smart post on Coding Horror by Jeff Atwood, co-founder of Stack Overflow, and naturally a fine question on one of Stackoverflow’s sites at dba.stackexchange.com. Also a nice post from MySqlserverTeam, another from theBuild.com and of course MSDN which I linked earlier.
Meta: Why I Blog
I learned a lot writing about this.
I started out reading email on a Sunday afternoon.
Then came across an interesting post by Starr, which got me thinking his advice might have unintended outcomes. So I googled and learned way more about UUIDs than I knew before, and changed my fundamental understanding and disposition about how and when to use them.
Halfway through writing this, I sent email to the team leads at my company wondering if we had considered one of the topics I discussed. Hopefully we’re ok, but I think we may have avoided at least one unexpected surprise in code scheduled for release this week.
Note that all of these are entirely selfish reasons 🙂
Hope you like it, too!