Generating sequential GUIDs in .NET which sort correctly in SQL Server

Using a non-sequential GUID in either a clustered or non-clustered index is not ideal from a performance point of view. Because non-sequential GUIDs are not ever-increasing (like an identity int), new values get inserted into the middle of the index rather than appended at the end, resulting in increased logical fragmentation and decreased performance.

If you really can’t switch from a GUID to an identity int (to save space and reduce fragmentation) and you must use a GUID in an index, you can mitigate the performance problems by ensuring it is sequential and ever-increasing. SQL Server allows you to do this by setting a default constraint on uniqueidentifier columns which calls the NEWSEQUENTIALID() function. This function creates a GUID that is greater than any GUID previously generated by it on a specified computer since Windows was last started. The problem with this approach, however, is that the GUID is created on the DB side, which is no use if you need to generate it on the client and pass it in.
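For illustration, a minimal sketch of such a default constraint (the dbo.Orders table and constraint names are hypothetical):

```sql
-- Assumed example table; NEWSEQUENTIALID() can only be used in a default constraint
CREATE TABLE dbo.Orders
(
    Id uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_Id DEFAULT NEWSEQUENTIALID(),
    CreatedOn datetime2 NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (Id)
);
```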

There are a number of options for generating sequential GUIDs in C# which are compatible with SQL Server’s GUID/UUID sorting mechanism; I’ll cover three of them in this post.

Options for generating SQL Server compatible sequential GUIDs

  1. SequentialGuidValueGenerator, which is part of Entity Framework Core.
    Note at the time of writing the documentation for this is incorrect: it does not generate sequential GUIDs using the same algorithm as NEWSEQUENTIALID(). It does, however, generate sequential GUIDs which are sortable with respect to SQL Server’s GUID sorting approach. Looking at the source we can see the method calls the regular Guid.NewGuid() and performs byte shuffling to make the result optimised for SQL Server. Usage is simple (see the first sketch after this list) and of course you don’t need to actually be using EF to reference the assembly and just use the generator. The GUIDs it generates sort correctly in SQL Server.
  2.  UuidCreateSequential with byte shuffling applied.
    NEWSEQUENTIALID() is a wrapper over the Windows UuidCreateSequential function with some byte shuffling applied. We can therefore call this function by importing the relevant DLL and rearranging the bytes in the same way SQL Server does to get sequential GUIDs in C# (see the second sketch after this list). The link above from StackOverflow has all the implementation details. Looking at 20 GUIDs created with this approach, we can see they are much more uniform than those produced by the EF Core generator covered above.
    Depending on your scenario, a number of problems with this approach may limit its feasibility:
    1 – Requires DllImport, which may not be allowed/desired and might have permission issues.
    2 – Not cross-platform; this is a Windows DLL, so if you’re deploying your .NET Core app to Linux this isn’t a runner.
    3 – If your Windows server restarts, your GUIDs may start from a lower range, thus causing index fragmentation.
    4 – Can’t be used in a clustered environment where multiple machines write to the same DB, as the GUIDs generated will be out of sync with each other, thus causing index fragmentation.
  3.  COMB GUID.
    The COMB GUID approach was first described by Jimmy Nilsson and involves replacing the portion of a normal GUID that is sorted first with a date/time value. This guarantees (within the precision of the system clock) that values will be sequential, even when the code runs on different machines. There are lots of examples of COMB GUIDs in C# online, but I like RT.Comb, which is a library available on NuGet. Below you can see how it’s used in its basic configuration, and at first glance the GUIDs it creates appear to sort correctly in SQL Server…
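A minimal sketch of using the EF Core generator (option 1) standalone. Note the EntityEntry argument appears to be unused by this particular generator, so passing null works in practice, though that’s an implementation detail I’m relying on rather than a documented contract:

```csharp
using System;
using Microsoft.EntityFrameworkCore.ValueGeneration;

// Reference the Microsoft.EntityFrameworkCore NuGet package for this type.
var generator = new SequentialGuidValueGenerator();

for (var i = 0; i < 5; i++)
{
    // Next() normally receives the EntityEntry being tracked; this
    // generator ignores it, so null! is fine for standalone use.
    Guid id = generator.Next(null!);
    Console.WriteLine(id);
}
```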
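And a sketch of option 2, following the StackOverflow implementation mentioned above: P/Invoke UuidCreateSequential from rpcrt4.dll, then reorder the first eight bytes so the value sorts the same way NEWSEQUENTIALID() values do (worth verifying against the linked answer before relying on it):

```csharp
using System;
using System.Runtime.InteropServices;

public static class SequentialGuid
{
    private const int RpcSOk = 0;

    // Windows-only; the same native function NEWSEQUENTIALID() wraps.
    [DllImport("rpcrt4.dll", SetLastError = true)]
    private static extern int UuidCreateSequential(out Guid guid);

    public static Guid Next()
    {
        if (UuidCreateSequential(out var guid) != RpcSOk)
            throw new InvalidOperationException("UuidCreateSequential failed.");

        // .NET stores the first three GUID fields little-endian; flip them
        // to big-endian so the byte order matches what SQL Server sorts on.
        var s = guid.ToByteArray();
        var t = new byte[16];
        t[3] = s[0]; t[2] = s[1]; t[1] = s[2]; t[0] = s[3]; // Data1 reversed
        t[5] = s[4]; t[4] = s[5];                           // Data2 reversed
        t[7] = s[6]; t[6] = s[7];                           // Data3 reversed
        Array.Copy(s, 8, t, 8, 8);                          // Data4 unchanged
        return new Guid(t);
    }
}
```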
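Finally, the basic RT.Comb usage from option 3, using the Provider.Sql convenience instance the library ships with:

```csharp
using System;
using RT.Comb;

// Out-of-the-box SQL Server COMB provider from the RT.Comb NuGet package.
for (var i = 0; i < 5; i++)
{
    Guid comb = Provider.Sql.Create();
    Console.WriteLine(comb);
}
```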

But wait… actually they are not sorted correctly. This is because the timestamp comes from DateTime.UtcNow but is stored at SQL Server datetime precision, which is only 1/300th of a second, meaning that if you quickly generate a lot of COMB GUIDs like the above there is a chance you’ll create two with the same timestamp value. This won’t result in collisions, as the non-timestamp bits take care of that, but it does mean COMBs with the same timestamp are not guaranteed to sort correctly for insertion into the DB.

Therefore, if you’re inserting records faster than that precision allows (very possible), these inserts will cause index fragmentation. Avoiding index fragmentation is the whole purpose of using sequential GUIDs in the first place, so we need a solution to this if COMBs are to be viable compared to the other approaches.
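A quick way to see the problem for yourself is to generate a burst of COMBs and compare generation order with SQL Server sort order; SqlGuid applies SQL Server’s GUID ordering in .NET. This checker is my own sketch, not part of RT.Comb:

```csharp
using System;
using System.Data.SqlTypes;
using System.Linq;
using RT.Comb;

// Generate a burst of COMBs faster than the timestamp precision.
var combs = Enumerable.Range(0, 10_000)
    .Select(_ => Provider.Sql.Create())
    .ToList();

// SqlGuid.CompareTo sorts the way SQL Server sorts uniqueidentifiers.
var sorted = combs.OrderBy(g => new SqlGuid(g)).ToList();

Console.WriteLine(combs.SequenceEqual(sorted)
    ? "Generation order matches SQL Server sort order"
    : "Out-of-order COMBs detected");
```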

Thankfully a solution is provided by RT.Comb itself, and it’s part of the reason I like to use this library rather than just some of the COMB code available on StackOverflow or elsewhere online. RT.Comb has a timestamp provider called UtcNoRepeatTimestampProvider which ensures that each timestamp is at least X ms (4 ms is the default, but this can be changed) greater than the previous one, incrementing it by X ms if it is not. The library documentation gives the following table of what UtcNoRepeatTimestampProvider will return compared to the raw DateTime.UtcNow value at the same moment.

DateTime.UtcNow    UtcNoRepeatTimestampProvider
02:08:50.613       02:08:50.613
02:08:50.613       02:08:50.617
02:08:50.613       02:08:50.621
02:08:50.617       02:08:50.625
02:08:50.617       02:08:50.629
02:08:50.617       02:08:50.632

Using UtcNoRepeatTimestampProvider is simple and results in GUIDs which are correctly sorted in SQL Server as shown below.
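A sketch of wiring that up, following the pattern in the RT.Comb documentation (the constructor parameters reflect my reading of the library and are worth double-checking against the version you install):

```csharp
using System;
using RT.Comb;

// SQL Server COMB provider whose timestamps never repeat: each one is
// at least the configured increment (default 4 ms) after the previous.
var provider = new SqlCombProvider(
    new SqlDateTimeStrategy(),
    customTimestampProvider: new UtcNoRepeatTimestampProvider().GetTimestamp);

for (var i = 0; i < 5; i++)
{
    Console.WriteLine(provider.Create());
}
```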


Which Sequential GUID approach to use?

Of course there are many other solutions on the web which you could choose from. All three discussed above are super fast; on my machine 100K iterations took at most 101 ms, so speed wouldn’t influence me either way. Given the potential problems outlined above with the DLL import approach, I’d avoid it unless there were no viable alternatives, and there are.
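For reference, a minimal timing loop along these lines is enough to sanity-check that kind of number yourself (a sketch, not the exact benchmark; results will vary by machine):

```csharp
using System;
using System.Diagnostics;
using RT.Comb;

var sw = Stopwatch.StartNew();
for (var i = 0; i < 100_000; i++)
{
    Provider.Sql.Create(); // swap in whichever generator you want to measure
}
sw.Stop();
Console.WriteLine($"100K GUIDs in {sw.ElapsedMilliseconds} ms");
```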

There’s not much to choose between the other two. If you’re using EF Core to update your DB then you’ll have SequentialGuidValueGenerator available anyway, so I’d just go with what you have out of the box rather than integrating RT.Comb, for example. COMBs are ideal if you want to be able to extract the timestamp out of the GUID for things like debugging, though that perhaps has limited real-world benefit. There are many COMB libraries available online, but if you happen to use both SQL Server and PostgreSQL, RT.Comb is ideal as it supports both.

Note sequential GUIDs are by their nature guessable, so don’t use them in a security-sensitive context.

 

Introduction to database indexes

If your database queries are taking an age and a half to execute, you could do worse than investigate adding some indexes to the columns involved.

You’ll no doubt be familiar with the index at the back of a book, which lists concepts with the page numbers they appear on. Database indexes are used in much the same way as book indexes: both essentially provide a quick way to look up the location of the information you want to find. Each index is a form of categorisation of the overall data. A book usually has only one index, as that’s all it needs, and its categorisation is usually concepts related to the book, sorted alphabetically.

The advantage of indexes is increased search performance

Take, for example, the index at the back of Succeeding with Agile: Software Development Using Scrum. Imagine you’re interested in finding mentions of Kent Beck (or concepts related to him) in that book. By looking in the index you can quickly see that he is mentioned on pages 58, 288 and 289. If there was no index, or the information you were looking for wasn’t in it, you would have to go through every page looking for it. The book has over 450 pages, so that’s not going to be a quick task. What’s worse, the information may not even be in the book at all.

The same idea applies in databases. Imagine you’re Eason’s, Barnes & Noble or Amazon and you have an authors database table with 100,000 records in it. The table has a load of columns, but the one we care about is called lastName. If you queried the table with something like select * from authors where lastName = ‘beck’ and there was an index on the lastName column, the database engine would use a logical categorisation of all the data in the table (each record categorised by lastName) to quickly find where the matching records are physically located on the disk, as sketched below.
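For illustration, creating such an index might look like this (the index name is just an example):

```sql
-- Non-clustered index on the column we search by
CREATE NONCLUSTERED INDEX IX_authors_lastName
    ON dbo.authors (lastName);

-- With the index in place, this query can use an index seek
SELECT * FROM dbo.authors WHERE lastName = 'beck';
```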

The specifics are for another day, but it most likely does this using special structures known as B-trees, in which the area of the disk where the record(s) in question can be physically located gets smaller and smaller the further the engine descends into the tree. By continuously moving in the general direction of the physical records via the B-tree, the DB engine can find them very quickly.

If you had the same search scenario without an index on lastName, the database engine would have to look at every record to see if it matched where lastName = ‘beck’. That’s 100,000 records to examine, and a lot of IO and CPU time.

In database terms, looking through the whole table when no relevant index is present is called a table scan, whereas using an index is referred to as an index seek. There is also such a thing as an index scan, which generally speaking is where an index does exist but the database engine deems it quicker to just look at the records one by one anyway. It might do this because the cost of traversing the index structure (e.g. the B-tree, which is data stored on disk and requires IO and CPU time just like table records) would be more expensive than simply going through the data pages (in which the records are stored) one by one. This can happen where there is an index but the table is small and/or the requested records make up a large percentage of the total records in the table.

Generally speaking, seeking is good and scanning is bad. Adding an index will cause an index seek to be used whenever that is less expensive than a scan, and in most instances it is, so indexes can really improve the performance of your select queries. The improvement can sometimes be dramatic, perhaps cutting 99% of the execution time off a particular query on a large table. Of course “there’s no such thing as a free lunch” and indexes do have their disadvantages.

Disadvantages of indexes are slower updates and more disk space required


Yes, adding indexes to your database tables does come at a cost, though it’s very likely a cost you’re willing to pay. Let’s go back to the book index example above to explain what that cost is.

We saw that information about Kent Beck is located on pages 58, 288 and 289. Great.

What if the author of the book, as part of the update for the next edition, inserted new text (about a whole page) at page 28, causing almost all subsequent content to shift to the next page? The author has done an ‘insert’ and the index is now out of date: the information about Kent Beck is no longer on pages 58, 288 and 289 but on pages 59, 289 and 290. Of course every index entry referring to page 28 or later is out of date, not just the Kent Beck related ones, so the index needs to be updated to reflect the change. This may not be a problem for a book, as new editions don’t come out that often. For a database, however, which might see millions of daily inserts, deletes or updates directly affecting indexed columns, the IO and CPU cost of constantly reshuffling the indexes on disk so they stay correct can be enormous. Thankfully, in the vast majority of cases, systems read from databases much more than they write to them.

Another disadvantage of adding indexes is the disk space they require. The book index I quoted from is actually 11 pages long, and all those extra characters and pages no doubt cost the publisher extra to print. Database indexes require extra space too; the specific amount depends on the size of the table and the number of columns in the index.

Conclusion

Indexes are perhaps the single best way to increase the performance of your database queries. I used the analogy of a book index to explain what they are and how, unfortunately, they don’t come without a price (a fair price, I think), but if you’ve any questions let me know in the comments. I find the topic of indexes fascinating (so expect more posts about them), and a big part of me wishes I could transport myself back to my college data algorithms class to actually understand what the lecturer was going on about when he talked about things like indexing algorithms and B-trees. I really did think I would never use ‘that stuff’ again :-).