Call us Today! 201-728-8809 | | Contact | About |
   


GUIDs GUIDs everywhere, but how is my data unique?

know your data my friend....

After having seen many systems that use GUIDs as the primary key for every table, it is time to say enough is enough!

August 2013. Volume 2 - Issue 1
By Tom Romeo
Revised November 2013.

GUIDs GUIDs everywhere, but how is my data unique?

Featured on SQL Server Central

After having seen many systems that use GUIDs as the primary key for every table, it is time to say enough is enough!

First and foremost I must state most emphatically that there is nothing wrong with using GUIDs. Like everything else in life, moderation and discretion are important.

Here are some advantages for using GUIDs as primary keys:
  • In theory it is unique across every table, every database, every server
  • Allows easy merging of records from different databases. Of course every database merge is followed by a database scrub as the potential for duplicate data is introduced. (Actually this is not an advantage after all).
    • Example, a customer or patient record exists in two different databases each having a different GUID as the unique ID, when merged; there will be two rows with the same data.
    • A scalar function with a where clause on anything but the GUID could fail if it returns more than one row
  • You can generate GUIDs from any source rather than requiring a round trip to the database, though it is probably best to not generate GUIDs using jQuery at the browser level of your application. The GUID should be generated at the data tier or in the database itself.
  • Many replication scenarios require GUID columns
So why would one ever not use a GUID as the primary key?
  • If the data contains a natural unique identifier then it makes no sense to add an artificial unique identifier.
  • If the primary key is clustered then the GUID will force the internal trees to be out of balance with respect to the natural order of the data.
  • If the primary key is clustered and there are frequent inserts then frequent page splits could hinder performance.1
  • Similarly if the primary key is clustered and the fill factor of the primary key is adjusted to compensate for frequent inserts, then the primary key may require excessive disk space and may become fragmented more frequently.1
  • If the unique identifier is to be rendered as a barcode, then a GUID may be inefficient or force the application to use a two dimensional barcode symbology. This also applies when using pre-printed barcode labels or RFID tags such that the barcode or RFID is unique. There is no need to assign an artificial unique identifier for something that is already unique.
  • The human readable aspect of the unique identifier is significantly more difficult. For example on an asset tag.
  • If the GUID requires more space than the actual data then there may be a better way to make the data unique. For example as used in a look-up table for departments or status codes.
  • It is significantly harder to visually see errors in the data when GUIDs are present.
Essentially, GUIDs have their place. Unfortunately they are ubiquitous mostly due to laziness. It's ok to make mistakes out of ignorance as long as you fix them when you learn the error of your ways, but making mistakes due to laziness is inexcusable. Since many programmers and developers maintain the credo of "I'm lazy therefore I am", it is often too late to fix problems on this level when they manifest.

The concept of using record identifiers or ID(s) dates back to the time when storage was very expensive and the id consisted of a one, two, three, or four byte numeric code that may also have been stored as packed decimal. Many old flat file or ISAM file systems used this method. The whole Y2K effort was all about converting systems from a two digit year to a four digit year.

In the database world, be it hierarchical, networked, relational, or other, using an ID to represent record relationships was and still is very efficient. Some systems actually benefited from forcing the ID column to be a binary multiple as two, four, eight, 16, or 32 bytes.

Given a table that is used to store the status codes {'New', 'Used', 'Retired', 'Unavailable', 'Out of Service'}, One could construct a simple table like the following:

CREATE TABLE [dbo].[tblStatusCodes](
[StatusCodeID] [smallint] NOT NULL,
[StatusCodeName] [nvarchar](48) NOT NULL,
CONSTRAINT [PK_tblStatusCodes] PRIMARY KEY CLUSTERED
(
[StatusCodeID] ASC
)


Such that the data would look like this:

StatusCodeID StatusCodeName
1 New
2 Used
3 Retired
4 Unavailable
5 Out of Service


Advantages:
  1. The status code name can change in one place and be referenced
  2. The status code ID requires less space than the actual status code name
  3. Referential integrity is simple using foreign keys
Using a GUID rather than a small integer in this case seems ridiculous as the GUID requires twice as much space as the longest status code name.

Given the sample table:

CREATE TABLE [dbo].[tblTraverseServices](
[Server] [nvarchar](256) NOT NULL,
[Service] [nvarchar](64) NOT NULL,
[Description] [nvarchar](128) NOT NULL,
[ConnectionData] [nvarchar](256) NOT NULL,
[Port] [bigint] NOT NULL,
[Enabled] [bit] NOT NULL,
[LastPingTime] [datetime] NOT NULL,
[Mod_User] [nvarchar](32) NOT NULL,
[Mod_Time] [datetime] NOT NULL,
CONSTRAINT[PK_tblTraverseServices] PRIMARY KEY CLUSTERED
(
[Server] ASC,
[Service] ASC
)


The primary key is the combination of the server name and service name. There is no GUID to uniquely identify the server because server names must be unique. Allowing 256 bytes gives ample room to store the Fully Qualified Domain Name or (FQDN). As of this writing the max length for a FQDN is 255 bytes.

While the Server and ConnectionData fields may seem redundant, the ConnectionData field may contain an IPV4 or IPV6 address whereas the Server field must contain the server name as the IP address could change, the server name is less likely to change. The data might look like this (omitting the Mod_User and Mod_Time):

Server Service Description ConnectionData Port Enabled LastPingTime
PCSVSERVER060 TraverseArchiveService PCS Traverse Archive Service PCSVSERVER060 65101 1 2013-08-02 10:09:25.493
PCSVSERVER060 TraverseDBMaintenanceService PCS Traverse Database Service 192.168.1.35 65102 0 1990-01-01
PCSVSERVER060 TraverseInterfaceService PCS Traverse Interface Service PCSVSERVER060 65103 1 2013-08-02 10:10:21.330
PCSVSERVER060 TraverseReportService PCS Traverse Report Service PCSVSERVER060 65104 1 2013-08-02 10:07:16.767


Most importantly, by not having a GUID as the primary key, it is not possible to duplicate data. If a GUID were to be used as the primary key, the potential to add multiple rows with Server = "PCSVSERVER060" and Service = "TraverseArchiveService" exits, as an example.

Other places where introducing a GUID might be a bad idea are:
  • Importing data from other systems like SAP. SAP has likely already uniquely identified whatever is being exported from SAP. There is no need to tag it with a new GUID.
  • Importing data from scheduling providers like Microsoft Exchange or IBM Domino. If there is a need to store meetings in a database, you can be assured that all such systems uniquely identify their appointments.
  • If the belief that using a single field to uniquely identify it improves join performance. (we can dedicate an entire article to debunk this myth)
So where would we use a GUID? There are many good ways to use GUIDs such as a transaction id?

In this table the transaction id is a GUID, and while it is not the primary key, it is part of the primary key as a way to group all of the items that were affected within a single transaction

CREATE TABLE [dbo].[tblAudit](
[TransactionID] [varchar](36) NOT NULL,
[TransactionDate] [datetime] NOT NULL,
[AuditType] [nvarchar](16) NOT NULL,
[AuditSubType] [nvarchar](32) NULL,
[TableName] [varchar](50) NOT NULL,
[FieldName] [varchar](50) NOT NULL,
[OldValue] [nvarchar](250) NULL,
[NewValue] [nvarchar](250) NULL,
[TransStatus] [nchar](8) NULL,
[Mod_User] [nvarchar](32) NOT NULL,
[Mod_Time] [datetime] NOT NULL,
CONSTRAINT [PK_tblAudit] PRIMARY KEY CLUSTERED
(
[TransactionID] ASC,
[AuditType] ASC,
[TableName] ASC,
[FieldName] ASC
)


Our Traverse product uses such a table to track all inserts, updates, and deletes so users can easily run reports on all changes. We didn't need to use a GUID for this field, but it was a natural fit because the system could be distributed having multiple points of origin. Using a GUID was an easy guarantee.

Many systems use audit tables in many ways. Many times such tables do not require primary keys at all as the data may never be retrieved by that key. While a GUID may lend itself to that purpose, it may not be necessary. However, to defend the GUID imagine moving several million rows from one database to another when a network outage interrupts the process.

If each audit row is uniquely identified, then one could resume the process. Otherwise one would need to start over. Of course a GUID primary key it would present itself as a way to guarantee that the same row has not been written more than once as long as the GUID is not generated by an insert trigger..

SQL 2012 introduced a sequence object that may offer advantages over GUIDs in some cases.

1 Insert intensive applications can be rendered inoperable when using GUIDs as the clustered primary key or clustered index. Years ago I witnessed such a system, and at the time the only way to sustain the insertion of several million rows per hour was to drop the clustered primary key. It wasn't until later that the nature of what was happening became clear and why GUIDs and clustered indexes or clustered primary keys are diametrically opposed when great volume is concerned. SQL Server's default fill factor of zero was an unrecognized enemy, as the random nature of the GUID caused frequent page splits. Page splits are resource intensive as they move data to make room for new records. Hence the name clustered index as the records are physically stored in order. Lowering the fill factor could possibly decrease the frequency of page splits to a desirable level, but the cost of larger indexes may also cause problems, and, there is a higher risk of index fragmentation which would call for more frequent re-index operations. Even 15,000 RPM disks could not keep up with the activity until we removed the clustered primary key. While storage has become more affordable, high capacity and high performance storage still comes at a premium. Sequence numbers are far more efficient than GUIDS for this purpose, so we welcome the sequence object in SQL 2012. To spell out the irony of the whole exercise, the GUID actually did nothing to guarantee the uniqueness of each row but rather that each row was guaranteed a spot in the table regardless of whether the process was sending duplicate data.

I'll leave with this parting thought, before creating your next database, know your data my friend....

Click here to see other articles in the series.

Corporate Headquarters

Practical Compliance Solutions
One Orient Way - Suite F156 Rutherford, NJ 07070
Phone: 201-728-8809


40.827731 N 74.101029 W
Connect With Us


Share




© 2002 - 2017 - Copyright Practical Compliance Solutions - All rights reserved