An Introduction to Data Compression in SQL Server 2008

This is an excerpt from my free eBook, Brad’s Sure Guide to SQL Server 2008.

There is one thing every DBA knows with certainty: databases grow with time. MDFs grow, backups grow, and it never stops. The more data we have, the more work SQL Server has to perform to deal with it all, whether it’s executing a query on a table with 10 million rows or backing up a 5 TB database. Whether we like it or not, we are fighting a losing battle, and DBAs can’t reverse the information explosion. Or can we?

While we can’t stop growth, SQL Server 2008 (Enterprise Edition only) gives us some new tools to help us better deal with all this data, and that is the promise of compression. Given the right circumstances, DBAs can use data compression to reduce the size of our MDFs, and backup compression to reduce the amount of space our backups take. Not only does compression reduce physical file sizes, it also reduces disk I/O, which can greatly enhance the performance of many database applications, along with database backups.

When we discuss SQL Server compression, we need to think of it in two different ways. First, there is data compression, which includes row-level and page-level compression that occurs within the MDF files of our databases. Second, there is backup compression, which occurs only when data is backed up. While both of these are forms of compression, they are architected differently. Because of this, it is important to treat them separately.

Data Compression comes in two different forms:

Row-level Data Compression: Row-level data compression essentially turns fixed-length data types into variable-length data types, freeing up empty space. It can also skip storing zero and NULL values, saving additional space. In turn, more rows can fit into a single data page.
Page-level Data Compression: Page-level data compression starts with row-level data compression, then adds two additional compression features: prefix and dictionary compression. We will take a look at what this means a little later in this chapter. As you can imagine, page-level compression offers increased data compression over row-level compression alone.

Backup Compression comes in a single form:

Backup Compression: Backup compression does not use row-level or page-level data compression. Instead, backup compression occurs only at the time of a backup, and it uses its own proprietary compression technique. Backup compression can be used with or without data compression, although using backup compression on a database that is already compressed using data compression may not offer much additional benefit.

In the next section, we will take a high-level overview of data compression, and then we will drill down into the detail of the different types of compression available with SQL Server 2008.

Data Compression Overview

Data compression has been around for years. For example, who hasn’t zipped a file at some point in their career? While compression isn’t a new technology, it’s new to SQL Server. Unlike zip compression, SQL Server’s data compression does not automatically compress an entire database; instead, data compression can only be used for these database objects:

A table stored as a heap
A table stored as a clustered index
A non-clustered index
An indexed view
Partitioned tables and indexes

In other words, as the DBA, you must evaluate each of the above objects in your database, decide if you want to compress it, then decide whether you want to compress it using row-level or page-level compression. Once you have completed this evaluation, you must turn on compression for that object. There is no single switch you can flip that will turn compression on or off for all the objects listed above, although you could write a Transact-SQL script to accomplish this task.
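As a rough illustration of what such a script might contain, here is a minimal Python sketch that only assembles the Transact-SQL text; it does not connect to a server. The ALTER TABLE ... REBUILD and ALTER INDEX ... REBUILD statements with the DATA_COMPRESSION option are the standard SQL Server 2008 syntax, but the helper names, the object list, and the index name are assumptions made up for this example.

```python
# A sketch only: builds T-SQL text for review, it does not execute anything.
# The helper names and the object inventory below are hypothetical.

def compression_statement(kind, schema, table, index=None, mode="PAGE"):
    """Build the T-SQL that rebuilds one object with the given compression mode."""
    if mode not in ("NONE", "ROW", "PAGE"):
        raise ValueError("mode must be NONE, ROW, or PAGE")
    target = f"[{schema}].[{table}]"
    if kind == "table":
        # Rebuilds the heap or clustered index that holds the table's rows.
        return f"ALTER TABLE {target} REBUILD WITH (DATA_COMPRESSION = {mode});"
    if kind == "index":
        if not index:
            raise ValueError("an index name is required for kind='index'")
        return (f"ALTER INDEX [{index}] ON {target} "
                f"REBUILD WITH (DATA_COMPRESSION = {mode});")
    raise ValueError("kind must be 'table' or 'index'")


def build_script(objects):
    """Join one statement per object into a script that can be reviewed first."""
    return "\n".join(compression_statement(**obj) for obj in objects)


if __name__ == "__main__":
    # Illustrative inventory; replace with the objects you have evaluated.
    objects = [
        {"kind": "table", "schema": "Sales", "table": "Customer", "mode": "PAGE"},
        {"kind": "index", "schema": "Sales", "table": "Customer",
         "index": "IX_Example", "mode": "ROW"},
    ]
    print(build_script(objects))
```

Generating the statements rather than running them keeps the decision with the DBA: each object still gets its own compression mode, and the resulting script can be scheduled for a quiet period.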

Fortunately, other than turning compression on or off for the above objects, you don’t have to do anything else to enable compression. You don’t have to re-architect your database or your application, as data compression is entirely handled under the covers by the SQL Server Storage Engine. When data is passed to the Storage Engine, it is compressed and stored in the designated compressed format (on disk and in the Buffer Cache). When the Storage Engine passes the information to another component of SQL Server, then the Storage Engine has to uncompress it. In other words, every time data has to be passed to or from the Storage Engine, it has to be compressed or uncompressed. While this does take extra CPU overhead to accomplish, in many cases, the amount of disk I/O saved by compression more than makes up for the CPU costs, boosting the overall performance of SQL Server.

Here’s a simplified example. Let’s say that we want to update a row in a table, and that the row we want to update is currently stored on disk in a table that is using row-level data compression. When we execute the UPDATE statement, the Relational Engine (Query Processor) parses, compiles, and optimizes the UPDATE statement, ready to execute it. Before the statement can be executed, the Relational Engine needs the row of data that is currently stored on disk in the compressed format, so the Relational Engine requests the data by asking the Storage Engine to go get it. The Storage Engine (with the help of the SQLOS) goes and gets the compressed data from disk and brings it into the Buffer Cache, where the data continues to remain in its compressed format.

Once the data is in the Buffer Cache, the row is handed off to the Relational Engine from the Storage Engine. During this handoff, the compressed row is uncompressed and given to the Relational Engine to UPDATE. Once the row has been updated, it is passed back to the Storage Engine, where it is again compressed and stored in the Buffer Cache. At some point, the row will be flushed to disk, where it is stored in its compressed format.

Data compression offers many benefits. Besides the obvious one of reducing the amount of physical disk space required to store data, and the disk I/O needed to write and read it, it also reduces the amount of memory each page of data takes up in the Buffer Cache. This in turn allows more data to be stored in the Buffer Cache, reducing the need for SQL Server to access the disk to get data, as the data is now more likely to be in memory than on disk, further reducing disk I/O.

Just as data compression offers benefits, it also has some disadvantages. Compression consumes additional CPU cycles. If your server has plenty to spare, then you have no problem. But if your server is already experiencing a CPU bottleneck, then perhaps compression is better left turned off.

Row-Level Data Compression

The simplest method of data compression, row-level compression, works by:

Reducing the amount of metadata used to store a row.
Storing fixed-length numeric data types as if they were variable-length data types. For example, if you store the value 1 in a bigint data type, storage will only take 1 byte, not the 8 bytes that a bigint normally takes.
Storing CHAR data types as variable-length data types. For example, if you have a CHAR(100) data type, and only store 10 characters in it, blank characters are not stored, thus reducing the space needed to store the data.
Not storing NULL or 0 values.
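To make the arithmetic behind these points concrete, here is a small Python sketch of the idea. It is a simplified model and not SQL Server's actual on-disk format: it ignores the per-row metadata the engine still has to keep, and it assumes one byte per character for CHAR data. The function names are invented for this example.

```python
# A simplified model of row-level compression savings, not the real storage format.

def int_bytes_needed(value):
    """Smallest number of bytes that can hold a signed integer; 0 takes none."""
    if value == 0:
        return 0  # zero values are not stored at all
    size = 1
    # Grow until the value fits in the signed range of `size` bytes.
    while not (-(1 << (8 * size - 1)) <= value <= (1 << (8 * size - 1)) - 1):
        size += 1
    return size


def char_bytes_needed(text):
    """Bytes for a CHAR value once trailing blank padding is dropped."""
    return len(text.rstrip(" "))


if __name__ == "__main__":
    # A bigint normally takes 8 bytes regardless of the value it holds.
    print(int_bytes_needed(1), "byte(s) instead of 8")
    # A CHAR(100) column holding 10 characters is padded to 100 bytes.
    padded = "0123456789".ljust(100)
    print(char_bytes_needed(padded), "byte(s) instead of", len(padded))
```

The savings depend entirely on the values stored: a bigint column full of large numbers, or a CHAR column that is always filled, gains little from this technique.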

Row-level data compression offers less compression than page-level data compression, but it also incurs less overhead, reducing the amount of CPU resources required to implement it.

Page-Level Data Compression

Page-level data compression offers greater compression, but at the expense of greater CPU utilization. It works using these techniques:

It starts out by using row-level data compression to get as many rows as it can on a single page.
Next, prefix compression is run. Essentially, repeating patterns of data at the beginning of the values of a given column are removed and substituted with an abbreviated reference that is stored in the compression information (CI) structure that immediately follows the page header of a data page.
And last, dictionary compression is used. Dictionary compression searches for repeated values anywhere on a page and stores them in the CI. One of the major differences between prefix and dictionary compression is that prefix compression is restricted to one column, while dictionary compression works anywhere on a data page.
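The difference between the last two techniques can be sketched in a few lines of Python. This is a deliberately simplified model, written under the assumption that a prefix is only removed when every value in the column shares it; SQL Server's actual on-page format is more elaborate and is not reproduced here. The function names and sample values are invented for illustration.

```python
# A toy model of prefix and dictionary compression, not SQL Server's page format.
import os
from collections import Counter


def prefix_compress(column_values):
    """Replace a prefix shared by every value in ONE column with its length."""
    prefix = os.path.commonprefix(column_values)
    # Each value becomes (number of prefix characters reused, remaining suffix).
    return prefix, [(len(prefix), v[len(prefix):]) for v in column_values]


def dictionary_compress(page_values):
    """Replace values that repeat ANYWHERE on the page with a dictionary index."""
    counts = Counter(page_values)
    dictionary = [v for v in counts if counts[v] > 1]  # kept in first-seen order
    index = {v: i for i, v in enumerate(dictionary)}
    encoded = [("ref", index[v]) if v in index else ("raw", v)
               for v in page_values]
    return dictionary, encoded


if __name__ == "__main__":
    # Prefix compression looks at the values of a single column.
    print(prefix_compress(["aaabb", "aaabcc", "aaaccc"]))
    # Dictionary compression looks at values from any column on the page.
    print(dictionary_compress(["WA", "OR", "WA", "CA", "OR"]))
```

In both functions, the shared prefix and the dictionary play the role of the information kept in the CI structure, while the rows themselves hold only short references.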

The amount of compression provided by page-level data compression is highly dependent on the data stored in a table or index. If a lot of the data repeats itself, then compression is more efficient. If the data is more random, then little benefit can be gained from page-level compression.

Data Compression Demo

Data compression can be performed using either SQL Server Management Studio (SSMS) or by using Transact-SQL. For this demo, we will take a look at how you can compress a table that uses a clustered index, using SSMS.

Let’s say that we want to compress the Sales.Customer table (which has a clustered index) in the AdventureWorks database. The first step is to right-click on the table in SSMS, select “Storage,” and then select “Manage Compression.”

Figure 1: SSMS can be used to manage compression.

This brings up the Data Compression Wizard, displayed below.

Figure 2: The Data Compression Wizard, or Transact-SQL commands, can be used to manage data compression.

After clicking “Next,” the wizard displays the following screen, which allows you not only to select the compression type, but also to calculate how much space you will save once compression has been turned on.

Figure 3: Use this screen to select the compression type, and to calculate how much space will be saved.

First, let’s see how much space we will save if we use row-level compression on this table. To find out, click on the drop-down box below “Compression Type,” select “Row,” and then click “Calculate.”

Figure 4: For this table, row-level compression doesn’t offer much compression.

After clicking “Calculate,” the wizard runs and calculates how much space is currently being used, and how much space would be used after row-level compression. As we can see, very little space will be saved, about 1.6%.

Now, let’s see how much compression savings page-level compression offers us for this particular table. Again, go to the drop-down box under “Compression Type,” select “Page,” then click “Calculate.”

Figure 5: Page-level compression is higher than row-level compression.

After pressing “Calculate,” we see that compression has improved greatly, now saving about 20% space. At this point, if you should decide to turn on page-level compression for this table, click on the “Next” button.
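The same estimate is available outside the wizard through the sp_estimate_data_compression_savings system stored procedure. The Python sketch below only assembles that call as text and shows how a savings percentage is derived from a "before" and "after" size; the helper names are invented, and the sizes in the usage lines are made-up numbers, not the figures the wizard reported for Sales.Customer.

```python
# A sketch only: builds the T-SQL text and does the percentage arithmetic.

def estimate_statement(schema, table, mode):
    """Build a call to sp_estimate_data_compression_savings for a whole table."""
    if mode not in ("NONE", "ROW", "PAGE"):
        raise ValueError("mode must be NONE, ROW, or PAGE")
    return (
        "EXEC sp_estimate_data_compression_savings "
        f"@schema_name = '{schema}', @object_name = '{table}', "
        "@index_id = NULL, @partition_number = NULL, "
        f"@data_compression = '{mode}';"
    )


def percent_saved(current_kb, compressed_kb):
    """Space saved, as a percentage of the current size."""
    if current_kb <= 0:
        raise ValueError("current_kb must be positive")
    return 100 * (current_kb - compressed_kb) / current_kb


if __name__ == "__main__":
    print(estimate_statement("Sales", "Customer", "PAGE"))
    # Hypothetical sizes, chosen only to show the calculation.
    print(round(percent_saved(1000, 800), 1), "% saved")
```

Passing NULL for the index and partition asks for an estimate covering every index and partition of the table, which makes it easy to compare ROW and PAGE for each object before committing to either.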

Figure 6: The wizard offers several options for turning on compression.

At the above screen, you can choose to perform the compression now (not a good idea during busy production times because the initial compression process can be very CPU and disk I/O intensive), schedule it to run later, or just script the Transact-SQL code so you can run it at your convenience.

Once you have compressed this table (a clustered index), keep in mind that any non-clustered indexes that this table may have are not automatically compressed for you. Remember, compression is applied on a per-object basis. If you want to compress the non-clustered indexes for this table, you will have to compress each one, one at a time.

While this wizard helps you to see how much compression either method offers, it does not suggest which compression method should be used, nor does it recommend whether compression should be used at all for this object. As the DBA, it will be your job to evaluate each compressible object to determine if it should have compression enabled or not. In other words, you must decide if the benefits of compression outweigh the negatives.

Backup Compression

For years, there have been third-party programs that allow you to compress and speed up SQL Server backups. In most regards, the backup compression included with the Enterprise Edition of SQL Server is very plain vanilla. If you are already using a third-party backup program, I would suggest you continue using it, because SQL Server 2008 backup compression offers fewer features. In fact, the only option SQL Server 2008 backup compression offers you is to turn it off or on. That’s it.

SQL Server 2008 backup compression, like the third-party add-ons, compresses backups, which not only saves backup space but can also substantially reduce backup times. Unlike data compression, there is very little downside to using backup compression, other than the additional CPU resources required to perform the compression (or decompression during a restore). Assuming that you perform backups during slower times of the day, the additional CPU usage is unlikely to be noticeable.

The time and space savings offered by backup compression depend on the data in your database. If you are heavily using data compression in your databases, or are using Transparent Data Encryption, then using backup compression probably won’t offer you many benefits, as already compressed data, or encrypted data, is not very compressible.

Let’s take a brief look at how you turn on SQL Server 2008 backup compression. While our example will use SSMS, you can use Transact-SQL to perform the same task. To back up AdventureWorks, right-click on the database in SSMS, select “Tasks,” and then select “Back Up,” and the backup dialog box appears.

Figure 7: As with any backup, you can use the backup dialog box to make your selections.

Once you have selected your backup options, next click on “Options,” and the following screen appears.

Figure 8: Backup compression options are limited.

At the top of Figure 8 are the standard backup options, while at the bottom of the screen you see the options for backup compression. Notice that you only have three choices.

The first option, “Use the default server settings” tells the backup to use the server’s default backup compression setting. In SQL Server 2008, there is a new sp_configure option called “backup compression default.” By default, it is set to have backup compression off. If you want, you can set this option so that backup compression is turned on by default. So if you choose the “Use the default server settings” option above, then whatever option is set for the “backup compression default” will be used for the backup.

The “Compress Backup” option turns backup compression on, and the “Do not compress backup” option turns it off. Both of these options override the “backup compression default” server setting, whatever it happens to be.
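For readers who prefer Transact-SQL to the dialog box, the three choices map onto the BACKUP DATABASE statement: WITH COMPRESSION, WITH NO_COMPRESSION, or neither (which falls back to the server default). The Python sketch below only builds those statements as text, along with the sp_configure call that changes the default. The helper names and the backup path are assumptions made up for this example.

```python
# A sketch only: assembles T-SQL text for the three backup compression choices.

def backup_statement(database, path, compression=None):
    """compression: True = compress, False = do not compress, None = server default."""
    safe_path = path.replace("'", "''")  # escape single quotes for the T-SQL literal
    stmt = f"BACKUP DATABASE [{database}] TO DISK = N'{safe_path}'"
    if compression is True:
        stmt += " WITH COMPRESSION"
    elif compression is False:
        stmt += " WITH NO_COMPRESSION"
    return stmt + ";"


def set_backup_compression_default(enabled):
    """Build the sp_configure call that changes the server-wide default."""
    value = 1 if enabled else 0
    return (f"EXEC sp_configure 'backup compression default', {value};\n"
            "RECONFIGURE;")


if __name__ == "__main__":
    path = r"D:\Backups\AdventureWorks.bak"  # hypothetical location
    print(backup_statement("AdventureWorks", path))         # server default
    print(backup_statement("AdventureWorks", path, True))   # compress
    print(backup_statement("AdventureWorks", path, False))  # do not compress
    print(set_backup_compression_default(True))
```

Leaving the option off the BACKUP statement is the scripted equivalent of “Use the default server settings,” so changing the server default affects every backup job that does not state a preference of its own.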

Once you have chosen your backup compression method, you proceed with the backup just like any other SQL Server backup.

If you need to restore a compressed backup, you don’t have to do anything special; it will uncompress itself automatically. Although you can only compress backups using the Enterprise Edition of SQL Server 2008, you can restore a compressed backup to any edition of SQL Server 2008. On the other hand, you cannot restore a compressed SQL Server 2008 backup to any previous version of SQL Server.

Summary

In this article, we have learned about the two forms of data compression, and about backup compression. While data compression might seem like a seductive new feature of SQL Server, I highly recommend that only experienced DBAs use it. While it offers lots of promise for increased performance, it can just as easily cause performance problems if misused. Backup compression, on the other hand, can be used by DBAs of all skill levels.

Brad M McGehee

SQL Server DBA