Home             About Us             Sign Guest Book             Contact Us             Subscribe              RSS  
Articles
Tools
Code Snippets
Links
Tips & Tricks
FAQ
Resources
Articles
Code Snippets
Links
FAQ
Resources

Saravana Kumar
Vinod Kumar

Click on the image to subscribe to our Monthly Newsletter.


Click here for Archives

No Chats Available

Understanding - DBCC SHOW_STATISTICS

Most SQL Server internal is undocumented, and even if they are documented there is hardly any information anyone can get out of these. So here is one such attempt where I take the DBCC SHOW_STATISTICS command and try to understand what it is trying to give us as useful information. I have at many occasion used these output to say if our indexing is effective or not. Let us dive deep next...

Typical Execution:

USE pubs 
DBCC SHOW_STATISTICS (authors, aunmind) 
GO

 This example when run gives us a typical output as:

Next let us look at what each of these columns mean.

* Updated

This column represents when the statistics were last updated by the system. These can be automatically or auto-magically updated if we have AUTO UPDATE STATISTICS enabled for the given database. I would surely recommend you to have a look at this setting and mark this option. Secondly, you might also want to update the statistics manually using the UPDATE STATISTICS command that is available with SQL Server. Use this option is you feel that the table's statistics are out-of-date by a big margin. I will talk on them in a little while later next. This data about when the statistics were last updated is available using the STATS_DATE option. A typical query for our above index would be:

SELECT 'Index Name' = i.name, 
          'Statistics Date' = STATS_DATE(i.id, i.indid)
FROM sysobjects o
INNER JOIN sysindexes i
ON o.id = i.id 
WHERE o.name = 'authors' AND i.name = 'aunmind'

* Rows

This represents the number of rows in the table. This will be the updated value as and when an INSERT / DELETE happen on the table. This value is same to the rowcnt column in the sysindexes table in SQL Server. You might also want to compare the same using a COUNT(1) over the table to get the same value too.

* Rows Sampled

This value will be same as the number of rows in the table if the statistics are upto-date. Else there is a pseudo algorithm that runs in the background when AUTO UPDATE STATISTICS fires and updates the statistics. Hence these delta rows can also be found as part of the rowmodctr available in the sysindexes table. If the statistics are upto-date then the rowmodctr will be 0.

* Steps

This is equal to the number of Histogram distribution. This would be equal to the number of rows retrieved in the 3rd record set. Here the example the 3rd record set is the RANGE_HI_KEY column would have 22 rows. This is SQL Server's way to distribute the data so that when a query is fired it can look at specific ranges and figure out if that section of the data distribution is selective enough to use the index for the specific query.

* Density

The statistics information also includes details about the uniqueness, or density, of the data values. Density measures how selective an index is. The more selective an index is, the more useful it is, and because higher selectivity means that a query can eliminate more rows from consideration. A unique index is the most selective. By definition, each index entry in a unique index can point to only one row. In essence what we say is for a unique index for any given value SQL engine will be able to easily take a decision given a value because it might get one row or none based on the criteria. SQL Server uses combinations of Density and ALL_Density columns to decide on the selectivity.

The statblob (in sysindexes table) field contains two types of density information: density values and all_density values. The optimizer uses the density and all_density values to determine whether it is better to conduct a table scan or to use an index to access the data. SQL Server computes these values based on the number of rows in the table, the cardinality of the indexed columns (i.e., the number of distinct values), the number of nonfrequent (NF) values (i.e., values that appear no more than once in the histogram distribution steps), and the cardinality of NF values. NF means the value occurs no more than once in the distribution steps. It may occur many times in the data, but will appear in either 0 or 1 distribution steps. Density values range from 0 to 1. Highly selective indexes have density values of 0.1 or less (use it as a rule of thumb, lesser the better). A unique index has a density value of 1/number of rows in the table. Thus, unique index is useful because it is selective enough to be more efficient than a table scan.

* Average Key Length

This is the value of the average length of the Index. A general recommendation in the industry includes keeping the Index narrow. Now this is the place where we see the same. It is the Average length of the Index. Here in the example we have three values in the “Average Length” column. It is the maximum of them put at the “Average Key Length” value. In our example this value can be found using a query like:

Select AVG(CAST(LEN(au_lname) as DECIMAL(15,3))) from authors

* ALL Density

This is same as we discussed for the Density column. It gives us a split for the given combination of the indexes how the density has been. Like in the above example the index has been created as a composite index key of two columns, hence the stats or the density has been outlined for the combination of au_lname, au_lame + au_fname and au_lame + au_fname + au_id. Now even though the index (aunmind) was created with only au_lname + au_fname you can see that we have another statistics with au_id column. This is because by definition a non-clustered index that is created will have a pointer back to the data (here clustered index of au_id) in the leaf node. Hence we are getting three statistical records.

The availability of density information for all columns helps SQL Server decide how useful the index is for joins. For example, suppose that an index consists of three key fields. The density on the first column might be 0.5, which isn't too useful. But as you look at more columns in the key, the number of rows that are pointed to is fewer than (or in the worst case, the same as) the number of rows that are pointed to by just the first column, so the density value goes down. If you're looking at both the first and second columns, the density might be 0.25, which is somewhat better. And if you examine all three columns, the density might be 0.03, which is highly selective.

* Columns

Columns used for the Index creation from the table. Includes all the combinations in case of composite Index and would also contain a combination with the Clustered Index as the data resides there (not applicable for clustered Index)

* RANGE_HI_KEY

It is the Index Histograms leaf or the end value. Basically it would be values of the column itself. It gives us an insight into the data distribution clearly. The histogram is stored in the ASCII value order. BOL says that it is the upper bound value of the histogram.

* RANGE_ROWS

This value denotes the number of rows that are present in the Histogram break-up. In other words gives the total rows for the given range of histogram range.

* EQ_ROWS

This denotes the number of rows that are equal within the given histogram equal to the RANGE_HI_KEY value. Hence the number of rows that are equal within the histogram data partition.

* DISTINCT_RANGE_ROWS

This is the number of distinct values within a histogram step, excluding the upper bound. This value can be used to later find the average range rows.

* AVG_RANGE_ROWS

This value is the average number of duplicates within a histogram step. The value of the same can be calculated as:

= RANGE_ROWS / DISTINCT_RANGE_ROWS

In other words, this can also be seen as the selectivity that can be asked for by the optimizer when the RANGE_HI_KEY value is queried.

Last minute notes to look for:

  • The EQ_Rows have to be closer to one. Cannot be asked for every time – it is a ideal situation (and we had in our example :)). All the PK would have 1. Indicating a high-selectivity on that index.
  • Key Length has to be as smaller as possible. There is no hard and fast rule that states that it has to be below a certain value. More narrow a Index is more number of rows can fit into a single page and hence with lesser reads we can get maximum records for the select.
  • Check if the statistics are up-to-date. This is vital and see you use UPDATE STATISTICS if need-be.
  • If the number in "Rows" column are same as the number of rows in the table. This very much ties with the previous point.
  • (Density * Rows) should towards the 1 mark. More the value less effective is the index. This will be one for all UNIQUE indexes. But this is not a recommendation written on stone.
  • Data-distribution must be seen to have a Indexing strategy in place.

I would use these tips for cases when we really don't have the database with the actual values (say a client / customer site) then such information will surely give us an idea to how the data distribution are for the indexed columns.

Conclusion

Sometimes these DBCC commands give us really very good information that will be useful for us. But this has been the case for SQL Server 2000 database, in SQL 2005 all these information have been exposed as fabulous DMV's and we can sometimes get more information than what such DBCC commands can give. Do feel free to drop me your comments on the article anytime.

 

 

Comment about this article
Free Hit Counters
Free Hit Counters