Understanding - DBCC SHOW_STATISTICS
Most
SQL Server internal is
undocumented, and even if they are documented there is hardly any
information anyone can get out of these. So here is one such attempt where
I take the DBCC SHOW_STATISTICS command and try to understand what it is trying
to give us as useful information. I have at many occasion used these output to
say if our indexing is effective or not. Let us dive deep next...
Typical
Execution:
USE pubs
DBCC SHOW_STATISTICS (authors, aunmind)
GO
This
example when run gives us a typical output as:

Next
let us look at what each of these columns mean.
*
Updated
This
column represents when the statistics were last updated by the system. These
can be automatically or auto-magically updated if we have AUTO UPDATE
STATISTICS enabled for the given database. I would surely recommend you to have
a look at this setting and mark this option. Secondly, you might also want to
update the statistics manually using the UPDATE STATISTICS command that is
available with SQL Server. Use this option is you feel that the table's
statistics are out-of-date by a big margin. I will talk on them in a little
while later next. This data about when the statistics were last updated is
available using the STATS_DATE option. A typical query for our above index
would be:
SELECT 'Index Name' = i.name,
'Statistics Date' = STATS_DATE(i.id, i.indid)
FROM sysobjects o
INNER JOIN sysindexes i
ON o.id = i.id
WHERE o.name = 'authors' AND i.name = 'aunmind'
*
Rows
This
represents the number of rows in the table. This will be the updated value as
and when an INSERT / DELETE happen on the table. This value is same to the rowcnt
column in the sysindexes table in SQL Server. You might also want to compare
the same using a COUNT(1) over the table to get the same value too.
*
Rows Sampled
This
value will be same as the number of rows in the table if the statistics are
upto-date. Else there is a pseudo algorithm that runs in the background when
AUTO UPDATE STATISTICS fires and updates the statistics. Hence these delta rows
can also be found as part of the rowmodctr available in the sysindexes
table. If the statistics are upto-date then the rowmodctr will be 0.
*
Steps
This
is equal to the number of Histogram distribution. This would be equal to the
number of rows retrieved in the 3rd record set. Here the example the 3rd record
set is the RANGE_HI_KEY column would have 22 rows. This is SQL Server's way to
distribute the data so that when a query is fired it can look at specific
ranges and figure out if that section of the data distribution is selective
enough to use the index for the specific query.
*
Density
The
statistics information also includes details about the uniqueness, or density,
of the data values. Density measures how selective an index is. The more
selective an index is, the more useful it is, and because higher
selectivity means that a query can eliminate more rows from consideration. A
unique index is the most selective. By definition, each index entry in a unique
index can point to only one row. In essence what we say is for a unique index
for any given value SQL engine will be able to easily take a decision
given a value because it might get one row or none based on the criteria. SQL
Server uses combinations of Density and ALL_Density columns to decide on the
selectivity.
The
statblob (in sysindexes table) field contains two types of density
information: density values and all_density values. The optimizer uses the
density and all_density values to determine whether it is better to conduct a
table scan or to use an index to access the data. SQL Server computes these
values based on the number of rows in the table, the cardinality of the indexed
columns (i.e., the number of distinct values), the number of nonfrequent (NF)
values (i.e., values that appear no more than once in the histogram
distribution steps), and the cardinality of NF values. NF means the value
occurs no more than once in the distribution steps. It may occur many times in
the data, but will appear in either 0 or 1 distribution steps. Density
values range from 0 to 1. Highly selective indexes have density values of 0.1
or less (use it as a rule of thumb, lesser the better). A unique index has a
density value of 1/number of rows in the table. Thus, unique index is useful
because it is selective enough to be more efficient than a table scan.
*
Average Key Length
This
is the value of the average length of the Index. A general recommendation in
the industry includes keeping the Index narrow. Now this is the place where we
see the same. It is the Average length of the Index. Here in the example we
have three values in the “Average Length” column. It is the maximum of them put
at the “Average Key Length” value. In our example this value can be found using
a query like:
Select AVG(CAST(LEN(au_lname) as DECIMAL(15,3))) from authors
*
ALL Density
This
is same as we discussed for the Density column. It gives us a split for the
given combination of the indexes how the density has been. Like in the above
example the index has been created as a composite index key of two columns,
hence the stats or the density has been outlined for the combination of
au_lname, au_lame + au_fname and au_lame + au_fname + au_id. Now even though
the index (aunmind) was created with only au_lname + au_fname you can see that
we have another statistics with au_id column. This is because by definition a
non-clustered index that is created will have a pointer back to the data (here
clustered index of au_id) in the leaf node. Hence we are getting three
statistical records.
The
availability of density information for all columns helps SQL Server decide how
useful the index is for joins. For example, suppose that an index consists of
three key fields. The density on the first column might be 0.5, which isn't too
useful. But as you look at more columns in the key, the number of rows that are
pointed to is fewer than (or in the worst case, the same as) the number of rows
that are pointed to by just the first column, so the density value goes down.
If you're looking at both the first and second columns, the density might be
0.25, which is somewhat better. And if you examine all three columns, the
density might be 0.03, which is highly selective.
*
Columns
Columns
used for the Index creation from the table. Includes all the combinations in
case of composite Index and would also contain a combination with the Clustered
Index as the data resides there (not applicable for clustered Index)
*
RANGE_HI_KEY
It
is the Index Histograms leaf or the end value. Basically it would be values of
the column itself. It gives us an insight into the data distribution clearly.
The histogram is stored in the ASCII value order. BOL says that it is the upper
bound value of the histogram.
*
RANGE_ROWS
This
value denotes the number of rows that are present in the Histogram break-up. In
other words gives the total rows for the given range of histogram range.
*
EQ_ROWS
This
denotes the number of rows that are equal within the given histogram equal to
the RANGE_HI_KEY value. Hence the number of rows that are equal within the
histogram data partition.
*
DISTINCT_RANGE_ROWS
This
is the number of distinct values within a histogram step, excluding the upper
bound. This value can be used to later find the average range rows.
*
AVG_RANGE_ROWS
This
value is the average number of duplicates within a histogram step. The value of
the same can be calculated as:
=
RANGE_ROWS / DISTINCT_RANGE_ROWS
In
other words, this can also be seen as the selectivity that can be asked for by
the optimizer when the RANGE_HI_KEY value is queried.
Last
minute notes to look for:
-
The
EQ_Rows have to be closer to one. Cannot be asked for every time – it is a
ideal situation (and we had in our example :)). All the PK would have 1.
Indicating a high-selectivity on that index.
-
Key
Length has to be as smaller as possible. There is no hard and fast rule
that states that it has to be below a certain value. More narrow a Index is
more number of rows can fit into a single page and hence with lesser reads we
can get maximum records for the select.
-
Check
if the statistics are up-to-date. This is vital and see you use UPDATE
STATISTICS if need-be.
-
If
the number in "Rows" column are same as the number of rows in the
table. This very much ties with the previous point.
-
(Density
* Rows) should towards the 1 mark. More the value less effective is the index.
This will be one for all UNIQUE indexes. But this is not a recommendation
written on stone.
-
Data-distribution
must be seen to have a Indexing strategy in place.
I
would use these tips for cases when we really don't have the database with
the actual values (say a client / customer site) then such information will
surely give us an idea to how the data distribution are for the indexed
columns.
Conclusion
Sometimes
these DBCC commands give us really very good information that will be useful
for us. But this has been the case for SQL Server 2000 database, in SQL 2005
all these information have been exposed as fabulous DMV's and we can sometimes
get more information than what such DBCC commands can give. Do feel free to
drop me your comments on the article anytime.
|