Friday, March 7, 2014

Index Selectivity in SQL Server


Index Selectivity in SQL Server

Index selectivity describes distribution of distinct values in a given data set.


To more generalize the term count the number of row and number of distinct values in a given column across all rows then divide count of distinct values by number of rows, the result in a ratio describe the selectivity of the index.


Better the selectivity more useful the index is and used by query optimizer.

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. It's correct, but the problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.
This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.
The comment about selectivity is because of the way SQL keeps statistics on indexes. SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.
SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 4 column index key, SQL knows the density of the first column, of the first and second columns, first and second third columns and of all four columns. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.
This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys.


Therefore, a good rule for the order of columns in an index key is to put the most selective columns first, when all other considerations are equal.



Refer below query on employee table having four columns Id,Name,Addr,Dept

SELECT * FROM Employee
-- CREATE INDEX idx_Employee ON Employee(id,Name,Addr)
-- CREATE INDEX idx_Employee1 ON Employee(Addr,Name,id)
DBCC SHOW_STATISTICS(Employee,'idx_Employee')
-- DBCC SHOW_STATISTICS(Employee,'idx_Employee1')