Wednesday, January 11, 2012

SQL Server Bottleneck Analysis Guidelines

                           SQL Server Bottleneck Analysis Guidelines

Memory Bottleneck Analysis 2

Disk Bottleneck Analysis 4

Processor Bottleneck Analysis 6

Overall SQL Server Bottleneck Analysis 8

Transaction Management 9

Key performance counters 10



Memory Bottleneck Analysis



Object: - Memory

Counter: - Available Mbytes

Preferred Value: - > 20MB

Reference: - KB 889654



Object: - Memory

Counter: - Free System Page Table Entries

Preferred Value: - > 7000

Description: - Free System Page Table Entries is the number of page table entries not currently in use by the system. If < 7000, consider removing /3GB.

Reference: - KB 311901



Object: - Memory

Counter: - Pages/Sec

Preferred Value: - < 50

Description: - Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays.

Reference: - Monitoring and Tuning Your Server



Object: - Memory

Counter: - Pages Input/Sec

Preferred Value: - < 10

Description: - Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults.

Reference: - KB 889654



Object: - Paging File

Counter: - %Usage

Preferred Value: - < 70%

Description: - The amount of the Page File instance in use in percent.

Reference: - KB 889654



Object: - SQLServer:Buffer Manager

Counter: - Page Life Expectancy

Preferred Value: - > 300

Description: - This performance monitor counter tells you, on average, how long data pages are staying in the buffer. If this value gets below 300 seconds, this is a potential indication that your SQL Server could use more memory in order to boost performance.



Object: - SQLServer:Buffer Manager

Counter: - Lazy Writes/Sec

Preferred Value: - < 20

Description: - This counter tracks how many times a second that the Lazy Writer process is moving dirty pages from the buffer to disk in order to free up buffer space. Generally speaking, this should not be a high value, say more than 20 per second or so. Ideally, it should be close to zero. If it is zero, this indicates that your SQL Server's buffer cache is plenty big and SQL Server doesn't have to free up dirty pages, instead waiting for this to occur during regular checkpoints. If this value is high, then a need for more memory is indicated.



Object: - SQLServer:Buffer Manager

Counter: - Page reads/sec

Preferred Value: - < 90

Description: - Number of physical database page reads issued. 80 – 90 per second is normal, anything that is above indicates indexing or memory constraint.



Object: - SQLServer:Buffer Manager

Counter: - Checkpoint Pages/Sec

Preferred Value: - This value is relative, it varies from server to server, we need to compare the average to a base line capture to tell if the value is high or low.

Description: - When a checkpoint occurs, all dirty pages are written to disk. This is a normal procedure and will cause this counter to rise during the checkpoint process. What you don't want to see is a high value for this counter over time. This can indicate that the checkpoint process is running more often than it should, which can use up valuable server resources. If this has a high figure (and this will vary from server to server), consider adding more RAM to reduce how often the checkpoint occurs, or consider increasing the "recovery interval" SQL Server configuration setting.



Object: - SQLServer:Buffer Manager

Counter: - Page writes/sec

Preferred Value: - < 90

Description: - Number of physical database page writes issued. 80 – 90 per second is normal, anything more we need to check the lazy writer/sec and checkpoint counters, if these counters are also relatively high then, it’s memory constraint.



Object: - SQLServer:Buffer Manager

Counter: - Free pages

Preferred Value: - > 640

Description: - Total number of pages on all free lists.



Object: - SQLServer:Buffer Manager

Counter: - Stolen pages

Preferred Value: - Varies. Compare with baseline

Description: - Number of pages used for miscellaneous server purposes (including procedure cache).



Object: - SQLServer:Buffer Manager

Counter: - Buffer Cache hit ratio

Preferred Value: - > 90%

Description: - Percentage of pages that were found in the buffer pool without having to incur a read from disk.



Object: - SQLServer:Buffer Manager

Counter: - Target Server Memory(KB)

Preferred Value: -

Description: - Total amount of dynamic memory the server can consume.



Object: - SQLServer:Buffer Manager

Counter: - Total Server Memory(KB)

Preferred Value: -

Description: - Total amount of dynamic memory (in kilobytes) that the server is using currently



Disk Bottleneck Analysis



Object: - PhysicalDisk

Counter: - Avg. Disk Sec/Read

Preferred Value: - < 8ms

Description: - Measure of disk latgency. Avg. Disk sec/Read is the average time, in seconds, of a read of data from the disk.

More Info:

Reads or non cached Writes

Excellent < 08 Msec ( .008 seconds )

Good < 12 Msec ( .012 seconds )

Fair < 20 Msec ( .020 seconds )

Poor > 20 Msec ( .020 seconds )

Cached Writes Only

Excellent < 01 Msec ( .001 seconds )

Good < 02 Msec ( .002 seconds )

Fair < 04 Msec ( .004 seconds )

Poor > 04 Msec ( .004 seconds



Object: - PhysicalDisk

Counter: - Avg. Disk sec/Write

Preferred Value: - < 8ms (non cached) < 1ms (cached)

Description: - Measure of disk latency. Avg. Disk sec/Write is the average time, in seconds, of a write of data to the disk.



Object: - PhysicalDisk

Counter: - Avg. Disk Read Queue Length

Preferred Value: - < 2 * spindles

Description: - Avg. Disk Read Queue Length is the average number of read requests that were queued for the selected disk during the sample interval.

More Info:

< (2+ no of spindles) Excellent

< (2*no of spindles) Good

< (3* no of spindles) Fair

Reference - Whitepaper “Performance Monitoring in Windows 2003: Best Practices” by Ben W. Christenbury



Note: If the disk has say 20 disk and it is RAID 10 then no. of spindles = 20/2 = 10. If it is RAID 5 then the no. of spindles = no of disks = 20.



Object: - PhysicalDisk

Counter: - Avg. Disk Write Queue Length

Preferred Value: - < 2 * spindles

Description: - Avg. Disk Write Queue Length is the average number of write requests that were queued for the selected disk during the sample interval.





Object: - SQLServer:Buffer Manager

Counter: - Page reads/sec

Preferred Value: - < 90

Description: - Number of physical database page reads issued. 80 – 90 per second is normal, anything that is above indicates indexing or memory constraint.



Object: - SQLServer:Buffer Manager

Counter: - Page writes/sec

Preferred Value: - < 90

Description: - Number of physical database page writes issued. 80 – 90 per second is normal, anything more we need to check the lazy writer/sec and checkpoint counters, if these counters are also relatively high then, it’s memory constraint.



Object: - SQLServer:Buffer Manager

Counter: - Free pages

Preferred Value: - > 640

Description: - Total number of pages on all free lists.



Object: - SQLServer:Buffer Manager

Counter: - Stolen pages

Preferred Value: - Varies. Compare with baseline

Description: - Number of pages used for miscellaneous server purposes (including procedure cache).



Object: - SQLServer:Buffer Manager

Counter: - Buffer Cache hit ratio

Preferred Value: - > 90%

Description: - Percentage of pages that were found in the buffer pool without having to incur a read from disk.



Processor Bottleneck Analysis



Object: - Processor

Counter: - %Processor Time

Preferred Value: - < 80%

Description: - % Processor Time is the percentage of elapsed time that the processor spends to execute a non-Idle thread.



Object: - Processor

Counter: - %Privileged Time

Preferred Value: - < 30% of Total %Processor Time

Description: - % Privileged Time is the percentage of elapsed time that the process threads spent executing code in privileged mode.



Object: - Process (sqlservr)

Counter: - %Processor Time

Preferred Value: - < 80%



Object: - Process (sqlservr)

Counter: - %Privileged Time

Preferred Value: - < 30% of %Processor Time (sqlservr)

Description: - Note: Divide the value by number of processors



Object: - System

Counter: - Context Switches/sec

Preferred Value: - < 3000

Description: -

1500 – 3000 per processor Excellent – Fair

> 6000 per processor Poor

Upper limit is about 40,000 at 90 % CPU per CPU

NOTE: Remember to divide by number of processors



Object: - System

Counter: - Processor Queue Length

Preferred Value: - < 4 per CPU

Description: - For standard servers with long Quantums

<= 4 per CPU Excellent

< 8 per CPU Good

< 12 per CPU Fair



Object: - SQLServer:Access Methods

Counter: - Full Scans / sec

Preferred Value: - < 1

Description: - If we see high CPU then we need to investigate this counter, otherwise if the full scan are on small tables we can ignore this counter. Values greater than 1 or 2 indicate that we are having table / Index page scans. We need to analyze how this can be avoided.



Object: - SQLServer:Access Methods

Counter: - Worktables Created/Sec

Preferred Value: - < 20

Description: - Number of worktables created in tempdb per second. Worktables are used for queries that use various spools (table spool, index spool, etc).





Object: - SQLServer:Access Methods

Counter: - Workfiles Created/Sec

Preferred Value: - < 20

Description: - Number of work files created per second. Tempdb workfiles are used in processing hash operations when the amount of data being processed is too big to fit into the available memory. They may be able to reduce this number by making the queries more efficient by adding/changing indexes, adding additional memory, etc.



Object: - SQLServer:Access Methods

Counter: - Page Splits/sec

Preferred Value: - < 20

Description: - Interesting counter that can lead us to our table / index design. This value needs to be low as possible. If you find out that the number of page splits is high, consider increasing the fillfactor of your indexes. An increased fillfactor helps to reduce page splits because there is more room in data pages before it fills up and a page split has to occur.





Overall SQL Server Bottleneck Analysis



Object: - SQLServer:General Statistics

Counter: - User Connections

Preferred Value: -

Description: - The number of users currently connected to the SQL Server.



Object: - SQLServer:General Statistics

Counter: - Logins/sec

Preferred Value: - < 2

Description: - > 2 per second indicates that the application is not correctly using connection pooling.



Object: - SQLServer:General Statistics

Counter: - Logouts/sec

Preferred Value: - < 2

Description: - > 2 per second indicates that the application is not correctly using connection pooling.



Object: - SQLServer:SQL Statistics

Counter: - Batch Requests/Sec

Preferred Value: - < 1000

Description: - Over 1000 batch requests per second indicate a very busy SQL Server.



Object: - SQLServer:SQL Statistics

Counter: - SQL Compilations/sec

Preferred Value: - < 10% of the number of Batch Requests / sec

Description: - The number of times per second that SQL Server compilations have occurred. This value needs to be as low as possible. If you see a high value such as over 100, then it’s an indication that there are lots or adhoc queries that are running, might cause CPU



Object: - SQLServer:SQL Statistics

Counter: - SQL Re-Compilations/sec

Preferred Value: - < 10% of the number of SQL Compilations/sec

Description: - This needs to be nil in our system as much as possible. A recompile can cause deadlocks and compile locks that are not compatible with any locking type.



Object: - SQLServer:Latches

Counter: - Average Latch Wait Time (ms)

Preferred Value: - < 300

Description: - Average latch wait time (milliseconds) for latch requests that had to wait.



Transaction Management



Object: - SQLServer:Locks

Counter: - Number of Deadlocks/sec

Preferred Value: - < 1

Description: - The number of lock requests that resulted in a deadlock.



Object: - SQLServer:Locks

Counter: - Lock Requests/sec

Preferred Value: - < 1000

Description: - Number of requests for a type of lock per second. Lock requests/sec > 1000 indicates that the queries are accessing large number of rows; the next step is to review high read queries. If you also see high Avg. Wait time, then it’s an indication of blocking, then review the blocking script output.



Object: - SQLServer:Locks

Counter: - Average Wait Time (ms)

Preferred Value: - < 500

Description: - This is the average wait time in milliseconds to acquire a lock. Lower the value the better it is. If the value goes higher then 500, there may be blocking going on; we need to run blocker script to identify blocking.



Key performance counters



Buffer cache hit ratio



The buffer cache hit ratio is the percentage of pages found in the buffer cache without having to read from disk. A value of 90 percent or higher is desirable. If the value is 60 percent, you should add more memory until the value is consistently greater than 90 percent.



Page life expectancy



Have you ever checked out the page life expectancy counter in Performance Monitor's Buffer Manager object? SQL Server Books Online (BOL) says the page life expectancy value is the "number of seconds a page will stay in the buffer pool without references." So, a buffer that has a 300-second page life expectancy will keep any given page in memory in the buffer pool for 5 minutes before the buffer pool flushes the page to disk—unless a process references the page.

This counter can be helpful in determining whether you have a memory problem, giving you a reasonably accurate view of whether your server has memory pressure. According to Microsoft, 300 seconds is the minimum target for page life expectancy. If the buffer pool flushes your pages in less than 300 seconds, you probably have a memory problem. Looking at this value is particularly handy when your page life expectancy is significantly higher or lower than 300 seconds. For example, a customer recently asked me, "Do we need more memory?" I monitored the page life expectancy value for the customer's system, and the value didn't fall below 3000 seconds. That's quantifiable proof that more memory wouldn't help performance. Other customers have an average page life expectancy between 10 and 50 seconds, and they wonder why their servers are slow.