For the past couple of months I have been chasing performance issues on one of the highest-profile servers I support. Yes, I said months, and high profile. Typically the pressure to find a quick fix would be immense, but luckily we have a strong team working the issue, so we are able to work the process rather than throw band-aids at the problem until it goes away. I should add that we are lucky enough to have excellent monitoring tools that allow us to spot a problem and take corrective action before it noticeably impacts users.
At somewhat regular intervals, my SQL 2005 SP3 server with 128GB of RAM and enough SAN bandwidth for three servers would begin to get sluggish. The number of active user sessions would climb without end, while none of the sessions would be waiting on any resource. This usually happened in the afternoon on certain days of the week, but it could happen any time during the business day as long as the server was under sufficient load. The problem could be quickly mitigated by freeing the procedure cache, either for the whole server or for a single database. Even running sp_updatestats, which acts as a slow-moving, single-database DBCC FREEPROCCACHE, would fix the problem, although performance was even more sluggish while it ran. Immediately after enough cache was freed, the CPU would shoot way up and any backlog would quickly clear.
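For reference, the mitigation looked roughly like this. The database name is hypothetical, and the single-database variant relies on DBCC FLUSHPROCINDB, which is undocumented in SQL 2005, so treat this as a sketch rather than a supported recipe:

```sql
-- Server-wide: clears the entire plan cache (all databases)
DBCC FREEPROCCACHE;

-- Single database (undocumented in SQL 2005): clears plans for one DB only
DECLARE @dbid int;
SET @dbid = DB_ID('MyHighProfileDB');  -- hypothetical database name
DBCC FLUSHPROCINDB(@dbid);

-- The slow-motion equivalent: updating statistics invalidates cached
-- plans that reference the updated objects as it works through each table
EXEC sp_updatestats;
```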
Being a DBA, and seeing the world from a DBA's perspective, I first assumed the issue was related to code that was no longer optimal for the task. The symptoms sure pointed to a bad plan or a cardinality issue. I spent a lot of time hunting for code-tuning opportunities and missing indexes, cutting the server's CPU usage in half and cutting average reads per second from 200k to 50k. Oddly enough, this had no impact on the frequency of the problem.
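The hunt for tuning candidates leaned on the standard DMVs available in SQL 2005. A query along these lines surfaces the top CPU consumers currently in the plan cache; it is a generic starting point, not the exact query I used:

```sql
-- Top 10 cached statements by total worker time (CPU)
SELECT TOP 10
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```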
Realizing I was dealing with something larger, I began to form a team to look into the issue, pulling in subject matter experts from different areas of the organization. Working through the troubleshooting process, we discovered that while new versions of monitoring tools, backup software, and antivirus software had been added to the server, the maximum memory setting had not been evaluated since the server was built three years ago. This server happens to boot from SAN, with synchronous replication for disaster recovery purposes. The primary page file is on non-replicated drives, but a small page file is kept on the C drive to capture crash dumps. Watching the server run, we discovered that Windows seems to go to the page file on the C drive before going to any other. It quickly became clear that anything that forced the OS to the paging file carried a much higher performance cost than we had ever thought. As a result, I dropped the maximum memory setting and saw a drop in CPU usage and IO waits. Oddly enough, this too had no impact on the frequency of the problem.
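Re-evaluating max server memory is a one-line change once you have decided on a number. The 110 GB figure below is illustrative, not the value we actually settled on:

```sql
-- Show the current setting
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';

-- Leave more headroom for the OS and the newly added agents
EXEC sp_configure 'max server memory (MB)', 112640;  -- 110 GB, illustrative
RECONFIGURE;
```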
Working through the issue with the rest of the team, the focus kept coming back to one question: “What else does freeing the procedure cache do?” Working through the research process, I flushed out a lot of configuration changes that had been made in the heat of the moment to confront this problem in the past. One thing I kept noticing was how much this resembled issues I had seen when the server was new, which had been resolved by running DBCC FREESYSTEMCACHE('TokenAndPermUserStore'), but I remembered that those had been fixed in SQL 2005 SP2. The more research I did, the more I kept coming back to the possibility of security cache bloat, so I decided to find out more about it. I was surprised to discover that while some issues had been fixed, there are still others that can cause the security cache to bloat, and that there are now trace flags to manage the size of the cache. Armed with this new information, I waited for the problem to occur again, this time running DBCC FREESYSTEMCACHE('TokenAndPermUserStore'). It immediately fixed the performance problem.
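Checking the size of the security token cache before and after the flush makes the effect obvious. On SQL 2005 the memory clerk DMV does the job:

```sql
-- How big is the security token cache right now?
SELECT SUM(single_pages_kb + multi_pages_kb) AS token_cache_kb
FROM sys.dm_os_memory_clerks
WHERE name = 'TokenAndPermUserStore';

-- The fix: flush only this cache, leaving the plan cache intact
DBCC FREESYSTEMCACHE('TokenAndPermUserStore');
```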
Working with smart people is both a blessing and a curse. The blessings are obvious and well covered elsewhere, but the curse is that smart people do not get smart without a thirst for knowledge. What that meant here is that I had to be able to explain what was bloating the security cache; just having the fix was not enough.
I truly believe that solving a problem is much easier once you know what is wrong. That truism proved itself again here. Once I set out to see what was bloating the security cache, I discovered this article: http://blogs.msdn.com/psssql/archive/2008/06/16/query-performance-issues-associated-with-a-large-sized-security-cache.aspx which had everything I needed. The most important part was the first query, which showed what was using the cache, broken down by type. I noticed that the majority of the entries had a class of 65535, leading me to this KB article: http://support.microsoft.com/kb/933564, which explains what can cause the security cache to bloat. Reading through the conditions, I noticed that CREATE, ALTER, and DROP operations on a table object in any database (including TempDB) were on the list. EUREKA!
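The diagnostic boils down to grouping the security cache entries by their class. The class value is embedded in the entry_data text of the cache entries DMV; a rough version of that kind of query (simpler than the one in the article) might look like this:

```sql
-- Break down TokenAndPermUserStore entries by the class embedded in
-- entry_data; class 65535 covers access checks on many object types
SELECT
    SUBSTRING(e.entry_data, CHARINDEX('class=', e.entry_data), 14) AS entry_class,
    COUNT(*) AS entry_count
FROM sys.dm_os_memory_cache_entries AS e
WHERE e.name = 'TokenAndPermUserStore'
GROUP BY SUBSTRING(e.entry_data, CHARINDEX('class=', e.entry_data), 14)
ORDER BY entry_count DESC;
```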
There is a batch job that I had seen running during these episodes, but since it ran frequently and there was nothing wrong with the query plans it used, I had no way to prove it was involved in the issue. The code the batch job calls does have some opportunities for improvement that I am working through with the developers who own it. Unfortunately, the code is very high profile, so reworking it is a slow, deliberate process. One of the issues with the code is that it uses temporary tables to implement procedural logic in certain places.
It appears that calling the code through a batch job leads to massive amounts of temporary table creation and destruction, causing the security cache to bloat, leading to higher CPU usage as the cache grows in size. Once the cache achieves critical mass, CPU usage drops and requests begin to queue. The requests continue to run more slowly, causing further queuing, until the DBCC FREESYSTEMCACHE('TokenAndPermUserStore') command is issued.
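To correlate the bloat with the batch job's schedule, one lightweight approach is to sample the cache size into a table every few minutes from a SQL Agent job. This is a hypothetical sketch, not something from our production setup:

```sql
-- Hypothetical logging table, populated on a schedule by a SQL Agent job
CREATE TABLE dbo.TokenCacheSizeLog (
    sample_time     datetime NOT NULL DEFAULT (GETDATE()),
    token_cache_kb  bigint   NOT NULL
);

-- Job step: record the current size of the security token cache
INSERT INTO dbo.TokenCacheSizeLog (token_cache_kb)
SELECT SUM(single_pages_kb + multi_pages_kb)
FROM sys.dm_os_memory_clerks
WHERE name = 'TokenAndPermUserStore';
```

Plotting token_cache_kb against the batch job's run times should show the cache climbing while the job runs and collapsing after a flush.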
I have been reminded of several things from this exercise.
- Working through the troubleshooting process works
- Knee-jerk reactions in the heat of the moment lead to more work later
- What I knew then is not necessarily still true
- Adding / Changing software on a server needs to be monitored
- I need to make sure I am always keeping an open mind when chasing issues
Now it is time to figure out what trace flag I want to use to address the security cache bloat. After that I get to start prioritizing all of the new items that this exercise has added to my to-do list.
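As I recall from KB 959823, the candidates are trace flag 4618, which caps the number of entries in TokenAndPermUserStore, and 4618 combined with 4610, which raises that cap; verify the exact behavior against the KB before enabling anything. They can be tried at runtime before being committed as startup parameters:

```sql
-- Try the flags server-wide at runtime first
DBCC TRACEON (4618, -1);   -- cap the TokenAndPermUserStore entry count
DBCC TRACEON (4610, -1);   -- combined with 4618, raises the cap
DBCC TRACESTATUS (-1);     -- confirm which flags are active

-- Once proven, add -T4618 (and optionally -T4610) to the startup parameters
```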