Managing the Size of the Security Cache (TokenAndPermUserStore) in SQL 2005

Introduction

In an earlier post I walked through the diagnosis of a problem with TokenAndPermUserStore bloat in SQL 2005.In this post I will go through what I did to work through the issues and the results of those efforts.

Triage

Once I understood that the issue was TokenAndPermUserStore bloat, the first step was to stop the bleeding. The quick fix I chose was to create a job that ran every 30 minutes to issue a DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’). Once the job was in and running the random server slowdowns immediately stopped. At this point I knew I was on the right track even if all I had was a band-aid.

Research

There is not a lot of good documentation out there on TokenAndPermUserStore bloat. I am guessing not too many people are running into it. This post is the best resource I have found. I had already installed all of the suggested service packs and cumulative updates so that left me with trace flags and fixing the underlying code.

While fixing the code seemed like the most obvious fix, some of the code that is causing the problem is optimal for the purpose. Of all the alternatives that have been tried, the current code is the best way to get the job done.  Going back and asking for resources to change proven code to deal with what I see as a SQL Server problem just did not seem right.

Ruling out a code fix left me with two options, let the cache clearing job keep running every 30 minutes and hope nobody ever turns it off or add a trace flag to control the behavior of the server. I ruled out leaving the job running and set out to learn everything I could about –T4618. What I quickly learned is that there really is not that much information out there on the subject. What I was able to find through various sources is that –T4618 limits the number of entries in the TokenAndPermUserStore to 1024. I know from watching the counts in the cache that every time I free the cache it quickly shoots up to about 5000 entries then grows slowly from there so –T4618 would cause a huge amount of churn and CPU use. I then found that to get around the 1024 entries limitation I could also apply –T4610, limiting the cache to 8124 entries. 8124 sounded pretty good given the behavior I was seeing.

I did also run across this article on using –T4621 to customize the size of the Security Cache, but decided it was not for me due to the complexities of picking a good number, converting it to binary and editing the registry on the cluster. In SQL 2008 the value can be changed with an sp_configure call (no trace flag), making it a much more viable option. For my SQL 2005 server I saw –T4621 as something to try if –T4618 and –T4610 did not work the way I expected.

Turning on the Trace Flags

I found a number of articles that said you can turn –T4618 and –T4610 on while the server is running. Not true. You have to add them as startup parameters using SQL Configuration Manager then restart your server, raising the stakes somewhat if I guessed wrong on the setting.

I decided I was comfortable enough with my estimates that I pushed all of the paperwork and during a slack time on the system, added –T4618;-T4610 to the list of parameters on both cluster nodes. I then took the cluster resource offline and brought it back online. It all happened so fast that it was almost hard to tell I did anything. The only real evidence besides the log rolling over at an odd time is an entry in the SQL log  that says “–T4618”, another that says “–T4610” and finally one that says “TokenAndPermUserStore Memory Quota set by -T4618 as 8192.”

The Results

The primary application that uses this SQL Server is a web site. It does pretty decent volume so over the years the business has invested in some pretty cool monitoring tools. The tools were instrumental in helping identify and monitor the problem, telling me when I needed to spring into action to prevent lost sales. I have sort of a love / hate relationship with these tools because every time I said the problem was fixed they disagreed.

The relationship I have with those tools has now changed somewhat. I now love them. The monitors show that the web site is now running 20% faster than it was even when I thought it was performing well. To say I am happy with the results is an understatement.

I am blown away by how much fixing the TokenAndPermUserStore bloat helped overall application performance.

My Adventures with the SQL 2005 Security Cache

Introduction

For that past couple of months I have been chasing performance issues with one of the most high profile servers I support. Yes, I said months and high profile. Typically the pressure to find a quick fix would be immense, but luckily we have a strong team working the issue so we are able to work the process rather than throw band-aids at the issue until the problem is resolved. I should add that we are lucky enough to have amazing monitoring tools that allow us to spot a problem before it noticeably impact users and take corrective action.

The Problem

At somewhat regular intervals my SQL 2005 SP3 server with 128GB of RAM and enough SAN bandwidth for 3 servers would begin to get sluggish. The number of active user sessions would climb without end, while none of the sessions would be waiting on any resource. This usually happened in the afternoon on certain days of the week but could happen any time during the business day as long as the server was experiencing sufficient load. The problem was quickly mitigated by freeing the procedure cache on the server or even a single DB. Even running sp_updatestats by way of a slowly moving, single database DBCC FREEPROCCACHE, would fix the problem although performance was even more sluggish while it was running. Immediately after freeing enough cache the CPU would shoot way up and any backlog would quickly clear.

Troubleshooting

Being a DBA and seeing the world from a DBA perspective I first assumed the issue was related to code that was no longer optimal for the task. The symptoms sure pointed to a bad plan or cardinality issues. I spent a lot of time searching for opportunities to tune code and missing indexes, cutting the CPU usage of the server in half and cut average reads per second from 200k to 50k. Oddly enough this had no impact on the frequency of the problem.

Realizing I was dealing with something larger, I began to form a team to look into the issue. I pulled in subject matter experts from different areas of the organization to help look into this. Working through the troubleshooting process it was discovered that while new versions of monitoring tools, backup software and antivirus software had been added to the server the maximum memory setting had not been evaluated since the server was built 3 years ago. This server happens to be boot from SAN with synchronous replication for disaster recovery purposes. The primary page file is on non-replicated drives but a small page file is kept on the C drive to capture crash dumps. Watching the server run it was discovered that Windows seems to go to the page file on the C drive before going to any other. It quickly became clear that anything that forced the OS to go to the paging file would have a much higher performance cost than we had ever thought.  As a result of this realization, I dropped the memory setting down and saw a drop in CPU usage and IO waits. Oddly enough this had no impact on the frequency of the problem.

Working through the issue with the rest of the team the focus kept coming back to the question: “What else does freeing the procedure cache do?” Working through the research process on this I flushed out a lot of configuration changes made in the heat of the moment to confront this problem in the past. One of the things I kept noticing is how this resembled some issues I had seen when the server was new that were resolved by running DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’), but I remember that had been fixed in SQL 2005 SP2. The more research I did the more I kept coming back to the possibility of security cache bloat so I decided to find out more about it. I was surprised to discover that while some issues had been fixed there are still other issues that can cause the security cache to bloat and that there are now trace flags to manage the size of the cache. Armed with this new information I waited for the problem to occur again, this time running DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’). It immediately fixed the performance problem.

Why?

Working with smart people is both a blessing and a curse. The blessings are obvious and well covered elsewhere, but the curse is that smart people do not get smart without having a thirst for knowledge. What that means here is that I had to be able to explain what was bloating the security cache, just having the fix was not enough.

I truly believe that solving problems is much easier once you know what is wrong. This paradox is proven again here. Once I set out to see what was bloating the security cache, I discovered this article: http://blogs.msdn.com/psssql/archive/2008/06/16/query-performance-issues-associated-with-a-large-sized-security-cache.aspx that had everything I needed. The most important part was the first query that showed what was using the cache by type. I noticed that the majority of the entries had a class of 65535, leading me to this KB article: http://support.microsoft.com/kb/933564. The KB article explains what may cause the security cache to bloat. Reading through all of the conditions I noticed that CREATE, ALTER and DROP operations on a table object in any database (including TempDB) on the list. EUREKA!

There is a batch job that I had seen running during these times, but since it ran a lot and there was nothing wrong with the query plans it was using I had no way to prove it was involved in the issue. The code the batch job was calling did have some opportunities for improvement that I am working through with the developers that own it.  Unfortunately, the code is very high profile so reworking it is a very slow deliberate process. One of the issues with the code is that it uses temporary tables to implement procedural logic in certain places.

It appears that calling the code through a batch job leads to massive amounts of temporary table creation and destruction, causing the security cache to bloat, leading to higher CPU usage as the cache grows in size. Once the cache achieves critical mass, CPU usage drops and requests begin to queue. The requests will continue to run more slowly, causing further queuing until the DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’) command is issued.

Conclusion

I have been reminded of several things from this exercise.

  • Working through the troubleshooting process works
  • Knee-jerk reactions in the heat of the moment lead to more work later
  • What I knew then is not necessarily still true
  • Adding / Changing software on a server needs to be monitored
  • I need to make sure I am always keeping an open mind when chasing issues

 Now it is time to figure out what trace flag I want to use to address the security cache bloat. After that I get to start prioritizing all of the new items that this exercise has added to my to-do list.