How Do I Identify Invalid or Unneeded Logins?

Introduction

I was looking at one of my servers and was caught off-guard by how many individual users had been granted rights. To make it worse, I recognized a few of the user names as former coworkers. Clearly it was time for some house cleaning, which meant it was time to find a script.

The Process

I went looking in my script library for anything that might help me do this analysis faster and found a good start: a script to validate user logins. Well, more of a note really, based on a Twitter conversation between Thomas LaRock (Blog|Twitter) and Tim Ford (Blog|Twitter) about a stored procedure called sp_validatelogins, which reports Windows logins on your SQL instance that no longer map to valid accounts on your domain. Thomas LaRock does a great job of documenting it here.

Next I needed to figure out which users should have been granted access through groups instead. My first answer was all of them, but being realistic I decided to identify every user that has access to the server through both an individual login and a group, then see whether the group already has the rights the user needs or could (and should) be given them. It turns out this was easy: a cursor over the Windows logins with a call to xp_logininfo to check group membership tells me exactly what to look at, including the groups and the command to list them, because I often end up pasting that into an IM.

As I started working through test data I began to notice database users whose logins no longer exist on the server. Again, this one was pretty easy. A query against sys.server_principals with a right outer join to sys.database_principals quickly shows all users that exist in a database but have no login on the server. Adding an insert into a temp table and wrapping it all in sp_MSforeachdb put it together into a nice recordset that I could then work from.

The Script

Here is the script I ended up with; run it with results to text rather than grid for more readable output. I would advise against trying to automate this any further, or even taking any action, without carefully researching the impact of each change before making it. Messing with security is a quick way to get a lot of attention.

SET NOCOUNT ON

CREATE TABLE #invalid_logins
(
[sid] varbinary(85),
[name] sysname
)

CREATE TABLE #Logins
(
account_name sysname NULL,
type char(8) NULL,
privilege char(9) NULL,
mapped_login_name sysname NULL,
permission_path sysname NULL
)

INSERT #invalid_logins
EXEC sp_validatelogins

DECLARE @name sysname

DECLARE csr CURSOR FAST_FORWARD READ_ONLY FOR
SELECT sl.name
FROM sys.syslogins sl
LEFT OUTER JOIN #invalid_logins il
ON sl.name = il.name
WHERE sl.isntname = 1
AND sl.isntgroup = 0
AND il.name IS NULL

OPEN csr

FETCH NEXT FROM csr INTO @name

WHILE @@FETCH_STATUS <> -1
BEGIN
INSERT #Logins
EXEC xp_logininfo @acctname=@name, @option='all'
FETCH NEXT FROM csr INTO @name
END

CLOSE csr
DEALLOCATE csr

PRINT 'The following logins are no longer valid and should be removed from this server:'
SELECT name
FROM #invalid_logins

PRINT 'The following logins have access by login and by group and might not need access by login:'
SELECT COUNT(*) AS num_of_groups,
account_name,
REPLACE(RTRIM(( SELECT permission_path + ' '
FROM #Logins
WHERE account_name = l.account_name
FOR XML PATH('')
)), ' ', ', ') AS group_names,
'EXEC xp_logininfo @acctname=' + QUOTENAME(account_name) + ', @option=''all''' AS command_to_see_groups
FROM #Logins l
GROUP BY account_name,
[type]
HAVING COUNT(*) > 1
ORDER BY num_of_groups DESC

DROP TABLE #invalid_logins
DROP TABLE #Logins

PRINT 'The following result set shows users that were deleted from the server but not the individual databases'
CREATE TABLE #databases_with_orphan_users
(
db_with_orphan_login sysname,
orphan_login sysname,
permissions_count int
)

EXEC sp_MSforeachdb 'USE ?
INSERT #databases_with_orphan_users
SELECT ''?'' AS db_with_orphan_login,
dp.name,
-- grantee_principal_id is a database principal ID, so compare it to dp.principal_id
(SELECT COUNT(*) FROM sys.database_permissions WHERE grantee_principal_id = dp.principal_id) AS permissions_count
FROM sys.server_principals sp
RIGHT OUTER JOIN sys.database_principals dp
ON sp.sid = dp.sid
WHERE sp.sid IS NULL
AND dp.type NOT IN (''R'')
AND dp.name NOT IN (''guest'', ''sys'', ''INFORMATION_SCHEMA'')
AND dp.type_desc NOT IN (''APPLICATION_ROLE'')'

SELECT db_with_orphan_login,
orphan_login,
permissions_count
FROM #databases_with_orphan_users

DROP TABLE #databases_with_orphan_users

Conclusion

I hope you find this script helpful. I have a feeling it will be one of my favorites because it takes a lot of tedious analysis and boils it down to a quick hit list. Please let me know if you run into any issues with it. As always, scripts from the internet are like Halloween candy: inspect before consumption. I offer no warranty beyond a sympathetic ear if you should run into any issues.

Is Anybody Using That Stored Procedure?

Introduction

From time to time I get asked to check whether a stored procedure is still used, in preparation for dropping it. Last week I tweeted my wish for a DMV that would give me usage statistics for stored procedures, to which Aaron Bertrand (Blog|Twitter) quickly replied that SQL 2008 has sys.dm_exec_procedure_stats, which does exactly what I need. Unfortunately the server I needed to do the analysis on was running SQL 2005.
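
On an instance that is running 2008, a minimal sketch of that DMV check looks something like this (the database and procedure names are placeholders):

-- SQL Server 2008 and later only; names below are placeholders
SELECT  DB_NAME(ps.database_id) AS database_name,
        OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name,
        ps.execution_count,
        ps.cached_time,
        ps.last_execution_time
FROM    sys.dm_exec_procedure_stats ps
WHERE   ps.database_id = DB_ID('YourDatabase')
            AND ps.object_id = OBJECT_ID('YourDatabase.dbo.YourProcedure')

Keep in mind that, like the plan cache query later in this post, it only reflects procedures whose plans are currently cached, so an empty result is a hint rather than proof.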

Where It Went So Wrong

Rather than digging into my script library to see if I had anything that might help with the analysis, I fell into classic tunnel vision, cursing myself for not pushing harder to get this application onto SQL 2008. I decided that the only way to figure out if these stored procedures were still used was to fire up Profiler and trace for the stored procedure names. I was not looking forward to it because it is easy to get a trace wrong, either capturing no data or bogging down the server, but I pushed on. I built my server-side trace script, tried it out on a less-used server, then fired it up on my production server. The production server immediately went to 100% CPU, completely killing all applications that depended on it. The only way to get the applications back online was to restart the SQL Server service. This was absolutely the worst I have ever been burned by Profiler.
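
For context, the trace I am talking about looked roughly like the sketch below; treat it as a reconstruction for illustration, with the output path, file size and procedure-name pattern as placeholders. The LIKE filter on TextData is the part that turned out to be so expensive.

-- Rough sketch of a server-side trace filtered on a procedure name; values are placeholders
DECLARE @traceid int,
        @maxfilesize bigint,
        @on bit

SELECT  @maxfilesize = 512,
        @on = 1

EXEC sp_trace_create @traceid OUTPUT, 2, N'X:\Traces\proc_usage', @maxfilesize, NULL

-- Capture TextData (column 1) for RPC:Completed (event 10) and SQL:BatchCompleted (event 12)
EXEC sp_trace_setevent @traceid, 10, 1, @on
EXEC sp_trace_setevent @traceid, 12, 1, @on

-- The string filter: comparison operator 6 = LIKE, applied to the text of every captured event
EXEC sp_trace_setfilter @traceid, 1, 0, 6, N'%usp_SuspectProcedure%'

EXEC sp_trace_setstatus @traceid, 1 -- start the trace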

A Better Way

Last night I decided I was done replaying the events of the day over and over in my head and threw what had happened out to #sqlhelp on Twitter to get advice on what I could have done better. The basic feedback was to avoid filtering a trace on multiple strings, and if you have to, split the work across multiple traces. There really was no magic tipping point that anybody could point to as to why doing something in one place would have such disastrous results while doing it somewhere else would not. This really points to what I have known for a long time: Profiler is not your friend. Profiler is more like that former coworker you could not trust but had to work with on some of your most important projects.

Talking to Andrew Kelly (Blog|Twitter) led to one of my more pronounced face-palm moments, ever. He pointed out that “2005 has similar dmv’s but if its a seldom run query the plan may not be in cache most of the time.” In iteration 1 of the same project I ran into less pronounced issues tracing stored procedures to make sure they were no longer used. At that time I wrote a script to check whether an object had a plan in the cache. That very script was sitting out in my common scripts directory the whole time. Ultimate Face-Palm.

The Script

Here is the script I ultimately used to figure out whether the stored procedures were used. Pulling back the plan is optional, but I find it helpful because it will tell you what parameters an object was compiled with. This can often point to where it was called from, or even which user called it. Be careful with this query: if you choose to run it on a production server you may see a performance impact.

Note that I explicitly specify the database name rather than rely on a USE statement. When the absence of results is what you are looking for, it is too easy to mistake an error for success.

Updated to correct bug caused by improper usage of OBJECT_ID, pointed out by Simon Sabin (blog). I opted to keep the CROSS APPLY rather than going with an OUTER APPLY because I only want to return results for plans that are cached. The OBJECT_ID syntax is taken directly from here: http://msdn.microsoft.com/en-us/library/ms190328.aspx. I could not make it work by passing DB_ID as a parameter.

SELECT  q.text,
        p.query_plan, *
FROM    sys.dm_exec_query_stats qs
        CROSS APPLY sys.dm_exec_sql_text(qs.plan_handle) q
        CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) p
WHERE   q.dbid = DB_ID('[ database_name ]')
            AND q.objectid = OBJECT_ID('[ database_name . [ schema_name ] . | schema_name . ] object_name ]')

Conclusion

I learned a lot of tough lessons on this one, tougher still because they were lessons I should have already learned. Hopefully this post and the included script will prevent someone else from falling into the same trap. Please let me know if you run into any issues with the script or if it bails you out of any jams. As always, scripts from the internet are like Halloween candy: inspect before consumption. I offer no warranty beyond a sympathetic ear if you should run into any issues.

Get Users By Connection

Introduction

I was recently working a production issue that required me to log in via the Dedicated Administrator Connection (DAC). When I tried to log in I was greeted with the following message:

Could not connect because the maximum number of ‘1’ dedicated administrator connections already exists. Before a new connection can be made, the existing dedicated administrator connection must be dropped, either by logging off or ending the process.

After searching through my scripts folder and not finding anything that would tell me who was connected via the DAC, I turned to Google. I found a lot of posts detailing how to use the DAC, but none explaining how to tell who is using it.

Once the emergency was taken care of, I made a point of going back and writing a query that shows who is using the DAC. To help the community, or at least the part of it that relies on Google like I do, I am also blogging it and adding appropriate search tags.

The Query

I know, I know: enough with the backstory already, let's see some code. The code I have posted here will show all users by connection, although I have included a commented-out WHERE clause to filter to just the DAC connection. Here is the SQL:

SELECT  ep.name,
        ep.protocol_desc,
        es.login_name,
        es.nt_user_name,
        es.host_name
FROM    master.sys.dm_exec_sessions es
        INNER JOIN master.sys.tcp_endpoints ep
            ON es.endpoint_id = ep.endpoint_id
--WHERE ep.endpoint_id = 1 --Uncomment WHERE to show who is on DAC Connection

Conclusion

Hopefully this script is helpful. Unfortunately it will not tell you who is connected via the DAC when nobody is able to connect to the SQL instance in question; netstat from a command line on the server may help there. Please let me know if you run into any issues, have any ideas that would make it better or just want to share how you are using it. As always, scripts from the internet are like Halloween candy: inspect before consumption. I offer no warranty beyond a sympathetic ear if you should run into any issues.

Managing the Size of the Security Cache (TokenAndPermUserStore) in SQL 2005

Introduction

In an earlier post I walked through the diagnosis of a problem with TokenAndPermUserStore bloat in SQL 2005. In this post I will go through what I did to work through the issue and the results of those efforts.

Triage

Once I understood that the issue was TokenAndPermUserStore bloat, the first step was to stop the bleeding. The quick fix I chose was to create a job that ran every 30 minutes to issue a DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’). Once the job was in and running the random server slowdowns immediately stopped. At this point I knew I was on the right track even if all I had was a band-aid.
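
For reference, the band-aid is nothing more exotic than the DBCC call itself; the job wrapper below is a sketch using names of my own choosing:

-- Sketch of the cache-clearing job; job, step and schedule names are placeholders
USE msdb
GO

EXEC dbo.sp_add_job @job_name = N'DBA - Free TokenAndPermUserStore'

EXEC dbo.sp_add_jobstep @job_name = N'DBA - Free TokenAndPermUserStore',
        @step_name = N'Free security cache',
        @subsystem = N'TSQL',
        @database_name = N'master',
        @command = N'DBCC FREESYSTEMCACHE(''TokenAndPermUserStore'')'

EXEC dbo.sp_add_jobschedule @job_name = N'DBA - Free TokenAndPermUserStore',
        @name = N'Every 30 minutes',
        @freq_type = 4,             -- daily
        @freq_interval = 1,
        @freq_subday_type = 4,      -- units of minutes
        @freq_subday_interval = 30

EXEC dbo.sp_add_jobserver @job_name = N'DBA - Free TokenAndPermUserStore',
        @server_name = N'(local)'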

Research

There is not a lot of good documentation out there on TokenAndPermUserStore bloat. I am guessing not too many people are running into it. This post is the best resource I have found. I had already installed all of the suggested service packs and cumulative updates so that left me with trace flags and fixing the underlying code.

While changing the code seemed like the most obvious fix, some of the code causing the problem is already optimal for its purpose. Of all the alternatives that have been tried, the current code is the best way to get the job done. Going back and asking for resources to change proven code to deal with what I see as a SQL Server problem just did not seem right.

Ruling out a code fix left me with two options: let the cache-clearing job keep running every 30 minutes and hope nobody ever turns it off, or add a trace flag to control the behavior of the server. I ruled out leaving the job running and set out to learn everything I could about –T4618. What I quickly learned is that there really is not much information out there on the subject. What I was able to find through various sources is that –T4618 limits the number of entries in the TokenAndPermUserStore to 1024. I know from watching the counts in the cache that every time I free it the entry count quickly shoots back up to about 5000 and then grows slowly from there, so –T4618 on its own would cause a huge amount of churn and CPU use. I then found that to get around the 1024-entry limitation I could also apply –T4610, raising the limit to 8192 entries. 8192 sounded pretty good given the behavior I was seeing.
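
To watch the counts and the size of the cache, queries along these lines work on 2005; treat them as a sketch, since the relevant DMV column names changed in later versions:

-- Size of the security cache in KB (2005/2008 column names)
SELECT  SUM(single_pages_kb + multi_pages_kb) AS token_perm_kb
FROM    sys.dm_os_memory_clerks
WHERE   name = 'TokenAndPermUserStore'

-- Number of entries currently in the cache
SELECT  name,
        entries_count
FROM    sys.dm_os_memory_cache_counters
WHERE   name = 'TokenAndPermUserStore'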

I did also run across this article on using –T4621 to customize the size of the Security Cache, but decided it was not for me due to the complexities of picking a good number, converting it to binary and editing the registry on the cluster. In SQL 2008 the value can be changed with an sp_configure call (no trace flag), making it a much more viable option. For my SQL 2005 server I saw –T4621 as something to try if –T4618 and –T4610 did not work the way I expected.

Turning on the Trace Flags

I found a number of articles that said you can turn –T4618 and –T4610 on while the server is running. Not true. You have to add them as startup parameters using SQL Configuration Manager then restart your server, raising the stakes somewhat if I guessed wrong on the setting.

I decided I was comfortable enough with my estimates that I pushed through all of the paperwork and, during a slack time on the system, added –T4618;-T4610 to the list of startup parameters on both cluster nodes. I then took the cluster resource offline and brought it back online. It all happened so fast that it was almost hard to tell I had done anything. The only real evidence, besides the log rolling over at an odd time, is an entry in the SQL log that says “–T4618”, another that says “–T4610” and finally one that says “TokenAndPermUserStore Memory Quota set by -T4618 as 8192.”

The Results

The primary application that uses this SQL Server is a web site. It does pretty decent volume so over the years the business has invested in some pretty cool monitoring tools. The tools were instrumental in helping identify and monitor the problem, telling me when I needed to spring into action to prevent lost sales. I have sort of a love / hate relationship with these tools because every time I said the problem was fixed they disagreed.

The relationship I have with those tools has now changed somewhat. I now love them. The monitors show that the web site is now running 20% faster than it was even when I thought it was performing well. To say I am happy with the results is an understatement.

I am blown away by how much fixing the TokenAndPermUserStore bloat helped overall application performance.

My Adventures with the SQL 2005 Security Cache

Introduction

For the past couple of months I have been chasing performance issues with one of the most high-profile servers I support. Yes, I said months and high profile. Typically the pressure to find a quick fix would be immense, but luckily we have a strong team working the issue, so we are able to work the process rather than throw band-aids at it until the problem is resolved. I should add that we are lucky enough to have amazing monitoring tools that allow us to spot a problem and take corrective action before it noticeably impacts users.

The Problem

At somewhat regular intervals my SQL 2005 SP3 server, with 128GB of RAM and enough SAN bandwidth for 3 servers, would begin to get sluggish. The number of active user sessions would climb without end, while none of the sessions would be waiting on any resource. This usually happened in the afternoon on certain days of the week but could happen any time during the business day as long as the server was under sufficient load. The problem was quickly mitigated by freeing the procedure cache, either for the whole server or even a single database. Even running sp_updatestats, which acts as a slow-moving, single-database DBCC FREEPROCCACHE, would fix the problem, although performance was even more sluggish while it ran. Immediately after enough cache was freed, CPU would shoot way up and any backlog would quickly clear.

Troubleshooting

Being a DBA and seeing the world from a DBA perspective, I first assumed the issue was related to code that was no longer optimal for the task. The symptoms sure pointed to a bad plan or cardinality issues. I spent a lot of time hunting for code to tune and missing indexes to add, cutting the server's CPU usage in half and cutting average reads per second from 200k to 50k. Oddly enough, this had no impact on the frequency of the problem.

Realizing I was dealing with something larger, I began to form a team to look into the issue, pulling in subject matter experts from different areas of the organization. Working through the troubleshooting process, we discovered that while new versions of monitoring tools, backup software and antivirus software had been added to the server, the maximum memory setting had not been evaluated since the server was built 3 years ago. This server happens to boot from SAN with synchronous replication for disaster recovery purposes. The primary page file is on non-replicated drives, but a small page file is kept on the C drive to capture crash dumps. Watching the server run, we discovered that Windows seems to go to the page file on the C drive before going to any other. It quickly became clear that anything that forced the OS to page would have a much higher performance cost than we had ever thought. As a result of this realization, I dropped the max memory setting down and saw a drop in CPU usage and IO waits. Oddly enough, this had no impact on the frequency of the problem.

Working through the issue with the rest of the team, the focus kept coming back to the question: “What else does freeing the procedure cache do?” Researching this, I flushed out a lot of configuration changes made in the heat of the moment to confront this problem in the past. One of the things I kept noticing is how closely this resembled issues I had seen when the server was new, which had been resolved by running DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’), but I remembered that those had been fixed in SQL 2005 SP2. The more research I did, the more I kept coming back to the possibility of security cache bloat, so I decided to find out more about it. I was surprised to discover that while some issues had been fixed, there are still others that can cause the security cache to bloat, and that there are now trace flags to manage the size of the cache. Armed with this new information, I waited for the problem to occur again, this time running DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’). It immediately fixed the performance problem.

Why?

Working with smart people is both a blessing and a curse. The blessings are obvious and well covered elsewhere, but the curse is that smart people do not get smart without having a thirst for knowledge. What that means here is that I had to be able to explain what was bloating the security cache; just having the fix was not enough.

I truly believe that solving problems is much easier once you know what is wrong, and that proved true again here. Once I set out to see what was bloating the security cache, I discovered this article: http://blogs.msdn.com/psssql/archive/2008/06/16/query-performance-issues-associated-with-a-large-sized-security-cache.aspx, which had everything I needed. The most important part was the first query, which showed what was using the cache by type. I noticed that the majority of the entries had a class of 65535, leading me to this KB article: http://support.microsoft.com/kb/933564. The KB article explains what can cause the security cache to bloat. Reading through the conditions, I noticed that CREATE, ALTER and DROP operations on a table object in any database (including TempDB) were on the list. EUREKA!
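
I will not reproduce the article's query here, but a rough sketch that groups the security cache entries by class looks something like this:

-- Group TokenAndPermUserStore entries by the class attribute in entry_data (sketch)
SELECT  class,
        COUNT(*) AS entry_count
FROM    (
            SELECT  CAST(entry_data AS xml).value('(/entry/@class)[1]', 'int') AS class
            FROM    sys.dm_os_memory_cache_entries
            WHERE   name = 'TokenAndPermUserStore'
                        AND entry_data IS NOT NULL
        ) AS entries
GROUP BY class
ORDER BY entry_count DESC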

There is a batch job that I had seen running during these times, but since it ran often and there was nothing wrong with the query plans it was using, I had no way to prove it was involved in the issue. The code the batch job was calling did have some opportunities for improvement that I am working through with the developers that own it. Unfortunately, the code is very high profile, so reworking it is a slow, deliberate process. One of the issues with the code is that it uses temporary tables to implement procedural logic in certain places.

It appears that calling the code through a batch job leads to massive amounts of temporary table creation and destruction, causing the security cache to bloat, leading to higher CPU usage as the cache grows in size. Once the cache achieves critical mass, CPU usage drops and requests begin to queue. The requests will continue to run more slowly, causing further queuing until the DBCC FREESYSTEMCACHE(‘TokenAndPermUserStore’) command is issued.

Conclusion

I have been reminded of several things from this exercise.

  • Working through the troubleshooting process works
  • Knee-jerk reactions in the heat of the moment lead to more work later
  • What I knew then is not necessarily still true
  • Adding / Changing software on a server needs to be monitored
  • I need to make sure I am always keeping an open mind when chasing issues

Now it is time to figure out which trace flag I want to use to address the security cache bloat. After that I get to start prioritizing all of the new items that this exercise has added to my to-do list.

Community Based Training

Introduction

Recently I have been thinking through the best ways to get the people I support trained to the same level in SQL. The constraints I am running into are that I do not have the time, or maybe even the skill, to develop a syllabus and course materials to launch my own SQL Server course. I also know that getting training dollars appropriated for large-scale training is difficult even in good economic times.

The Idea

I see the need for an option that splits the difference.

There is a real underserved niche in the market for DBA-delivered, community-built training materials. The idea would be to work together as a community to create common course materials and delivery scripts that would be presented by individual DBAs to the teams they support. Everyone would get roughly the same experience while having a trainer who can speak directly to their specific questions or even look at examples using their own data.

The First Course

The first course I see in the series would be an introduction to T-SQL. I am currently reading Microsoft SQL Server 2008 T-SQL Fundamentals (PRO-Developer) by Itzik Ben-Gan to pick up pointers, and it strikes me how teachable the text is. I feel like if I could just get people to read the book they would be trained. I think that if a few people came together and developed presentations, reading assignments and homework assignments, the result would be a powerful teaching tool that any DBA could use to further their organization. The key is that this would be extremely cheap training owned by the community.

I need to pause here and add that I have never developed courseware so I am not sure of the copyright implications or any other details that go with creating courseware from someone else’s text. I picture an author being happy that books will be bought for each participant, but I am sure there is much more to it than I am thinking of.

I Need Feedback

Does this sound like a good idea?

Would you be interested in working on it?

Do you know of something like this that already exists?

Please let me know. I would have to consider no feedback at all to be strong feedback as well.

Why Do My SQL 2008 Compressed Backups Shrink?

Introduction

Recently a coworker was doing some research with SQL 2008 Backup Compression on a very large database. During the testing they noticed that the backup file shrank rather than grew.

The Facts

The database that the test was being run on was 360GB in size. The backup initially created a file that was 120GB in size. The file then shrank to 43GB by the time the backup was complete. The file only shrinks when the drive has sufficient space for a file that is 1/3 the size of the database. If there is not sufficient space the backup file will grow from 0.
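
For context, the test amounts to a plain compressed backup along these lines (the database name and target path are placeholders), watching the size of the backup file while it runs:

-- SQL Server 2008 backup compression; database name and path are placeholders
BACKUP DATABASE YourBigDatabase
TO DISK = N'X:\Backups\YourBigDatabase.bak'
WITH COMPRESSION, INIT, STATS = 10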

The Concern

The concern that immediately came to mind is that the backup might be writing a file to the file system and then compressing it. Compressing the file after it is written would limit the ability to back up to anywhere but a local drive. It would also mean that the local drive would have to be optimized for read-write instead of just write.

The Reality

I recently had a chance to work with someone from Microsoft that was able to get with the right people to figure out what was going on. It turns out that this behavior is a feature meant to enhance performance.

Here is what they had to say:

The issue is that we attempt to pre-allocate the backup files so that they don’t keep getting extended during the backup process, which slows things down locally, and REALLY makes them slow (>1 order of magnitude) if your backup file is on an SMB share. 

For uncompressed backups, we have a pretty good idea of how much data we’ll be backing up.

For compressed backups, the eventual size depends on how compressible the data is, which we can’t tell beforehand.

So, we chose a value which is an average result in customer databases, which is 1/3 of the volume of data to be backed up.

If the backup ends up being larger than this estimate, we’ll grow the file as needed. If it ends up being smaller, we’ll trim the file back to the size actually used.

Conclusion

My fears of performance issues were unfounded. I was also surprised to find out that the file could grow as well as shrink, although it makes perfect sense when I think it through. The big takeaway I see is making sure the locations I back up to have free space equal to at least 1/3 the size of the largest database backed up there, to get the best possible performance. This sounds like a good excuse to write a PowerShell script.
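
Until that PowerShell script exists, a rough T-SQL sketch of the estimate, based on allocated data file size rather than the data actually backed up, might look like this:

-- Rough pre-allocation estimate (1/3 of allocated data file size) per database
SELECT  d.name AS database_name,
        CAST(SUM(mf.size) * 8 / 1024.0 AS decimal(18, 1)) AS data_size_mb,
        CAST(SUM(mf.size) * 8 / 1024.0 / 3 AS decimal(18, 1)) AS estimated_preallocation_mb
FROM    sys.master_files mf
        INNER JOIN sys.databases d
            ON mf.database_id = d.database_id
WHERE   mf.type_desc = 'ROWS'
GROUP BY d.name
ORDER BY estimated_preallocation_mb DESC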

More generally, the SQL Server community is amazing. The mere act of participating in the community means that at some point you will come into contact with someone that can answer just about any question you could have. I am in awe.

Why Do I Need All These Servers?

Introduction

I frequently get requests for a new server to do this or that. Running through the usual questions to determine scope and scale, the price can climb pretty quickly. Inevitably, talk turns to ideas on how to cut the cost while still delivering a robust platform. Typically this is when the questions about skipping the Test, QA or Stage environments come out. The logic is usually that a production server is a must-have, a Development server is good to have and anything else is gravy.

Applications running third-party (off-the-shelf) tools can get by just fine without even a Development environment, while heavily customized applications need many more environments.

My goal here is to lay out the different environments as well as what I see as the use for each.

Local

The Local environment refers to an individual developer’s desktop. On this desktop there may be a database server, web server and application server. This is the primary Development environment where things are often built and tried out without fear of angering the rest of the project team.

Chances are that if your organization has a developer you have a Local environment.

Development

The Development environment is the place where everyone’s work starts to come together. It is the first environment that would contain any true servers. Depending on application architecture, this environment may be entirely composed of servers or in the case of client applications still include developer desktops.

Servers in the Development environment are typically given the same drive layout as production to allow for restores and testing of scripts, but may have only a fraction of the processors, memory and spindles that a production server would have.

I like to compare the Development environment to a busy lake on a hot summer day. There are a lot of people doing their own thing, headed in their respective directions at high speed. The size of your boat (project) usually determines who has the right of way and everyone must adhere to the local customs while keeping their eyes open to prevent collisions.

Testing

The Testing environment is much more of a nice-to-have. For all intents and purposes it is a second Development environment with one important difference: change control. The Testing environment is where code is tested before being turned over to the QA department. It is also the environment where builds are rehearsed to minimize downtime during releases to the QA environment. Developers typically do not have access to change any objects in the Testing environment.

It is very important that the drive layouts match across all environments for manageability purposes. It is even more so for the Testing environment due to how often restores are done to quickly get back to a clean state before testing the next build.

QA

The QA environment is one of the more important environments. This is the environment where the QA team does their magic. The usage pattern here tends to be low volume and slow paced, so hardware for the QA environment can be the same as that used in the Development environment.

The big thing to remember about the QA environment is that it must be as stable as possible. This is not the best environment to rehearse releases because any problems are likely to put the QA team behind schedule but, absent a test environment, it is the only place many organizations have.

Staging

The Staging environment is tied with production for being the most expensive environment. The Staging environment must be run on exactly the same equipment as production. The typical usage of Staging is automated load testing, user acceptance testing and even training on important new features.

Without a Staging environment careful scheduling would have to be done to prevent collisions in the QA environment. Otherwise QA testers may be entering defects for behaviors caused by user acceptance or load testing activities. User acceptance testers may also report back to their peers that the changes to the application make it slower when in fact it is just the environment they are testing in.

Many organizations house their Staging and Production servers in separate locations for disaster recovery purposes. Attempting to do the same with only a QA environment would make the release process much more complicated in a disaster recovery scenario as changes would have to go directly from Development or Testing to Production.

Production

This is where the magic happens. I think everyone understands the need for a Production server, so I am not going to spend a lot of time here. The point I would like to make is that the more money you spend on your production environment, the more environments you should have to back it up. There is no point in spending a small fortune on the best HA cluster you can find without having the capability to do real release management and disaster recovery, or worst case, release management during disaster recovery.

Conclusion

I hope I have provided enough information to begin asking the right questions when working with users to price out servers. Being able to properly educate users either helps them come to a realistic evaluation of their needs or helps them build a business case for 6 or 7 brand new servers that you can brag to your DBA friends about.

Stored Procedure to Sequentially Run SQL Agent Jobs

Introduction

Here is another one of my utility scripts. Lately it seems like all I am blogging about is utility scripts, but I have a ton of them and need to get them documented. This stored procedure is used to run a series of SQL Agent jobs sequentially. It is one of my favorite stored procedures because it clears up a lot of headaches for me without having to spend a bunch of money or turn my maintenance schedules over to another team.

The Problem

I was always finding myself rearranging maintenance job schedules to keep them as close together as possible without overlapping. Inevitably I would guess wrong and have jobs overlapping in one part of the schedule, hammering the server unnecessarily, while nothing ran at other times, wasting my maintenance window. This problem is magnified on multi-instance clusters because jobs can overlap on different instances on the same node, leading to harder-to-discover performance issues.

I know some people are thinking this sounds like a perfect use for Maintenance Plans, but I have a ton of servers. To have to install and maintain SSIS on each of them would be painful enough, but having to open up every maintenance plan and change it via the GUI on every server for even the smallest change is just too much. I need something I can script and I need to be able to use different parameters on different databases.

The Solution

I wrote a stored procedure, sp_dba_run_unscheduled_jobs, to run all jobs with a given prefix in order, starting with the jobs that have gone the longest without running and then alphabetically. The main idea is to make maintenance single-threaded to lessen its impact on any off-hours user or batch job activity. The stored procedure includes a maximum-minutes-of-runtime parameter to stop kicking off new jobs beyond a certain time.

An added benefit of this stored procedure is that the jobs actually get done faster when they are not in contention with each other. When using it to run maintenance jobs, I use a single job with step 1 running all of the index jobs and step 2 running all of the CheckDB jobs. I have not yet found a situation where I have had to invoke the time limit.

The stored procedure is written to run only the jobs that do not have a schedule, or at least do not have an enabled schedule, so I can still add schedules to jobs that I want to run at a specific time. This works out well when I have a reindex and CheckDB for a large database that takes up my entire maintenance window: I can add schedules to those long-running jobs and just let the maintenance for the other databases run single-threaded at the same time.

Recently, I began using this stored procedure to fire off restore jobs. I set up a restore job for each database so I can restore a single database on demand, including saving permissions, flushing replication, etc., and then putting everything except replication back after the restore. The way the stored procedure works, I am able to disable the jobs for the databases I do not want restored, then fire off the restore jobs for the remaining databases sequentially.

The Code

The code for the stored procedure is fairly straightforward. It looks for jobs that begin with the given prefix and are enabled but either have no schedule or have only a disabled schedule. The disabled schedule matters because it lets the schedule be retained in case it turns out the job has to run at its scheduled time after all. The procedure then loops through that list, checking whether the next job is already running. If the job is not running and the current time is not past the end time, it starts the job, pausing before re-checking the status because SQL Agent can take a second to report that the job has started. When the current job stops running, it grabs the next one, checks the time and, if not past the end time, starts that job. Repeat until there are no jobs left to run.

This stored procedure depends on the stored procedure: sp_dba_GetSqlJobExecutionStatus.

Here is the code:

CREATE PROCEDURE [dbo].[sp_dba_run_unscheduled_jobs] @job_name_prefix sysname, @max_minutes_runtime int = null, @days_in_job_cycle int = 1, @mail_job_name varchar(256) = '', @mail_profile_name sysname = null, @mail_recipients varchar(max) = null
AS

 BEGIN     

    DECLARE @job_name               sysname,
            @execution_status       int,
            @row_count              int,
            @last_run_date          int,
            @start_time             datetime,
            @stop_time              datetime,
            @min_date               datetime,
            @min_int_date           int,
            @int_date               int

    SELECT  @min_date = DATEADD(dd, 0, DATEDIFF(dd, 0, GETDATE())),
            @days_in_job_cycle = CASE WHEN @days_in_job_cycle < 1 THEN 1 ELSE @days_in_job_cycle END --Prevent Infinite Loop

    SELECT  @min_int_date = YEAR(@min_date) * 10000 + MONTH(@min_date) * 100 + DAY(@min_date) - @days_in_job_cycle,
            @int_date = YEAR(GETDATE()) * 10000 + MONTH(GETDATE()) * 100 + DAY(GETDATE())


    SELECT  @row_count  = 1,
            @job_name   = '',
            @start_time = GETDATE(),
            @stop_time  = DATEADD(mi, @max_minutes_runtime, GETDATE())

    WHILE @row_count > 0 AND (@stop_time IS NULL OR @stop_time > GETDATE())
      BEGIN

        SELECT      TOP 1
                    @job_name = sj.name,
                    @last_run_date = sjh.run_date
        FROM        msdb.dbo.sysjobs sj
                    LEFT OUTER JOIN msdb.dbo.sysjobschedules sjs
                        ON sj.job_id = sjs.job_id
                    LEFT OUTER JOIN msdb.dbo.sysschedules ss
                        ON sjs.schedule_id = ss.schedule_id
                            AND ss.[enabled] = 1
                    LEFT OUTER JOIN (
                                        SELECT      job_id,
                                                    MAX(run_date) AS run_date
                                        FROM        msdb.dbo.sysjobhistory
                                        WHERE       step_id = 0
                                                        AND run_date > @min_int_date
                                        GROUP BY    job_id 
                                    ) sjh
                        ON sj.job_id = sjh.job_id
        WHERE       (sjs.job_id IS NULL OR ss.schedule_id IS NULL)
                        AND sj.name LIKE @job_name_prefix
                            AND (sjh.run_date IS NULL OR sj.name > @job_name OR sjh.run_date < @int_date)
                                AND sj.[enabled] = 1
        ORDER BY    ISNULL(sjh.run_date, 0),
                    sj.name
       
        SELECT  @row_count = @@ROWCOUNT

        IF @row_count > 0
         BEGIN

            EXEC dbo.sp_dba_GetSqlJobExecutionStatus @job_name = @job_name, @execution_status = @execution_status OUTPUT
                       
            IF @execution_status = 4 --Make sure job is not running
                EXEC msdb.dbo.sp_start_job @job_name = @job_name
               
            WAITFOR DELAY '00:00:02.000' -- Pause here to make sure the job gets started before checking the status

            EXEC dbo.sp_dba_GetSqlJobExecutionStatus @job_name = @job_name, @execution_status = @execution_status OUTPUT

            WHILE @execution_status != 4
             BEGIN
                WAITFOR DELAY '00:00:01.000'
                EXEC dbo.sp_dba_GetSqlJobExecutionStatus @job_name = @job_name, @execution_status = @execution_status OUTPUT
             END
             
         END
         
        IF @stop_time IS NOT NULL AND @stop_time <= GETDATE() -- time limit reached; send the shutdown notification
         BEGIN
            DECLARE @subject nvarchar(255),
                    @body    nvarchar(max)

            SELECT  @subject = @mail_job_name + ' on ' + @@SERVERNAME + ' - Shutting down...Time limit reached.',
                    @body = @mail_job_name + ' on ' + @@SERVERNAME + ' shut down after '
                            + CAST(ABS(DATEDIFF(mi, GETDATE(), @start_time)) AS nvarchar) + ' minutes.'

            EXEC msdb.dbo.sp_send_dbmail    @profile_name = @mail_profile_name,
                                            @recipients = @mail_recipients,
                                            @subject = @subject,
                                            @body = @body
         END
       
     END
       
 END

GO

EXEC sp_MS_marksystemobject 'sp_dba_run_unscheduled_jobs'
GO
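
Here is a hypothetical example of calling it; note that the prefix is compared with LIKE, so include a wildcard. Because the procedure lives in master and is marked as a system object, it can be called from any database.

-- Hypothetical job names, profile and recipients; the prefix is matched with LIKE, hence the %
EXEC dbo.sp_dba_run_unscheduled_jobs
        @job_name_prefix     = N'Maintenance - Index%',
        @max_minutes_runtime = 240,
        @mail_job_name       = 'Nightly index maintenance',
        @mail_profile_name   = N'DBA Mail',
        @mail_recipients     = 'dba-team@example.com'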

Conclusion

This is my new favorite utility stored procedure because I keep finding new uses for it. I hope you find it as useful as I do. Please let me know if you run into any issues, have any ideas that would make it better or just want to share how you are using it. As always, scripts from the internet are like Halloween candy: inspect before consumption. I offer no warranty beyond a sympathetic ear if you should run into any issues.

Stored Procedure to Get SQL Job Execution Status

Introduction

Ever need a programmatic way to figure out whether a SQL Agent job is running? I did. After a while spent searching on Google without turning up anything I liked, I turned to Profiler and started looking at the calls SQL Server Management Studio makes. After a bit of poking around I was able to come up with a stored procedure that I could call.

The Meat

The stored procedure is pretty straightforward: get the job ID and owner from the system tables, declare a temporary table, call xp_sqlagent_enum_jobs with the results going into that table, then place the job state in an output variable and optionally produce a recordset.
Comments at the top of the stored procedure outline the various statuses as of the time of writing. My general rule is that everything except a status of 4 means the job is running.

Here is the code:

USE [master]
GO

CREATE PROCEDURE [dbo].[sp_dba_GetSqlJobExecutionStatus](@job_name sysname, @select_data int =0, @execution_status int = NULL OUTPUT)

AS

SET NOCOUNT ON

/*
Is the execution status for the jobs.
Value Description
0 Returns only those jobs that are not idle or suspended.
1 Executing.
2 Waiting for thread.
3 Between retries.
4 Idle.
5 Suspended.
7 Performing completion actions

*/


DECLARE @job_id uniqueidentifier,
@is_sysadmin int,
@job_owner sysname

SELECT @job_id = jv.job_id,
@job_owner = sp.name
FROM msdb.dbo.sysjobs_view jv
INNER JOIN sys.server_principals sp
ON jv.owner_sid = sp.sid
WHERE jv.name = @job_name

SELECT @is_sysadmin = ISNULL(IS_SRVROLEMEMBER(N'sysadmin'), 0)

CREATE TABLE #xp_results
(
job_id uniqueidentifier NOT NULL,
last_run_date int NOT NULL,
last_run_time int NOT NULL,
next_run_date int NOT NULL,
next_run_time int NOT NULL,
next_run_schedule_id int NOT NULL,
requested_to_run int NOT NULL, -- BOOL
request_source int NOT NULL,
request_source_id sysname COLLATE database_default NULL,
running int NOT NULL, -- BOOL
current_step int NOT NULL,
current_retry_attempt int NOT NULL,
job_state int NOT NULL
)

INSERT INTO #xp_results
EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner, @job_id

SELECT @execution_status = job_state
FROM #xp_results

DROP TABLE #xp_results

IF @select_data =1
SELECT @job_name AS job_name,
@execution_status AS execution_status

SET NOCOUNT OFF
GO

EXEC sp_MS_marksystemobject 'sp_dba_GetSqlJobExecutionStatus'
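
Here is a quick, hypothetical example of calling it from another script:

-- Hypothetical job name; status 4 means idle, anything else means the job is active
DECLARE @status int

EXEC dbo.sp_dba_GetSqlJobExecutionStatus
        @job_name = N'Maintenance - CheckDB - MyDatabase',
        @execution_status = @status OUTPUT

IF @status = 4
    PRINT 'Job is idle.'
ELSE
    PRINT 'Job is running or otherwise active.'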

Conclusion

I hope you find this stored procedure as useful as I do. Please let me know if you run into any issues or have any ideas that would make it better. As always, scripts from the internet are like Halloween candy: inspect before consumption. I offer no warranty beyond a sympathetic ear if you should run into any issues.