Tuesday 17 April 2012

Saving Your Database Server From CPU Meltdown

This week I helped investigate a problem with our database server. Our customers were complaining that they could not log in to our website, so we took a look at our database server and found it to be using 100% CPU! Uh oh...

Something was clearly hammering our SQL Server database but how do you find the problem? Where do you start?

Tracing

I quickly Googled around to find out how I can track down performance bottlenecks and happened upon this excellent blog post: Finding the Causes of Poor Performance in SQL Server. I would highly recommend you bookmark this link, it might just save your bacon one day.

I won't repeat ad-nauseum what this post said, but effectively it involved setting up a trace running on your database - just like SQL Profiler but without the GUI - and then reading the trace file that is produced into a temporary table to be able to query it.

So we set up a trace to record all stored procedure calls for 30 minutes and analysed the results. We found that, within this period, one stored procedure was being called over 8000 times and was taking up a vast amount of time/resources. Great, lets start looking at improving this procedure.

Optimising

I can't go into great details about the problematic procedure - the system I work on is for a commercial business - but let me give an overview of what it was doing.

We have a table called Client which represents a hierarchical data structure, so every row in this table has a reference key to a parent Client. This procedure was being given an ID to a Client record and it then had to recurse up through the hierarchy until a certain condition was met, which would then return the top-most Client ID.

There were several things wrong with this approach:
  1. The procedure was using a database function in the WHERE clause to do this hierarchical searching, which is not very good for performance. Couple that with having to search an entire table and it must have slowed down the procedure a lot.
  2. Searching through the hierarchy of Client records was done using a WHILE loop; check the current row and if it did not meet a certain condition check the parent row and keep repeating. In my experience, SQL Server is never that great at working in this imperative fashion.
The fact that this procedure was being invoked so often meant I had to rewrite it to perform better. The solution I came up with is to use another table we have called ClientLookup. This solves the problem of a hierarchy being stored in the database by flattening all the parent IDs for every primary key in the Client table; filtering on a particular Client ID will therefore give you the entire list of parent Clients. Also the ClientLookup table is well indexed.

After rewriting this procedure to join on our ClientLookup table instead of a database function, CPU usage shot down dramatically. Below is a screenshot of the CPU graph in Resource Monitor. I've added an arrow to show the point where I made my procedure changes live.


Conclusion

This one simple change has now meant our entire system is working at light speed; in fact I'm amazed that doing this one thing has made such a profound impact. CPU usage on the server is at the lowest it has been for years whilst still working effectively, webpage request times have reduced greatly and our response times to the website all around the world has been halved.

What can we take away from this? Check your database performance regularly. Using the technique I used above, finding the biggest bottleneck and fixing it could give you big wins.

No comments:

Post a Comment