In my previous post I talked about how to create an efficient system to calculate working n working days from a given day. However, you could quite reasonably ask how I can talk about efficient calculation given that I didn’t even mention indexes. That wasn’t an oversight; I didn’t want to complicate the overall story.
I have to admit that I am personally not an expert on the subject of indexing, but I am frequently amazed at the incredible performance improvements that good indexing can offer to a system. It’s one of those things like staring at an aeroplane and being amazed that they can fly – you can see it every day, but it’s still amazing!
In SQL 2008, which I am using, the Display Execution Plan window sometimes offers advice on what indexes should be added and the estimated performance improvements from doing so. As far as I understand, there are two reasons why you might not follow that advice to the letter:
- Sometimes the index that it prescribes will be overly specific, and tuned just for that query.
- Each index that you add means more updates to index lists when you add new rows to the table, so maybe you will not want to add an index at all.
In our case, tblDay, tblHoliday, and tblCalendar will not be frequently updated – I would imagine once per year or thereabouts. Given that, the additional cost when inserting new rows is pretty much irrelevant, but in this case unfortunately SSMS 2008 didn’t offer any advice, so it was time to think things through for myself.
In my experience, setting up indexes is almost more of an art than a science. I had this view confirmed a few years ago when I asked the DBA in a place I was working to help me interpret a query plan. I asked him what I should be looking for as hints about where to put indexes. He said “anything that looks a bit odd or doesn’t look right.” He was trying to be helpful, but was unable to articulate his knowledge.
Indexes: A real primer
An index will sort things by one field then a second field then a third field, etc. Think of a telephone directory that lists all people by Last Name, First Name. Now, imagine that you are looking for Homer Simpson’s number. It’s pretty easy. Find the Simpsons then search through the H’s in the first name section, past Hannah, Helen, Hillary, and if you get to Hugh you’ve gone too far.
Now imagine that we were looking for a chap called Homer whose last name we don’t know, but we do know his phone number is 0573 525 342. Suddenly the Last Name, First Name index isn’t quite so useful! It is still better than nothing though. I can search through each Surname then use the alphabetical first name listing to search for all the Homers with that surname.
What about if I wanted to find out who has a particular phone number, say 0573 525 342. Now suddenly the sorting of the phone book by Last Name, First Name is no use at all. I’m going to have to search through the entire directory to find that number.
As a solution to all this inefficient searching we could, of course, have a couple of completely separate phone books which have the same data in them but sorted by First Name, Last Name in one, and Phone Number in the other. That, however, is obviously not very practical, as it would triple the amount of paper we would need, and it would make updating information a nightmare! What about, instead of that, having two indexes to the phone book in the back of it, one that lists each First Name, Last Name pair and the page and column number of the main book where you will find those details, and another one that lists all phone numbers with a page-column number reference beside them.
Now when searching for Homer when I don’t know his last name but do know his number I can get a list of all Homers in the phone book and with a bit of quick flipping back and forth I can find him soon enough. When trying to find out who has the phone number 0573 525 342 I again use the index in the back, and it will point me to a page and column, say pg 873, col 2, then once again I can find the person quite easily.
Notice the difference between the Last Name, First Name index and the second two. In the first one the data is all there, naturally in the right order. The second two indexes simply tell me where to look in the main table, but two sequential phone numbers might be in totally different places in the main phone book. This is the difference between a clustered index and a non-clustered index. You can have any number of non-clustered indices on a table, but you can only have one clustered index – the phone book is either ordered last-name, first-name or it is ordered first-name, last-name. You can’t have it both ways. It’s also worth noting that in both these cases I’m still going to have to do a bit of scanning (for example, an entire column), but nothing like as much as scanning the entire book.
So let’s think things through for our problem with calculating a working day.
The main function we’ll use now that we’ve populated tblDay is GetWorkingDay. I’ve already described in my previous post exactly how this works.
So, if we imagine starting with an arbitrary date of 13 Nov 2011, and wanting to calculate 36 working days forward using CalendarID 2. The first thing we’re going to have to do is find the row of tblDay for 13 Nov 2011 with CalendarID 2. With no indexing at all, we’re already going to be having to search through the entire table, much like searching for a phone number. There are two criteria there, CalendarID and Date, so we could index our table by CalendarID then Date, or Date then CalendarID. This is a bit like Last Name, First Name vs First Name, Last Name. My theory here, and I don’t know if it is correct, is to do the “biggest cut” you can first. At present we only have two calendars. The maximum we could get is about 200 (roughly one for each country). On the other hand, the number of dates we will have for any given calendar could easily be several thousand, so it makes sense to me to order by date then by calendarID. If someone can show me how that is wrong, I’d be delighted to know!
This can quickly give us the table row that we are after as a starting point, but if we are smart we should be able to provide enough information that we don’t actually need to look up the main table at all. SQL Server lets us INCLUDE additional data from the table in the index.
CREATE NONCLUSTERED INDEX [IDXtblDay_DateForwardBack] ON [dbo].[tblDay] ( [Date] ASC, [CalendarID] ASC, [IsWorkingDay] ASC ) INCLUDE ( [ForwardCount], [BackCount]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] GO
Notice that I’ve INCLUDEd ForwardCount and BackCount, not as part of the index, but as data that is accessible. Maybe think of it as the index is not sorted by ForwardCount or BackCount, but there is another column which shows that data.
This means that SQL Server will not actually need to look up the table at all for the first part of the process. All we are wanting is the ForwardCount for 13 November 2011 using CalendarID= 2. For me, this is 20469. Because we have INCLUDEd this as part of the indexing, this number can be found without actually looking up the table at all.
The Proc will then need to add 36, which shouldn’t be too much of a challenge. We’ll then need to search for the row where ForwardCount = 20505 and IsWorkingDay = True. (Note, we shouldn’t need to check the CalendarID because the ForwardCount and BackCount numbers of the different calendars should be sufficiently separated as to be unique).
Notice that this is a different search from the previous one. Earlier we started with a date and returned a ForwardCount. Now we start with a ForwardCount and want a date. It’s a bit like starting with a name, getting a phone number, adding 36 to the phone number, and then wanting to return the corresponding name. So we need a new index.
I’ve created two that are almost identical, but one for searching forward one for searching backwards.
CREATE NONCLUSTERED INDEX [IDXtblDay_BackDate] ON [dbo].[tblDay] ( [BackCount] ASC ) INCLUDE( [IsWorkingDay], [Date] ) CREATE NONCLUSTERED INDEX [IDXtblDay_ForwardDate] ON [dbo].[tblDay] ( [ForwardCount] ASC ) INCLUDE ([IsWorkingDay], [Date] )
The SSMS Execution Plan shows that the two indices are being used as I would expect them to be used (i.e. when calculating forwards, the ForwardDate one is used, and when calculating backwards, the BackDate one is used), which is quite pleasing.
Here is the final project up to this point.