A customer of mine was experiencing performance drops on their HR server and asked me to get to the bottom of it. As always when working with Windows servers and performance, I headed over to Perfmon.
I knew how to use the monitor but I had never set up a Data Collector Set previously. So here’s how to set it up.
To start things off, let’s take a look at Performance Monitor (AKA Perfmon) on a Windows Server 2008 R2 machine.
OK, so what? It looks about the same as it did before, with some different wording and layout. Let's dig in a bit more and look at Data Collector Sets, which are what we use to capture historical data.
Here you can see that I do not have any User Defined Data Collector Sets. I'm going to create a set using the built-in System Performance template and accept the defaults all the way through the wizard. If you have questions about how to create this set, check out this resource:
Create a Data Collector Set Manually
http://technet.microsoft.com/en-us/library/cc766404.aspx
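If you'd rather script this than click through the wizard, logman can import a data collector set from a template XML. This is just a sketch: the template path below is a hypothetical file you would first export from Perfmon, not something that ships at that location.

```shell
:: Sketch: create the same collector set from an exported template XML.
:: "C:\Templates\SystemPerformance.xml" is a placeholder path to a
:: template exported from Perfmon; the set name matches this post.
logman import "This is easy" -xml "C:\Templates\SystemPerformance.xml"

:: Verify it was created.
logman query "This is easy"
```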
Now I have my data collector called "This is easy". At this point I have not set any schedules or tweaked any advanced settings, so this data collector will only collect data when I start it manually. Let's check out the properties of my new collector, specifically the Schedule tab, since I want this collector to run every day from 12AM to 11:59PM.
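For what it's worth, logman exposes the same scheduling knobs from the command line. A sketch, assuming logman's -b (begin time), -e (end time), and -r (repeat daily between those times) switches; the date portion is a placeholder:

```shell
:: Sketch: schedule the collector to run daily, 12:00AM to 11:59PM.
:: -b = begin, -e = end, -r = repeat daily between begin and end.
logman update "This is easy" -b 1/1/2012 12:00:00AM -e 1/1/2012 11:59:00PM -r
```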
OK, easy enough, looks like I have my schedule configured correctly to start every day at 12AM. Note I did not set an Expiration Date as I don’t want the collector to quit anytime soon. Next, we’ll look at the Stop Condition tab.
Here’s where it gets a bit tricky. Important note: reports are compiled once the collector set stops. This means I can’t let this collector simply run forever, because I would never get a report. I need it to stop at 11:59PM so it can create a cool report that looks like this:
Have you figured out what’s missing yet? Let me give you a hint by showing you what the Schedule tab looks like on a Counter Log in Windows Server 2003:
Yep, there’s no “Stop At” setting, so I can’t set a stop time! After some pondering, I decided to roll with 1439 minutes for the Overall duration setting.
How did I come up with that? The customer wanted a report from 12AM to 11:59PM which is 23 hours and 59 minutes total run time. (23*60) + 59 = 1439 minutes (sorry for the grade school math flashback there).
So I sent this over to the customer, and he seemed happy. The next day he told me it didn’t work. Impossible! Seeing is believing, so I looked at the reports on his system and, sure enough, the job had ended at something like 11AM, missed the 12AM schedule, and was not going to start again until 12AM that night, meaning he would miss 13 hours of reports. No good.
How could this be? Pondering some more, I figured out that he had manually started the collector right after creating it, around 11AM the day prior. Since it runs for 1439 minutes from start time, it ran until 11AM, and the 12AM job didn’t kick off because the collector was already running. Meh, there must be a better way! I went back to the drawing board.
So back to where this post started, 1AM on Tuesday, I had a very simple idea (perhaps via inception):
What underlying mechanism controls the schedule for these data collectors to start and stop?
Task Scheduler!
Opening Task Scheduler, we’ll navigate to Task Scheduler Library > Microsoft > Windows > PLA. This is where you will find your data collector set schedules.
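You can inspect that backing task from the command line too. A sketch using schtasks, assuming the task name mirrors the collector set name under the PLA folder:

```shell
:: Sketch: show the PLA scheduled task that backs the collector set.
schtasks /Query /TN "\Microsoft\Windows\PLA\This is easy" /V /FO LIST
```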
Getting the properties on my collector set and editing the Trigger on the Triggers tab, you can see that even this screen has no way to end at a certain time. At first I thought I could select the Repeat Task option, but one hour is the maximum configurable interval.
Exploring a bit more, I went into the Settings tab, where one setting caught my eye:
If the task is already running, then the following rule applies: Stop the existing instance
Ah ha! What if I didn’t configure a Stop Condition at all and used this setting to stop the collector when the next instance starts? Guess what: it worked. Now my collector starts at 12AM and finishes at 12AM the next day, when the new instance starts up. As a bonus, the customer gets one additional minute of reporting, from 11:59PM to 12AM.
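Under the hood, that dropdown maps to the MultipleInstancesPolicy element in the task's XML definition (the Task Scheduler schema allows IgnoreNew, Parallel, Queue, and StopExisting). A sketch of how to see it; the output file name is mine:

```shell
:: Sketch: export the task definition and look for the instance policy.
schtasks /Query /TN "\Microsoft\Windows\PLA\This is easy" /XML > "C:\task.xml"

:: Inside the exported XML you should see something like:
::   <Settings>
::     <MultipleInstancesPolicy>StopExisting</MultipleInstancesPolicy>
::   </Settings>
```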
Victory? Not quite. Once I tested this, I noticed that report generation broke. Great, yet another area of Perfmon that has always just worked, so how am I going to fix this one? Switching back to Perfmon, there are some additional data management settings that are not so obvious. Right-clicking on the data collector set revealed an option I had not touched before called Data Manager.
Here you can control the disk space consumption of the folder that houses data from the data collector in question.
If you browse out to where Perfmon puts data collector files, you will see various files, as seen below. The .ETL file is the kernel ETW trace that is enabled in my data collector, the .BLG file is the performance counter data, and the report files are, well, the report :). The rules.log is fun to look at, but I’ve never messed with it; it corresponds to the Rules tab back on the Data Manager window.
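If report generation ever breaks, you can in principle rebuild the report from those raw files with tracerpt. A sketch only; the file names below are examples matching a typical collector output folder, not exact names from this system:

```shell
:: Sketch: regenerate the System Performance report from the raw files.
:: "DataCollector01.etl" and "Performance Counter.blg" are example names.
tracerpt "DataCollector01.etl" "Performance Counter.blg" -report report.html -f html
```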
Not very interesting on the surface, yet when I went to view the data for the collector that was stopped by the new instance starting, I noticed that none of this data was there, just a .CAB file. Why? Let’s go back and look at the Actions tab on the Data Manager.
If you look at the Actions tab here, you’ll notice a Folder action at the top that basically says: after 1 day, no matter what size it is, create a CAB and delete the data. So when my data collector stopped, it was one day old, and the data was being deleted and placed into a CAB before the tracerpt.exe process could parse the ETL and BLG files and generate the report. To fix this, I edited the Folder action, changed the Age to 2 days, and OK’d all the way out.
Did it work after that? Yes. Was I happy with the solution? No; it shouldn’t be this hard for something so simple. Is this the only way to do it? It’s the only way I could think of. Why did they take out the ability to stop at a definitive time? I don’t know, but it sure made this a lot more interesting to figure out!
Cred to Jake.