Use OMS (Log Analytics) to monitor and send alerts for Blue Screen of Death

At times there is a driver or two that’s misbehaving and causing bluescreens. As the server automatically reboots after dumping memory to the memory.dmp file, you might not get a report from your users that there has been a problem. And depending on your monitoring tool, you might not get an alert there either. Operations Manager can easily alert you for things like that, but far from all customers use OpsMgr due to its complexity. Luckily, it’s just a one-minute job to get an alert in OMS if you have had a bluescreen! And as OMS can be run in Free mode, you may be able to monitor your servers for free (depending on the amount of data you collect). Otherwise it’s really cheap, so it’s no big deal if you need to use a standard subscription. Anyway, let’s get to the technical stuff!

First of all, configure OMS to collect the System event log, including all Error messages.

[Screenshot: omserrordata]

Then create an alert like this:

[Screenshot: oms_bsod]

The Alert text to be used is:

That will only alert for crashes. You can also enable an alert for Event ID 6008, which will alert you about an unexpected shutdown. The difference is that my alert will only fire if there was a BSOD, while an unexpected-shutdown alert will also fire if someone pulled the power. You could even combine both into one alert with an OR statement. In my case, I just want to get alerted about the BSODs, so that’s the only thing I look for right now.
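To make the difference between the two kinds of event concrete, here is a small Python sketch. It is not the OMS alert query itself, just the same decision written out as code. It assumes that a bugcheck shows up in the System log as Event ID 1001 with source BugCheck (also seen under the provider name Microsoft-Windows-WER-SystemErrorReporting), which is how Windows normally records it but is not stated in this post. The function names are made up for illustration.

```python
# Event IDs in the Windows System event log.
# ASSUMPTION: 1001 is the bugcheck record; the post's own alert text is not shown.
BUGCHECK_EVENT_ID = 1001
UNEXPECTED_SHUTDOWN_EVENT_ID = 6008

# ASSUMPTION: source names under which the bugcheck record may appear.
BUGCHECK_SOURCES = {"bugcheck", "microsoft-windows-wer-systemerrorreporting"}


def classify_event(event_id, source):
    """Return 'bsod', 'unexpected_shutdown' or None for one System log event."""
    if event_id == BUGCHECK_EVENT_ID and source.lower() in BUGCHECK_SOURCES:
        return "bsod"
    if event_id == UNEXPECTED_SHUTDOWN_EVENT_ID:
        return "unexpected_shutdown"
    return None


def should_alert(event_id, source, include_unexpected_shutdown=False):
    """BSOD-only by default; optionally also alert on unexpected shutdowns (the OR case)."""
    kind = classify_event(event_id, source)
    if kind == "bsod":
        return True
    return include_unexpected_shutdown and kind == "unexpected_shutdown"
```

With the default arguments this behaves like the BSOD-only alert; passing include_unexpected_shutdown=True corresponds to combining both conditions with OR.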

Set how often it should check. There is usually no need to check more than once or twice an hour. Finally, define whether it should send an email alert or use one of the other alert methods.

Easy as that! Next time you get a bluescreen on a server, you will get an alert by mail so you can debug the dump and find out what’s causing it.

It will look like this:

[Screenshot: bsodmail]

Enable Driver Verifier for all non-Microsoft drivers with PowerShell

I’ve been doing some debugging for a customer who has multiple industrial client PCs that are rebooting regularly. To get more information in the memory dumps, I needed to configure the system to do a complete memory dump and also to enable extra verification of all drivers in the system to find the cause of the bluescreens.

Windows has a built-in tool called “Verifier” where you can enable extra checks on calls made by specific drivers. You generally don’t want to enable it on all drivers, as that will slow down the system notably. And truthfully, it’s rarely a Microsoft device driver that’s causing the issue, because they check and stress test their drivers so much better than all the other vendors. Thus, it’s better to start by enabling the extra checks for all drivers except the ones from Microsoft.

As I didn’t want to run around to all the client PCs and configure Verifier, I’ve made a small PowerShell script that reads the names of all non-Microsoft drivers from the system and enables verification for just those drivers. The script can then be executed in any number of ways.

It uses both Get-WmiObject and Get-WindowsDriver to get a complete list of third-party drivers in the system. It will also configure the system for a Complete Memory Dump.

Just to be safe, I’ve added /bootmode resetonbootfail so it will reset the Verifier settings in case the system bluescreens during boot because Verifier notices a bad driver in the boot process.
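The logic described above can be sketched roughly as follows. This is written in Python rather than PowerShell and is not the original script: it only shows how the driver list might be filtered and how the two command lines might be assembled. The function names and the (file name, provider) input format are hypothetical, the order of the verifier switches is an assumption worth checking against verifier /?, and CrashDumpEnabled = 1 is the registry value Windows uses for a complete memory dump.

```python
# Registry location of the crash dump settings; value 1 means "complete memory dump".
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"
COMPLETE_MEMORY_DUMP = 1


def is_microsoft(provider):
    """Treat any provider name starting with 'Microsoft' as a Microsoft driver."""
    return provider.strip().lower().startswith("microsoft")


def third_party_driver_files(drivers):
    """Return sorted, de-duplicated .sys names whose provider is not Microsoft.

    `drivers` is an iterable of (file_name, provider) pairs, e.g. collected
    from the driver inventory of the machine (hypothetical input format).
    """
    names = {
        file_name.lower()
        for file_name, provider in drivers
        if file_name.lower().endswith(".sys") and not is_microsoft(provider)
    }
    return sorted(names)


def build_verifier_command(driver_files):
    """Build the verifier command line as an argument list.

    ASSUMPTION: switch order; confirm with `verifier /?` on the target system.
    """
    if not driver_files:
        raise ValueError("no third-party drivers to verify")
    return ["verifier", "/standard", "/bootmode", "resetonbootfail",
            "/driver", *driver_files]


def build_complete_dump_command():
    """Build a reg.exe command that selects a complete memory dump."""
    return ["reg", "add", CRASH_CONTROL_KEY, "/v", "CrashDumpEnabled",
            "/t", "REG_DWORD", "/d", str(COMPLETE_MEMORY_DUMP), "/f"]
```

Both builders return argument lists that could be handed to subprocess.run from an elevated prompt. Keeping them as pure functions makes it easy to inspect the resulting command before anything is changed on the machine.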

Reboot the PC, get a big cold Coke and wait for the bluescreen to happen.

Bugcheck: DRIVER_POWER_STATE_FAILURE (9f)

I experienced a Bluescreen of Death (BSOD) on my Windows 8 Laptop (HP EliteBook 8560w) this morning when it resumed from Hibernate.
I quickly launched WinDBG and opened the crashdump.

WinDBG managed to find the driver that caused this problem by itself this time. But if WinDBG had not been able to show me the faulty driver, the next step would have been to use the Bugcheck info (0x0000009f) to dig further into this:

The last argument is the interesting one, and we should look into it further with the !irp command.

It will show something similar to this. And it’s the e1c63x64.sys driver that was active at the time of the bluescreen. That is the same info !analyze -v managed to figure out by itself.
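The step from the bugcheck arguments to the !irp command can be written out as a tiny Python helper. This is only an illustration: the function name is made up, and it relies on Microsoft’s description of bugcheck 0x9F, where a first argument of 0x3 means a device object has been blocking an IRP for too long and the fourth argument is that blocked IRP.

```python
DRIVER_POWER_STATE_FAILURE = 0x9F
BLOCKED_IRP_SUBTYPE = 0x3  # first argument: a device object blocked an IRP too long


def irp_command(bugcheck_code, args):
    """Return the WinDBG command that inspects the blocked IRP of a 0x9F bugcheck."""
    if bugcheck_code != DRIVER_POWER_STATE_FAILURE:
        raise ValueError("not a DRIVER_POWER_STATE_FAILURE bugcheck")
    if len(args) != 4:
        raise ValueError("a bugcheck has exactly four arguments")
    if args[0] != BLOCKED_IRP_SUBTYPE:
        # For other subtypes the fourth argument is not an IRP address.
        raise ValueError("fourth argument is only the blocked IRP when arg1 is 0x3")
    return f"!irp {args[3]:016x}"
```

Called with the four values printed in the bugcheck line of the dump, it returns the !irp command with the last argument as a 16-digit hexadecimal address, ready to paste into WinDBG.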

Hmm, so what driver is that?

[Screenshot: intel_driver1]

Too bad that it was unable to provide more detailed information. But the old-school Properties dialog of the \SystemRoot\system32\DRIVERS\e1c63x64.sys file gave this:

And a quick search on Intel’s support site showed that there was a newer version available for my NIC, the Intel(R) 82579LM Gigabit Network Connection.

Driver updated, and hopefully no more bluescreens due to this driver bug.