The Case of the perfect Storm

Hello everyone,

I would like to start this article by sharing that after 5 years as a Deployment MVP (3 years as a Setup and Deployment MVP, then 2 years as a Software Packaging, Deployment and Servicing MVP), I am now switching to the Windows Expert – IT Pro MVP expertise. I am following my fellow MVP Olav Tvedt, who is also making that switch. You will therefore probably see fewer deployment-related articles on this blog and more Windows Server-related ones.

For those of you who are familiar with Mark Russinovich, the title of this article probably sounds familiar. I was of course inspired by his article and conference session series, “The Case of the Unexplained”, in which Mark analyzes issues in depth, mostly using Sysinternals tools. If you’ve never heard of Mark Russinovich or seen or read that series, I highly recommend it.

Context

I am currently working on a Windows 7 deployment project (my last?) and we recently finalized an SCCM 2007 to 2012 migration project. We wanted to implement a GPO with a startup script to install the SCCM agent because we were not able to reach 100% of the clients using other approaches (client push, remote scripts, etc.). We designed a very simple GPO with only one setting: a startup script launching ccmsetup.exe. This GPO was filtered using a security group and a WMI filter to avoid applying it to servers or to computers that already had the client.

We tested this GPO in a dev environment, then linked it at the root of the domain but kept only a small number of machines in the security group to avoid side effects. We linked the GPO at the root of the domain because we wanted to be able to reach the computer objects in the “Computers” container.

After 3 months of testing and slowly adding more and more machines to the security group, we were highly confident that this GPO didn’t have any side effects, and we decided to change the security filtering to apply the GPO to all desktops. We then replaced the security group used for the filtering with “Authenticated Users”, which contains, if you didn’t already know, all computers.

The Problem

Soon after we did this, users started to report issues when launching Citrix applications. It was then confirmed that the issue was widespread: almost all users trying to launch Citrix applications received errors, and the applications did not launch. An emergency team was gathered and, since the GPO change was the only change made that day, it was decided to roll the change back and disable the GPO. To my great surprise, the issue was fixed.

The perfect Storm (Root Cause Analysis)

Since the issue was fixed, we now had to understand what had happened and why it would be linked to that GPO. At that point, I was convinced it was a coincidence and that it had nothing to do with the GPO, because we had confirmed that the SCCM agent was not installed and that SCCM setup never even started.

Our first hint was that the issue affected newly launched Citrix applications, while already running applications were fine. We started reviewing the event logs on the Citrix servers and found that the Application log was spammed with MSI Installer messages. The messages started right after the GPO was implemented, weird …

When looking up those messages, we found a KB article that looked related: KB974524.

This KB applies to:

  • Windows Vista and Server 2008 (yep, that’s correct, the Citrix servers were running 2008)
  • GPOs with a WMI filter on Win32_Product (yep, that’s also our case)

So this is what happened:

  • When we changed our security filtering to “Authenticated Users”, we also targeted users and servers
  • Our WMI filter, meant to keep the GPO from being applied to servers and users, started to be evaluated on the servers whenever users launched Citrix apps (which technically means logging on to the servers)
  • On the Citrix servers running 2008, the bug described in the KB generated all those MSI events and slowed down the application launch; that’s why some users were getting error messages when launching Citrix apps.
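To make that chain of events easier to follow, here is a minimal sketch in Python. It is a simplified model, not how the Group Policy engine is really implemented: the machine names, the product counts and the "Pilot group" label are made up for illustration, and the filter is reduced to "reject servers" (the real one also excluded computers that already had the client). The only point is that the security filter decides whether the WMI filter runs at all, and that the filter has a cost even when it ends up rejecting the machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str                    # hypothetical machine name
    is_server: bool
    in_pilot_group: bool         # member of the small test security group
    installed_msi_products: int  # made-up number, for illustration only


def gpo_in_scope(machine: Machine, security_filter: str) -> bool:
    """Security filtering: does this machine consider the GPO at all?"""
    if security_filter == "Authenticated Users":
        return True
    if security_filter == "Pilot group":
        return machine.in_pilot_group
    raise ValueError(f"unknown security filter: {security_filter}")


def process_gpo(machine: Machine, security_filter: str) -> tuple[bool, bool, int]:
    """Return (gpo_applied, wmi_filter_evaluated, msi_products_touched)."""
    if not gpo_in_scope(machine, security_filter):
        # Out of scope: the WMI filter is never evaluated, so it costs nothing.
        return (False, False, 0)
    # In scope: the WMI filter runs. Assumption taken from the KB: a query on
    # Win32_Product touches every installed MSI product on the machine.
    touched = machine.installed_msi_products
    # Simplified filter outcome: servers are rejected, desktops are accepted.
    applied = not machine.is_server
    return (applied, True, touched)


desktop = Machine("PC-001", is_server=False, in_pilot_group=True, installed_msi_products=40)
citrix = Machine("CTX-01", is_server=True, in_pilot_group=False, installed_msi_products=120)

before = process_gpo(citrix, "Pilot group")
after = process_gpo(citrix, "Authenticated Users")
print(before)  # the filter never runs on the Citrix server
print(after)   # the GPO is still not applied, but the filter ran anyway
```

In this model the Citrix server never gets the GPO in either case, which is why the change looked harmless. What changed is that the server started paying for the filter evaluation at every logon.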

Why the perfect storm? Because

  • We targeted users and servers
  • We excluded them using a WMI filter
  • The servers were running Windows Server 2008

It was very disappointing because we ran tests for months before going full speed, and now we have to go back to the drawing board and start the process all over again. Anyway, it was an interesting troubleshooting phase and I learned a lot.

