****UPDATES: See update notes at bottom of the page.
I recently encountered an issue where a client running a couple of SharePoint (SP) 2013 Farms was starting to see service failures across the board which adversely affected the availability of their sites. Prior to these failures, this client had recently upgraded their SCOM 2012 R2 environments to SCOM 2016, but there’s no definitive indication of causality as it relates to the SCOM upgrade. This article will provide an in-depth review of the issue, and a tested workaround. I’ll also update it with any new information that the Microsoft product group brings to light.
First, we determined that the affected sites were based on legacy IIS application pools and that the application servers had SCOM 2016 agents deployed to them. Upon further review and root cause analysis, we determined that the Application Performance Monitoring (APM) feature in SCOM 2016 appeared to be the cause of the issues, even though APM wasn’t enabled in their SCOM management groups (neither server-side nor client-side monitoring). This was rather unusual, and it raised the question: why would SCOM APM adversely affect the SharePoint sites even though APM wasn’t enabled in SCOM? A detailed review of the problem ensued.
The SP service failure we encountered, which relates to a System.IO.FileLoadException, is shown below:
Root Cause Analysis
A review of the Application logs on the application web servers showed application and .NET Runtime errors with event IDs 1000, 1026, and 1325 that correspond to the application pool failures reviewed above.
Furthermore, the SCOM event log contains error and informational events that indicate a connection to the System Center Management APM service and relate to the same faulting instance referenced in the Application logs. Look for event IDs 4009, 4151, 4003 and 1121.
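If you'd rather pull these events from a script than click through Event Viewer, here is a minimal Python sketch that builds a wevtutil query for the event IDs listed above. The helper names are hypothetical, and the log names ("Application" and "Operations Manager") are my assumption of where these events land on a typical agent-managed server, so adjust them if your environment differs.

```python
import subprocess

# Event IDs quoted in the article; the log names used with them are assumptions.
APPLICATION_LOG_IDS = [1000, 1026, 1325]
SCOM_LOG_IDS = [4009, 4151, 4003, 1121]


def build_event_query(event_ids):
    """Return an XPath filter matching any of the given event IDs."""
    clauses = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return f"*[System[({clauses})]]"


def build_wevtutil_args(log_name, event_ids, count=20):
    """Build a wevtutil command that lists the newest matching events as text."""
    return [
        "wevtutil", "qe", log_name,
        f"/q:{build_event_query(event_ids)}",
        "/f:text",      # human-readable output
        "/rd:true",     # newest events first
        f"/c:{count}",  # limit the number of events returned
    ]


def read_events(log_name, event_ids, count=20):
    """Run wevtutil (Windows only) and return its text output."""
    result = subprocess.run(
        build_wevtutil_args(log_name, event_ids, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On an affected server, something like read_events("Application", APPLICATION_LOG_IDS) and read_events("Operations Manager", SCOM_LOG_IDS) should surface the events described above, assuming those log names match yours.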
And a quick review of the state of the SCOM Services confirms that the System Center Management APM service is deployed but stopped.
I hear you asking out loud: what is this service? Why does it exist, and why is it stopped? To answer these questions, a bit of background on SCOM APM is in order.
The System Center Management APM service, better known as the Microsoft Monitoring Agent APM service, monitors the health of .NET applications on the servers to which it is deployed. Note that when a SCOM agent is deployed to a computer, the service is installed by default but remains stopped with a ‘disabled’ startup type until it is called into action by the Client-Side Monitoring feature in SCOM, at which point its startup type changes to ‘manual’ and it goes into a running state.
What are Application Performance Monitoring (APM) and Client-Side Monitoring (CSM), and how are they affecting IIS Application Pools?
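To check the service's state and startup type across several servers without opening the Services console, a small Python sketch along these lines may help. It shells out to the built-in sc utility and parses the STATE and START_TYPE fields. The function names are hypothetical, and the service key name is an assumption on my part, so confirm it on one of your servers first.

```python
import re
import subprocess

# Assumed service key name; confirm it in the Services console on your servers.
APM_SERVICE = "System Center Management APM"


def parse_sc_field(sc_output, field):
    """Extract the symbolic value of a field such as STATE or START_TYPE."""
    match = re.search(rf"^\s*{field}\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def apm_service_status(service=APM_SERVICE):
    """Return (state, start_type) for the service; Windows only."""
    # "sc query" reports the current state, "sc qc" reports the configuration.
    state_out = subprocess.run(
        ["sc", "query", service], capture_output=True, text=True
    ).stdout
    config_out = subprocess.run(
        ["sc", "qc", service], capture_output=True, text=True
    ).stdout
    return (
        parse_sc_field(state_out, "STATE"),
        parse_sc_field(config_out, "START_TYPE"),
    )
```

Going by the behavior described above, a freshly deployed agent with Client-Side Monitoring unused should report a stopped state and a disabled startup type. If either value comes back as None, the service was most likely not found under the assumed name.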
APM is a feature that enables you to monitor the performance of your applications to ensure that they have no exceptions, are reliable and are meeting the service level agreements (SLAs) defined for them. Client-Side Monitoring is a component of APM that enables you to ensure that your customers (end-users of applications) are having quality web experiences. For the purposes of this article, let’s review how SCOM implements Client-Side APM for .NET applications.
I’ve underscored the above statement to point out a crucial component of the APM functionality that appears to be causing the SharePoint service failures referenced above. There appears to be an anomaly in the APM service component of the SCOM 2016 agent that adversely affects legacy IIS application pools: even though APM isn’t enabled, the service appears to be running against the IIS application pools.
I’m calling this a workaround because I imagine that a formal fix will be provided by the Microsoft product team, which is well aware of this issue. I’ve found two workarounds to be effective here:
- Repair or reinstall the SCOM 2016 Agent with the NOAPM switch set to true (NOAPM=1)
- Replace the SCOM 2016 Agent with a SCOM 2012 R2 Agent. Compatibility here doesn’t appear to be an issue.
To install the agent with the NOAPM switch, use msiexec to execute the MOMAgent.msi installer:
MSIEXEC /i MOMAgent.msi NOAPM=1
This will reinstall the health service without installing the System Center Management APM service, after which the affected sites based on your IIS Application Pools should be happy.
NOTE: After implementing the workaround, it’s OK to apply any update rollups that apply to the newly installed agents.
I hope that this helps anyone who has encountered this issue, and I’ll update when new information from the Microsoft product group comes to light. Cheers!
****UPDATE 03/22/2017: I just noticed that the product team has released some information about this issue. It looks like a definitive fix will ship with UR 3 for SCOM.
****UPDATE 05/25/2017: Microsoft just released the highly anticipated Update Rollup 3 for System Center, including OpsMgr. One of the many issues that I, and I imagine the rest of the community, hoped would be fixed in this rollup is the one this article describes, but the update doesn’t appear to fix it properly. See official update notes here.