Monitor Cloud Connector using Operations Management Suite (OMS)

Skype for Business Server 2015
 

Topic Last Modified: 2017-11-16

Read this topic to learn how to monitor your Cloud Connector version 2.1 and later deployment by using Microsoft Operations Management Suite (OMS).

You can now monitor your Cloud Connector version 2.1 and later deployment by using Operations Management Suite (OMS), a Microsoft cloud IT management solution. OMS Log Analytics enables you to monitor and analyze the availability and performance of resources including physical and virtual machines. For more information about OMS and Log Analytics, see What is Operations Management Suite (OMS)?.

This topic contains the following sections:

  • Prerequisites

  • Configure Cloud Connector to use OMS

  • Configure OMS

  • Analyze the alerts in your Log Analytics repository

  • Recommended monitoring set

Before you can use OMS to monitor your Cloud Connector deployment, you will need the following:

You'll need to configure your Cloud Connector on-premises environment to use OMS. To do this, you will need your OMS workspace ID and key, which you can find by using the OMS portal as follows: Settings -->Connected Sources --> Windows Servers:

Screen shot for Cloud Connector OMS

How you configure Cloud Connector to use OMS depends on your scenario:

  • If you are installing a new Cloud Connector appliance or you want to re-deploy an appliance, follow these steps before you run Install-CcAppliance:

    1. In the CloudConnector.ini file [Common] section, set the OMSEnabled parameter to True.

      Each time Cloud Connector is deployed or upgraded, it will try to install the OMS agent automatically onto the VMs. Enable this feature so the OMS agent can survive the Cloud Connector automatic update.

    2. To configure the OMS ID and key, run Set-CcCredential -AccountType OMSWorkspace.

  • If you are installing an OMS agent onto an existing Cloud Connector appliance, follow these steps:

    1. In the CloudConnector.ini file [Common] section, set OMSEnabled=true.

    2. Run Import-CcConfiguration.

    3. Run Install-CcOMSAgent.

      noteNote:
      If the OMSWorkspace credential has never been set, you are prompted for the credential when you run install-CcOMSAgent.
  • If you want to update the OMS workspace ID or key in a Cloud Connector appliance that has already installed an OMS agent:

    1. To configure the OMS ID and key, run Set-CcCredential -AccountType OMSWorkspace.

    2. To apply the updates, run Install-CcOMSAgent.

  • For all scenarios, verify that the agents are connected as follows:

    In the OMS portal, go to Settings -> Connected Sources -> Windows Servers. You will see a list of connected machines.

Next, you will need to specify your OMS configuration by using the OMS portal. Specifically, you will need to:

  • Specify information about event logs and performance counters.

  • Create alerts.

In the OMS portal, you must specify information about the event logs and performance counters as follows:

  1. Go to Settings->Data->Windows Event logs, and add event logs for:

    • Lync Server

    • Application

    noteNote:
    You must manually enter Lync Server in the text box. It does not appear as an option in the drop-down list.

    For more information, see Windows event log data sources in Log Analytics

  2. Go to Settings->Data-> Windows Performance Counters, and add performance counters for:

    • OS level counters. You can add OS level counters, such as processor usage, memory usage, network usage, or you can use existing solutions such as Capacity and Performance, Network Performance Monitor without adding counters explicitly. No matter how you decide to monitor them, Microsoft recommends that you monitor these OS counters.

    • Skype for Business counters. There are numerous counters provided by Skype for Business. You can find these counters by logging on to any Mediation Server and opening the Performance Monitor. These counters start with "LS:". Microsoft recommends that you start with the following capacity counters at a minimum, and add others that are of interest:

      Total active calls:

      • LS:MediationServer - Inbound Calls(_Total)\- Current

      • LS:MediationServer - Outbound Calls(_Total)\- Current

      Total active media bypass calls:

      • LS:MediationServer - Inbound Calls(_Total)\- Active media bypass calls

      • LS:MediationServer – Outbound Calls(_Total)\- Active media bypass calls

    noteNote:
    You must manually enter the performance counters in the text box. They do not appear as options in the drop-down list.

    For more information, see Windows and Linux performance data sources in Log Analytics

There are two types of alerts in OMS: Number of results alerts and Metric measurement alerts. For more information about creating alerts, see Working with alert rules in Log Analytics.

You should consider the following when creating alerts:

  • Make sure the alert is a Number-of-results alert, which is the default selection.

  • The demo queries require that “Number of results” is set to “Greater than 0”.

  • It is recommended that you set both Time window and Alert frequency to 5 minutes.

  • It is recommended that you do not enable “Suppress alerts” for demo alerts.

  • For typical alert scenarios, Microsoft recommends creating a pair of alerts: one error alert and one reset alert. For the error alert, select severity level Critical; for the reset alert, select severity level Informational .

The following sections describe how to create sample alerts.

Create an alert pair: "RTCMEDSRV is NOT running in Mediation Servers" and "RTCMEDSRV is back in running in Mediation Servers"

To create this alert pair:

  • The query for the error alert is:

    Event | where Computer contains "MediationServer" | where EventLog == "Lync Server" and (EventID == 25002 or EventID == 25003)
     | summarize arg_max(TimeGenerated, EventID) by Computer | where EventID == 25003
    

    The query uses the computer filter where Computer contains "MediationServer". The filter selects only the computer whose name contains the string “MediationServer”.

    You would replace the filter with your own computer filter or simply remove it. You can create complex string filters without regular expressions. For more information, see String operators. You can also choose to use regular expressions. Moreover, you can create a computer group by saving a search query and using that group as your computer filter in your alert query. For more information, see Computer groups in Log Analytics log searches.

    For each computer, the error query will get the last event log for both the RTCMEDSRV service start and service stop. It will return one log if the last event is the service stop event; it will return nothing if the last event is the service start event. In short, the query would return a list of servers whose RTCMEDSRV is stopped in the time window.

  • The query for the reset alert is:

    Event | where Computer contains "MediationServer" | where EventLog == "Lync Server" and (EventID == 25002 or EventID == 25003)
     | summarize arg_max(TimeGenerated, EventID) by Computer  | where EventID == 2500
    

    The reset query does exactly the opposite thing of the error query. For each computer, it will return one if the last event is the service start event; it will return nothing if the last event is the service stop event.

Create an alert pair: " Too many concurrent calls in Mediation Servers" and “Concurrent calls fall back to normal load”

To create this alert:

  • The query for the error alert is:

    Perf | where Computer contains "MediationServer" | where (ObjectName == "LS:MediationServer - Outbound Calls" or ObjectName 
    == "LS:MediationServer - Inbound Calls") | summarize arg_max(TimeGenerated, CounterValue) by ObjectName, Computer | summarize
     TotalCalls = sum(CounterValue) by Computer| where TotalCalls >= 500
    

    For each computer, the query will get the last counters for inbound call and outbound call and sum those two values. It will return one log if the sum value exceeds 500; it will return nothing if it doesn't. In short, the query would return a list of servers whose concurrent calls are too many in the time window.

  • The query for the reset alert is:

    Perf  | where Computer contains "MediationServer" | where (ObjectName == "LS:MediationServer - Outbound Calls" or ObjectName ==
     "LS:MediationServer - Inbound Calls") | summarize arg_max(TimeGenerated, CounterValue) by ObjectName, Computer | summarize
     TotalCalls = sum(CounterValue) by Computer| where TotalCalls < 500
    
    

    The reset query does exactly the opposite thing of the error query. For each computer, the query will get the last counters for inbound call and outbound call and sum those two values. It will return one log if the sum value is less than 500; it will return nothing otherwise.

Create an alert: "CPU usage > 90 or RTCMEDIARELAY stopped in Servers" alert

To create this alert, the query is:

search *| where Computer contains "MediationServer" | where (Type == "Perf" or Type == "Event") | where ((ObjectName ==
 "Processor" and CounterName == "% Processor Time") or EventLog == "Lync Server") | where (CounterValue > 90 or EventID == 22003)

The query will get all processor usage counter and service stop event from all computers and return one log if either processor usage exceeds 90% or service is ever stopped.

To analyze the alerts in your repository, use the Alert Management solution. For more information, see Alert Management solution in Operations Management Suite (OMS)

To identify issues with event logs and performance counters:

  • Event logs. For any issue, there should be an events pair, with one set of events to indicate something is wrong, while the other indicates that everything is well. For any given time period, it is the last event recorded that will indicate whether something is amiss for that time period.

  • Performance Counters. There should be a threshold for the monitored counters.

The following table lists the services that Microsoft recommends monitoring by listing the stop and start event IDs:

 

Service Name

Target Server Role

Stop Event ID

Start Event ID

RTCMEDSRV

Mediation Server

25003

25002

RTCSRV

Edge Server

12289

12288

RTCMRAUTH

Edge Server

19003

19002

RTCMEDIARELAY

Edge Server

22003

22002

The following table lists the network issues that Microsoft recommends monitoring:

 

Monitor Name

Target Server Role

Success Event ID expression

Error Event ID expression

Failure example

Mediation Server to gateway connectivity failure

Mediation Server

25062 ||

25002

25061

MS PING (option) Gateway failed

Mediation Server to gateway call completion failure

Mediation Server

25064 ||

25002

25063

MS trying to make call to Gateway failed

Critical network problems

Edge Server

14353 ||

12288

14624

Transport TLS has failed to start on local IP address 192.168.231.14 at port 5061

The following lists the call capacity counters that should be monitored. These numbers should be less that 500 for Cloud Connector standard edition; less than 50 for Cloud Connector minimum edition.

  • LS:MediationServer - Inbound Calls(_Total)\- Current

  • LS:MediationServer - Outbound Calls(_Total)\- Current

  • LS:MediationServer - Inbound Calls(_Total)\- Active media bypass calls

  • LS:MediationServer - Outbound Calls(_Total)\- Active media bypass calls

 
Show: