Lync CMS Replication to Edge Server Stops Working…


For the most part, CMS replication takes care of itself. It’s pretty reliable. However, sometimes it can stop working for no apparent reason.

This is what happened to me the other day. One of our Edge servers (2012 Standard) was rebooted for OS patching at 0400 and after that point it never received any replication from the CMS.

I tried a lot of things to resolve the issue. Restarting the replication service didn’t throw any errors; it logged various pieces of information about getting certs from the CMS and went into a Startup status without complaint. I just never got the all-important DataReceived status that appears when the server receives data from the CMS.

Port 4443 was tested successfully from the FE pool using telnet, and no changes had been made to anything in the environment. I suspected the cert, but this pool had already been in production for a year and the same cert was in use on the other member of the Edge pool, which was working okay.
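If telnet isn’t installed, the same port check can be sketched in PowerShell. This is a minimal example rather than the exact test I ran: the server name `edge01.contoso.local` is a made-up placeholder, and it assumes you run it from a Front End server against the Edge server’s internal interface.

```powershell
# Hypothetical internal FQDN of the Edge server; replace with your own.
$edgeServer = 'edge01.contoso.local'
$port = 4443

# Plain .NET TCP client, so no extra Windows features are required.
$client = New-Object System.Net.Sockets.TcpClient
try {
    # Connect throws if the port is closed, filtered, or the name does not resolve.
    $client.Connect($edgeServer, $port)
    Write-Host "TCP $port on $edgeServer is reachable."
}
catch {
    Write-Host "TCP $port on $edgeServer is NOT reachable: $($_.Exception.Message)"
}
finally {
    # Always release the socket, whether or not the connection succeeded.
    $client.Close()
}
```

A successful connection only proves the port is open. It says nothing about whether the TLS handshake behind it succeeds, which is why this test passed for me while replication still failed.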

I contemplated uninstalling the patches installed the previous evening but this would require a restart and it’s hard to find a time when the Edge servers aren’t being used by a particular region (the regional Edge servers aren’t in yet).

A good ol’ Google search brought up many people referring to a particular registry entry that solves issues with CMS replication. Most places also mentioned that it needed a reboot, which I was also reluctant to do.

After a couple of hours of SIP tracing showing no immediate issues I decided to try the registry fix.

The key can be found here: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL

I added the following DWORD (32-bit) values:

  • ClientAuthTrustMode with a value of 2. This value prevents schannel.dll from truncating the Root CA list from the Edge server, and allows validation tests to pass.
  • SendTrustedIssuerList with a value of 0. This value stops the server from sending a list of trusted certification authorities to the client. This is needed on servers where a large number of certificates exist in the Trusted Root Certification Authorities container. I had 55.
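I added the values by hand, but the same change can be sketched in PowerShell. This assumes an elevated session on the Edge server, and it is worth exporting the SCHANNEL key first so you can roll back.

```powershell
# Run in an elevated PowerShell session on the Edge server.
$schannel = 'HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL'

# Count the certificates in Trusted Root Certification Authorities
# (a large number here is the situation SendTrustedIssuerList is meant for).
(Get-ChildItem -Path Cert:\LocalMachine\Root).Count

# Create the two DWORD (32-bit) values; -Force overwrites them if they already exist.
New-ItemProperty -Path $schannel -Name 'ClientAuthTrustMode' -PropertyType DWord -Value 2 -Force | Out-Null
New-ItemProperty -Path $schannel -Name 'SendTrustedIssuerList' -PropertyType DWord -Value 0 -Force | Out-Null

# Read both values back to confirm they were written.
Get-ItemProperty -Path $schannel | Select-Object ClientAuthTrustMode, SendTrustedIssuerList
```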

Luckily this didn’t even need a reboot. I left it for about five minutes and then ran Invoke-CsManagementStoreReplication from the FE pool. Immediately I saw a DataReceived event in Event Viewer and the Replication status flipped to TRUE. Nice.
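For reference, forcing replication and checking the result looks roughly like this from the Lync Server Management Shell on a Front End server. The 60-second wait is an arbitrary figure I’ve picked for illustration, not a documented requirement.

```powershell
# Run from the Lync Server Management Shell on a Front End server.

# Ask the CMS to push a replication pass out to all replicas.
Invoke-CsManagementStoreReplication

# Give the replicas a moment to report back (arbitrary wait; adjust as needed).
Start-Sleep -Seconds 60

# UpToDate should read True for the Edge server once it has received the data.
Get-CsManagementStoreReplicationStatus |
    Select-Object ReplicaFqdn, UpToDate, LastStatusReport
```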

I don’t know for certain what caused this issue. My current thinking is that KB2975331 was applied and somehow caused the TLS certificate issue. I think this because the following day we built two new Edge servers in a different region and patched the OS fully up to date. We couldn’t get any replication to happen until we put in the reg values mentioned above. Go figure.

Hope this helps someone out.