Friday, February 01, 2019

"A connection timeout has occurred on a previously established connection to availability replica"

I spent some time troubleshooting this issue at one client, and after running into it twice more this year, I figured it deserved a blog post. The short version: yes, the fix, which is delivered in a CU for SQL Server 2012, 2014, and 2016, does resolve the issue.
Message 35201: A connection timeout has occurred while attempting to establish a connection to availability replica 'replicaname' with id [availability_group_id]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance. 
Message 35206: A connection timeout has occurred on a previously established connection to availability replica 'replicaname' with id [availability_group_id]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
If you are troubleshooting the above errors, make sure you are on one of these versions (or later)*:
  • SQL Server 2016 RTM CU5 or SP1 CU1
  • SQL Server 2014 SP2 CU4
  • SQL Server 2012 SP3 CU7 
*This patch was out before SQL Server 2017 was released, so SQL Server 2017 is not susceptible.
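
If you want to script the version check above, here is a minimal Python sketch. It takes the string returned by SERVERPROPERTY('ProductVersion') and reports whether the build is at or past the patched level. The build numbers are my assumptions, mapped from public build lists rather than from the KB article, so verify them before relying on this. The function names (parse_version, has_fix) are just illustrative. Note that SQL Server 2016 needs special handling: SP1 RTM has a higher build number than RTM CU5 but does not contain the fix.

```python
# Assumed build numbers for the patched releases listed above.
# These are NOT from the KB article; check them against Microsoft's build lists.
MIN_2016_RTM_CU5 = (13, 0, 2197)   # SQL Server 2016 RTM CU5 (assumed)
SP1_2016_BASELINE = (13, 0, 4001)  # SQL Server 2016 SP1 RTM (assumed)
MIN_2016_SP1_CU1 = (13, 0, 4411)   # SQL Server 2016 SP1 CU1 (assumed)
MIN_2014_SP2_CU4 = (12, 0, 5540)   # SQL Server 2014 SP2 CU4 (assumed)
MIN_2012_SP3_CU7 = (11, 0, 6579)   # SQL Server 2012 SP3 CU7 (assumed)


def parse_version(text):
    """Turn a ProductVersion string such as '13.0.4411.0' into (13, 0, 4411)."""
    parts = text.strip().split(".")
    return tuple(int(p) for p in parts[:3])


def has_fix(product_version):
    """Return True if the build is at or past the patched level for its major version."""
    v = parse_version(product_version)
    major = v[0]
    if major >= 14:
        # SQL Server 2017 and later: not susceptible, per the footnote above.
        return True
    if major == 13:
        # 2016 has two branches: RTM CUs sit below the SP1 baseline.
        if v < SP1_2016_BASELINE:
            return v >= MIN_2016_RTM_CU5
        return v >= MIN_2016_SP1_CU1
    if major == 12:
        return v >= MIN_2014_SP2_CU4
    if major == 11:
        return v >= MIN_2012_SP3_CU7
    # Availability Groups were introduced in SQL Server 2012 (major version 11).
    raise ValueError("Availability Groups require SQL Server 2012 or later")


if __name__ == "__main__":
    for build in ("13.0.4001.0", "13.0.4411.0", "12.0.5540.0"):
        print(build, "patched" if has_fix(build) else "NOT patched")
```

A plain "greater than or equal" comparison is enough for 2012 and 2014 because every earlier service pack has a lower build number. Only the 2016 RTM and SP1 branches overlap in a way that needs the extra check.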

This issue was very problematic because once databases stopped synchronizing, the primary replica could not truncate its transaction logs, so the log files kept growing. Left alone, this would eventually cause an outage when the volume filled to capacity. Despite our best efforts, and just as the KB article says, there is no fix other than rebooting the secondary or removing and recreating the replica. Rebooting or removing the secondary replica doesn't necessarily impact production, but it does impact high availability.

I have encountered this error in multiple environments: once in an Availability Group with 50+ databases, and once in one with just 3 databases, one of which had constant high transactional volume. According to the KB article, "This problem might occur only on very powerful computers and when SQL Server is very busy. For example, in one scenario, this problem occurred on a very busy system with 24 cores."