SQL Tact: availabilitygroups

Showing posts with label availabilitygroups. Show all posts

Thursday, October 31, 2019

Availability Groups: What to do without MultiSubNetFailover = True

I received an email from a client who is having issues with third-party applications connecting to their three-subnet SQL Server Availability Group. After an exchange with Microsoft Support, they discovered that the applications weren't specifying MultiSubNetFailover = True in their connection strings. As a result, because RegisterAllProvidersIP = 1 in the cluster, connections were randomly experiencing high latency upon connecting, as client-side DNS queries over time had a 66% chance of returning the wrong IP from the listener.

They set RegisterAllProvidersIP = 0, but before you take that as advice keep reading. That fixed the connectivity latency problem for now. I'm not sure whether the application connection strings can't or won't be changed to include MultiSubNetFailover = True. This decision was made either because of vendor limitations and/or because of vendor reliance on old data providers. Here was my guidance regarding Microsoft's recommendation to specify RegisterAllProvidersIP = 0 and lower the HostrecordTTL to 120 (seconds):

I guess this is good quick fix guidance but the problem really is on the application side. Assuming those connection strings or providers can’t change, sure. But you lose the ability to failover fast and you will have increased load on DNS servers.

With RegisterAllProvidersIP disabled on the client access point (the cluster network), only one IP address is made available for the listener: the IP of the Listener in the primary replica's subnet. When RegisterAllProvidersIP is enabled, all site IPs for the listener are simultaneously listed. Connection strings using MultiSubnetFailover = True will try all IPs simultaneously and use the one that responds, providing for the fastest possible transition after an availability group failover. That's desirable.

With RegisterAllProvidersIP = 1, caveat remains that any connection strings that aren't using or can't use MultiSubnetFailover will have connection problems. So RegisterAllProvidersIP should be enabled only when MultiSubnetFailover=True can be used in all application connection strings.

With RegisterAllProvidersIP=0, failovers may be problematic. HostrecordTTL = 120 is 2 minutes. Availability Group failovers will result in up to a 2 minute outage, after which time the client OS will query DNS for new information. With RegisterAllProvidersIP =1, failovers can be more or less instantaneous, or at least not hampered by DNS.

Also by lowering the HostrecordTTL, traffic to and constant load on your DNS server(s) will increase tenfold, as the default is 20 minutes. What MS has recommended is a short-term solution but it’s not scalable obviously. It’s not getting to the real source of problem – the connection strings.

It would be much better if we got the third-party applications to change their connection strings to use MultiSubnetFailover = True, and then you could enable RegisterAllProvidersIP =1. This has been around since 2012, it should be available if they are using any modern data provider. If the third party app is still using OLEDB, there is a new data provider for MSOLEDBSQL released in 2018 that works for all previous MS OLE DB implementations. The new ODBC Driver 17 supports Multisubnetfailover, if they’re using ODBC. If the vendor apps are still using something old like SQLNCLI10 (from 2008), or something non-Microsoft, check for newer versions. More info here: https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/always-on-client-connectivity-sql-server?view=sql-server-ver15

Wednesday, June 05, 2019

Actual Emails: Will MSOLEDB work for connecting to an Availability Group?

Scenario:

We need to configure an existing legacy application from an external vendor to talk to our new SQL Server 2017 Availability Group, which spans multiple subnets. In the end, that last bit is the key. The old data provider MSOLEDB will work for connecting to single-subnet Availability Group listeners, but won't work consistently when connecting to a multisubnet Availability Group's listener. The key is the ability to specify MultiSubNetFailover=True in the connection parameters.

Client's software vendor:

The connection used is the Microsoft OLE DB Provider for SQL Server that is supplied by Microsoft to create the Data Link Properties. When configuring our Data Link, we use the "Microsoft OLE DB Provider for SQL Server". The connection string is formatted:

Provider=SQLOLEDB.1;Password="whatever";Persist Security Info=True;User ID=username;Initial Catalog=Test;Data Source=ServerName

Answer:

Good info, but we do need to make a change here. SQLOLEDB is the provider from back in the SQL 2000 era. Do not recommend its use for new development. It has been replaced by the Native Client (SNAC), which has since been replaced by MSOLEDBSQL (I linked below). It should be easy and transparent to upgrade the provider from SQLOLEDB with no negative impact.

Here’s why we need to upgrade the data provider to talk to our SQL Availability Group. The SQL Server Listener for a multi-subnet Availability Group actually has two IP’s. When you perform at a command line:

Nslookup SQLListenerNameWhatever

You get back an IP in each subnet (in our case, two), for each replica SQL instance in the Availability Group.

When a connection string uses MultiSubNetFailover=True and connecting to the Availability Group Listener name (not the IP or either SQL Server instance name), BOTH IP’s are tried simultaneously and immediately, and the driver talks only to the IP that replies: the primary replica. After a failover, the other IP begins to reply immediately, and so there is no delay in reconnectivity when a failover occurs.

Without specifying MultiSubNetFailover=True, your application will (essentially randomly) pick ONE of the two IP’s for the Listener, and try it. There is no way to “rig” one IP to be first consistently over time. If it picks the primary replica, everything works! If it picks the IP for the current secondary replica… your application’s connection timeout will have to expire and then try the next IP. This is why I’m bringing this up – the application will timeout upon SQL login without MultiSubNetFailover=True.

This hasn’t been an issue with your other clients if they aren’t using a multisubnet availability group. If they have an Availability Group all inside only one subnet, then the Listener only has one IP in DNS, and MultiSubNetFailover=True isn’t required.

You should be fine to install the MSOLEDBSQL provider released in 2018 and use that in your data link. Obviously it should be tested for due diligence, but it should work. At the very least, you could try instead the SQL Native Client 11 (SQLNCLI11), which was released for SQL Server 2012, and it also should work just fine for both OLEDB or ODBC.

Let me know if you have any more questions.

Connection string information for SQL Server Availability Groups:

The SQL Native Client (SNAC) Sqlncli10 will NOT work for connecting to a multisubnet Availability Group because it cannot specify MultiSubNetFailover=True. Must use the newer SQLNCLI11, however, the SQL Native Client (SNAC) is no longer maintained by Microsoft, so you should instead use the new discrete OLE DB or ODBC drivers.

Recommend convert your connection strings from sqlncli1n or sqloledb, to MSOLEDBSQL. OLEDB has been un-deprecated via the new MSOLEDBSQL provider released in 2018. Download the new driver here if needed: https://www.microsoft.com/en-us/download/details.aspx?id=56730 (Note that of the two identically-named files, the 64-bit one is the larger file size of the two.)

There is also the ODBC Driver 17 for SQL Server provider, and the newer ODBC Driver 18 for SQL Server.

In SSIS, the new MSOLEDBSQL OLE DB driver appears as "Native OLE DB\Microsoft OLE DB Driver for SQL Server". The old OLE DB driver is "Native OLE DB\Microsoft OLE DB Provider for SQL Server".

Note: you also need to make sure your Windows Cluster has RegisterAllProviderIP's set to ON for a multisubnet Availability Group!

Edit March 2023: Added latest OLE DB/ODBC information.

Friday, February 01, 2019

"A connection timeout has occurred on a previously established connection to availability replica"

I previously spent some time troubleshooting this issue at one client, and then having encountered it twice more this year, I figured I'd include it in a blog post. Yep, the fix, which is delivered in a CU for SQL 2012, 2014, or 2016 does fix the issue.

Message 35201: A connection timeout has occurred while attempting to establish a connection to availability replica 'replicaname' with id [availability_group_id]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

Message 35206: A connection timeout has occurred on a previously established connection to availability replica 'replicaname' with id [availability_group_id]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

If you are troubleshooting the above errors, make sure you are on one of these versions (or later)*

SQL Server 2016 RTM CU5 or SP1 CU1
SQL Server 2014 SP2 CU4
SQL Server 2012 SP3 CU7

*This patch was out before SQL Server 2017 was released, SQL 2017 is not susceptible.

This issue was very problematic because as databases stopped synchronizing, the log files on the primary replica continued to grow. This would eventually create an outage once the volume filled to capacity. Despite all our best efforts, like the KB article says, there's no fix other than rebooting the secondary or removing/recreating the replica. Obviously rebooting/removing the secondary replica doesn't necessarily impact production, but it does impact high availability.

I have encountered this error in multiple environments, once in an Availability Group with 50+ databases, and another also with just 3 databases, one of which had constant high-transactional volume. According to the KB article, "This problem might occur only on very powerful computers and when SQL Server is very busy. For example, in one scenario, this problem occurred on a very busy system with 24 cores."

Thursday, July 12, 2018

Baton Rouge User Groups Networking Night

What a great evening of interesting presentations, professional networking, and Jambalaya!

We had 5 speakers last night on topics covering: Availability Groups (me), software documentation, leveraging the SQL Server community, front-end design principals, and augmented reality!

Folks in attendance included people brand new to the User Groups, students, job seekers, hiring managers, junior-level to senior-level DBA's and Developers, bosses and their employees! Plus we had jambalaya from Pot & Paddle, our same vendor for SQLSaturday Baton Rouge.

I have uploaded my opening presentation, Availability Groups in 10 Minutes, to my Github.

Folks stayed over an hour afterwards networking!

Tuesday, December 19, 2017

Start/Stop Group of Azure RM VM's

For testlab purposes, I often have a group of Azure VMs to manually spin up/spin down at once. I've had this script in use for a while and decided I'd comment it up and share.

The SQL Server Availability Groups template in the Azure VM Gallery, for example, creates five VMs in a resource group. Who wants to use the Azure portal in browser to click, wait, x5? And if you try to start/stop too many VMs at a time, they will error out.

PowerShell provides! Since the Start-AzureRMVM and Stop-AzureRMVM cmdlets wait for each VM to finish changing state, it'll take 2-3 minutes per VM, but it's launched in one manual step.

Note, there's also built-in Azure functionality to shut down VMs on a schedule, no coding required. In this case, the use case for this script is to start up my five VMs for a test Availability Group manually, and bring them down when I'm done with my lab. I may also set up the auto-shutdown as insurance that I won't accidentally leave the VMs running.

The below script is quite verbose and well-commented, including the Setup block to get the required module and log you into Azure. You could certainly shorten and simplify the script, but my goal here is readability, and repeatability. (If you have a suggestion on how this can be improved, I'm all for it.)

Specify a resource group name in $ResourceGroupName, and a list of VM names in $VMs. Important note here: when you have a cluster of VMs, you may want to have them start up and shut down in a specific order. The below example shuts down the domain controller VMs last, and starts them up first.

#Launch VS Code as Administrator

<#
Invoke-Command -script {Install-WindowsFeature -Name "Failover-Clustering" } `
-ComputerName SQLDEV11, SQLDEV12, SQLDEV14, SQLDEV15
Invoke-Command -script {Install-WindowsFeature -Name "RSAT-Clustering-Mgmt" } `
-ComputerName SQLDEV11, SQLDEV12, SQLDEV14, SQLDEV15
Invoke-Command -script {Install-WindowsFeature -Name "RSAT-Clustering-PowerShell" } `
-ComputerName SQLDEV11, SQLDEV12, SQLDEV14, SQLDEV15
#>
#Install-Module SQLSERVER -Force -AllowCLobber 
#Import-Module SQLSERVER

#Must run on the primary node
#TODO: configure initial variable values.

Write-Output "Begin $(Get-Date)"
#Setup: TODO Configure these
$PrimaryReplicaName = "SQLSERVER-0"
$PrimaryReplicaInstanceName = "SQL2K17" #Named instance or DEFAULT for the default instance
$SecondaryReplicaName1 = "SQLSERVER-1"
$SecondaryReplicaInstanceName1 = "SQL2K17" #Named instance or DEFAULT for the default instance
$AvailabilityGroupName = "WWI2017-AG"

#Inventory and test
Get-ChildItem "SQLSERVER:\Sql\$($PrimaryReplicaName)\$($PrimaryReplicaInstanceName)\AvailabilityGroups\$($AvailabilityGroupName)\AvailabilityReplicas\" | Test-SqlAvailabilityReplica | Format-Table

   $AGPrimaryObjPath = "SQLSERVER:\Sql\$($PrimaryReplicaName)\$($PrimaryReplicaInstanceName)\AvailabilityGroups\$($AvailabilityGroupName)\AvailabilityReplicas\$($PrimaryReplicaName+$(IF($PrimaryReplicaInstanceName -ne "DEFAULT"){$("%5C")+$PrimaryReplicaInstanceName} ))"
   $AGPrimaryObj = Get-Item $AGPrimaryObjPath 
   $AGSecondaryObjPath = "SQLSERVER:\Sql\$($PrimaryReplicaName)\$($PrimaryReplicaInstanceName)\AvailabilityGroups\$($AvailabilityGroupName)\AvailabilityReplicas\$($SecondaryReplicaName1+$(IF($SecondaryReplicaInstanceName1 -ne "DEFAULT"){$("%5C")+$SecondaryReplicaInstanceName1} ))"
   $AGSecondaryObj = Get-Item $AGSecondaryObjPath


   
#Set replicas to synchronous before planned failover
        
    Set-SqlAvailabilityReplica `
    -Path $AGPrimaryObjPath `
    -AvailabilityMode SynchronousCommit `
    -FailoverMode "Manual" `
    -ErrorAction Stop
    Set-SqlAvailabilityReplica `
    -Path $AGSecondaryObjPath `
    -AvailabilityMode SynchronousCommit `
    -FailoverMode "Manual" `
    -ErrorAction Stop

#Check for when replicas are synchronized.
Do {
$AGSecondaryObj.Refresh()
$CurrentSync = ($AGSecondaryObj | Select RollupSynchronizationState | Format-Wide | Out-String).Trim()
IF ($CurrentSync -ne "Synchronized") { 
        Write-Output "Waiting for Synchronized state before failover, still $($CurrentSync)"
        Start-Sleep -s 2 
        }
} Until ($CurrentSync -eq 'Synchronized')

#Perform failover
Write-Output "Beginning Failover $(Get-Date)"
Switch-SqlAvailabilityGroup `
    -Path "SQLSERVER:\Sql\$($SecondaryReplicaName1)\$($SecondaryReplicaInstanceName1)\AvailabilityGroups\$($AvailabilityGroupName)\" `
     -ErrorAction Stop `
    #Only include the next line if it is a forced failover
    #-AllowDataLoss -Force
Write-Output "Failover Complete $(Get-Date)"
Start-Sleep -s 10 #Allow failover to resolve

#Return secondary replica to Asynchronous sync
#Note that the values here of Primary and Secondary1 are flipped, because the variables predate the failover.
Invoke-Command -script { `
param($SecondaryReplicaName1, $SecondaryReplicaInstanceName1, $AvailabilityGroupName, $PrimaryReplicaName, $PrimaryReplicaInstanceName)

 Set-SqlAvailabilityReplica `
-Path "SQLSERVER:\Sql\$(($SecondaryReplicaName1))\$(($SecondaryReplicaInstanceName1))\AvailabilityGroups\$(($AvailabilityGroupName))\AvailabilityReplicas\$(($SecondaryReplicaName1)+$(IF(($SecondaryReplicaInstanceName1) -ne "DEFAULT"){$("%5C")+(($SecondaryReplicaInstanceName1))} ))"  `
-AvailabilityMode asynchronousCommit `
-ErrorAction Stop
    Set-SqlAvailabilityReplica `
-Path "SQLSERVER:\Sql\$(($SecondaryReplicaName1))\$(($SecondaryReplicaInstanceName1))\AvailabilityGroups\$(($AvailabilityGroupName))\AvailabilityReplicas\$(($PrimaryReplicaName)+$(IF(($PrimaryReplicaInstanceName) -ne "DEFAULT"){$("%5C")+(($PrimaryReplicaInstanceName))} ))"  `
-AvailabilityMode asynchronousCommit `
-ErrorAction Stop

Get-ChildItem "SQLSERVER:\Sql\$($SecondaryReplicaName1)\$($SecondaryReplicaInstanceName1)\AvailabilityGroups\$($AvailabilityGroupName)\AvailabilityReplicas\" | Test-SqlAvailabilityReplica | Format-Table
} -ComputerName $SecondaryReplicaName1  -Args $SecondaryReplicaName1, $SecondaryReplicaInstanceName1, $AvailabilityGroupName, $PrimaryReplicaName, $PrimaryReplicaInstanceName

Write-Output "End $(Get-Date)"

This is version of the script updated for Resource Manager VMs. For classic VMs, I have a similar script in a blog post here.