Unable to move failed-over-to-DR databases back to the production site

I recently came across a scenario where an Exchange environment, configured in a best-practice state, had failed over to the DR site due to an extended network outage at the primary production site, and then could not be re-seeded and failed back.

The environment was configured very similarly to what is described in Microsoft’s Deploy HA documentation, with its DAG stretched across two sites:

Stock example showing DR site relationship

Instead of the “Replication” network shown in the graphic above, the primary site had a secondary network (subnet 192.168.100.x) that carried DPM backup traffic; the DR site did not have a secondary network.

Although the Exchange databases were mounted and running on the DR server infrastructure, the database copies at the primary site were in a failed replication state. Running Get-MailboxDatabaseCopyStatus showed all databases in a status of DisconnectedAndResynchronizing.

DisconnectedAndResynchronizing state
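For reference, the status check is simply the cmdlet run against the affected server, along these lines (the table formatting here is my own addition):

Get-MailboxDatabaseCopyStatus -Server pcfexch006 | Format-Table Name,Status,CopyQueueLength -AutoSize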

Every step attempted to re-establish synchronization of the databases failed with a variety of error messages; even deleting the existing database files and re-seeding the databases from scratch failed, with most messages pointing to network connectivity issues.

Update-MailboxDatabaseCopy vqmbd06\pcfexch006 -DeleteExistingFiles

Confirm
Are you sure you want to perform this action?
Seeding database copy "VQMBD06\PCFEXCH006".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"):
The seeding operation failed. Error: An error occurred while performing the seed operation. Error: An error occurred
while communicating with server 'DCExchange'. Error: A socket operation was attempted to an unreachable network
10.50.3.15:64327 [Database: VQMBD06, Server: pcfexch006.xxxxx.com]
    + CategoryInfo          : InvalidOperation: (:) [Update-MailboxDatabaseCopy], SeedInProgressException
    + FullyQualifiedErrorId : [Server=PCFEXCH006,RequestId=e0740b4a-7b94-42f5-b3ad-7ee42632f9c4]
 [FailureCategory=Cmdlet-SeedInProgressException] 2D10AE04,Microsoft.Exchange.Management.SystemConfigurationTasks.UpdateDatabaseCopy
    + PSComputerName        : pcfexch006.xxxxx.com

Looking carefully at the error message, the key part is: A socket operation was attempted to an unreachable network 10.50.3.15:64327

Very strange, because when a network test was run, connecting to that IP and TCP port succeeded without any errors.

Test-NetConnection -ComputerName DCExchange -Port 64327


ComputerName     : DCExchange
RemoteAddress    : 10.50.3.15
RemotePort       : 64327
InterfaceAlias   : Ethernet
SourceAddress    : 10.50.2.42
TcpTestSucceeded : True

When Test-ReplicationHealth was run, the ClusterNetwork check came back failed.
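The check was invoked roughly like this (a sketch; the server name is positional for this cmdlet):

Test-ReplicationHealth PCFEXCH006

The relevant portion of the output: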

PCFEXCH006      ClusterNetwork             *FAILED*   On server 'PCFEXCH006' there is more than one network interface
                                                      configured for registration in DNS. Only the interface used for
                                                      the MAPI network should be configured for DNS registration.
                                                      Network 'MapiDagNetwork' has more than one network interface for
                                                      server 'pcfexch006'. Correct the physical network configuration
                                                      so that each Mailbox server has exactly one network interface
                                                      for each subnet you intend to use. Then use the
                                                      Set-DatabaseAvailabilityGroup cmdlet with the -DiscoverNetworks
                                                      parameters to reconfigure the database availability group
                                                      networks.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.

The Failover Cluster Manager was checked, but no errors were found; the networks in question were “Up” and showed green status.
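The same state can also be cross-checked from PowerShell, assuming the FailoverClusters module is available on the mailbox server; it should report the same Up/Down state the GUI does:

Get-ClusterNetwork | Format-Table Name,State,Role,Address -AutoSize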

Looking further at the Test-ReplicationHealth output, the current state is “Misconfigured”, so let’s see how that replication traffic is configured by checking the DAG networks.
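The networks were listed with Get-DatabaseAvailabilityGroupNetwork piped to Format-List, something like:

Get-DatabaseAvailabilityGroupNetwork | Format-List

which returned: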

RunspaceId         : 57e140b2-15ad-4822-9f94-3e1b0d34f491
Name               : MapiDagNetwork
Description        :
Subnets            : {{10.50.3.0/24,Up}, {10.50.2.0/24,Up}}
Interfaces         : {{DCExchange,Up,10.50.3.15}, {pcfexch005,Up,10.50.2.36}, {pcfexch006,Up,10.50.2.42}}
MapiAccessEnabled  : True
ReplicationEnabled : True
IgnoreNetwork      : False
Identity           : VarDAG2016\MapiDagNetwork
IsValid            : True
ObjectState        : New

RunspaceId         : 57e140b2-15ad-4822-9f94-3e1b0d34f491
Name               : ReplicationDagNetwork01
Description        :
Subnets            : {{192.168.100.0/24,Up}}
Interfaces         : {{pcfexch005,Up,192.168.100.218}, {pcfexch006,Up,192.168.100.217}}
MapiAccessEnabled  : False
ReplicationEnabled : True
IgnoreNetwork      : False
Identity           : VarDAG2016\ReplicationDagNetwork01
IsValid            : True
ObjectState        : New

An attempt was made to reset the network state by disabling the automatic configuration and re-enabling it with the following commands:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $true
Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $false

No change, and the seed attempt failed again.

An attempt was then made to remove the backup network (here named “ReplicationDagNetwork01“) from replication traffic with the following commands:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $true

Set-DatabaseAvailabilityGroupNetwork -Identity VarDAG2016\ReplicationDagNetwork01 -ReplicationEnabled:$false

No change was seen, and the seed attempt failed.

Looking further at what options the command offered, the “IgnoreNetwork” option was used:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $true

Set-DatabaseAvailabilityGroupNetwork -Identity VarDAG2016\ReplicationDagNetwork01 -ReplicationEnabled:$false -IgnoreNetwork:$true

Still no change, so I switched back to automatic configuration with the command:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $false

Running Get-DatabaseAvailabilityGroupNetwork | fl showed no visible change, but the site-to-site tunnel showed a massive uptick in usage, so I ran Get-MailboxDatabaseCopyStatus again, and all of the databases that had been stuck in DisconnectedAndResynchronizing were now synchronizing! I retried the reseed process, and it worked!
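For anyone following along, the reseed progress can be watched with something like the following (formatting is my own; CopyQueueLength dropping toward zero indicates the copy is catching up):

Get-MailboxDatabaseCopyStatus -Server pcfexch006 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength -AutoSize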

I’m not sure why the Set-DatabaseAvailabilityGroupNetwork command showed no visible changes, but the changes obviously did take effect: replication was disabled over the backup network (192.168.100.x) and forced over the correct network.

Dropped Calls at 10 min with Skype for Business SIP trunk (& what is SIP ALG?)

The case of the dropped call at the 10 min mark:

We have a SIP trunk provider that, due to their implementation of the RFC requirements, sends a Re-Invite packet down at the 10 minute mark. If you’re running just a single server, there’s no issue: the FE that receives the packet acknowledges it (“got it”), and communication continues. With multiple servers it can be a problem because, in this provider’s case, they only send that Re-Invite packet to the first registered IP. If the outgoing call originated from one of the other servers, that server never sees the Re-Invite, can’t acknowledge it, and the call gets dropped due to timeout.
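Conceptually, the two situations look something like this (a simplified sketch, not an actual trace):

One FE behind the trunk:
    Provider --- Re-Invite (at 10 min) ---> FE1
    FE1 --------- acknowledgement --------> Provider
    call continues

Multiple FEs, call anchored on FE2:
    Provider --- Re-Invite (at 10 min) ---> FE1   (only the first registered IP)
    FE2 never sees the Re-Invite and never answers
    Provider times out ---> call dropped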

To mitigate that case, an SBC was put in the middle so the trunk would only have one IP to communicate with. All was fine and dandy, but because things are always changing in an IT environment, a new random call-drop issue came up. (Ugh!)

To see what was going on, the client and server Skype for Business logs were examined for the times the drops occurred. They showed a standard “BYE”, not an error, so it was on to the next step: infrastructure.

In order to give some concept of the environment, here is a basic diagram of what it looks like:

Internal Diag

Note that there is NAT’ing going on between the networks: the SBC was configured with the IP 192.168.70.44 (in the syslog figures it shows up as “Device”), and the internal FE servers had NAT addresses of 192.168.70.20, .21, and .22. We’ll get back to that in a bit.

Syslog was enabled on the SBC, and after perusing through the logs, this is what I found:

In the process of narrowing down what was going on, we found that the majority of complaints involved calls being dropped at the 10 min mark. Since the Re-Invite gets sent at that time, the SIP provider was in the crosshairs, and that was our main focus. After much going back and forth with them, going back and forth with the SBC vendor, and tearing up the servers, it was time to start fresh and look deeper.

Looking at the logs, there are some calls where the re-invites are handled correctly. Here are two long ones, one lasting 13 min, the other almost 30, both with Re-Invite packets (the second one with three):

Work1-1

Eventually I found some calls that failed. Here’s an example of one that wasn’t handled correctly:

Notice the call is only 2 seconds long, but still dropped!

Non0-1

Here’s a screenshot of the Re-Invite hangup problem:

Non3

Here we see the Re-Invite packet coming down exactly 10 min after the call started (15:08:58). When the internal server answers it with its SDP packet, the address doesn’t get NAT’d, so the SBC tries to send to the internal IP. That communication never happens, and after a period of no traffic (14 sec in this instance), the server assumes the call has ended and sends the BYE packet to drop the call.

So, from looking at all the calls in the syslog viewer, I saw that calls were sporadically failing while others succeeded. There didn’t seem to be any rhyme or reason to it, except that the failure always came right after an SDP packet from the inside server.

This now had me looking at ALG, which in this case is handled by the firewall. What is ALG? To put it simply, it is NAT for SIP SDP packets. What is SDP? While SIP deals with creating, modifying, and tearing down sessions, SDP describes the media within those sessions, and what we’re specifically interested in for this scenario is the IP addresses the two endpoints should talk over. Since one of the endpoints is behind a NAT’d device, ALG is the mechanism that rewrites the data in these packets so they contain IPs the two devices can actually reach!
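To make that concrete, here is a simplified, hypothetical SDP fragment (not an actual capture; the addresses match the environment described above). Coming from the FE server, the origin (o=) and connection (c=) lines carry the internal IP:

o=- 0 0 IN IP4 10.50.2.46
c=IN IP4 10.50.2.46

After the firewall’s SIP ALG rewrites the packet on its way out, those same lines should carry the NAT address instead:

o=- 0 0 IN IP4 192.168.70.22
c=IN IP4 192.168.70.22

When ALG misses that rewrite, the far end is told to talk to 10.50.2.46, an address it cannot reach.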

Thankfully, this firewall makes it easy to capture inbound and outbound packets separately in the same session, producing two separate pcap files that can be compared side by side! Let’s look at packet #24 of a call that does work:

Here is the inbound packet; notice the O (origin) and C (connection) attributes contain the internal IP, as the packet originates from inside the network:

cap1

Once it goes through the firewall, here is the outbound version of that exact same capture; you can see the O and C attributes have been changed to the NAT IP by ALG:

cap2

Here’s an example of a capture from a complete failure. Note that this was from an “updated” firmware version of the firewall that completely broke NAT, so the capture was easy to catch, but the concept is the same as what is happening with the sporadic failures above:

Here is the inbound packet #4; notice the O and C attributes contain the internal IP, as the packet originates from inside the network:

cap3

Once it goes through the firewall, here is the outbound version of that packet #4; you can see the O and C attributes have NOT been changed to the NAT IP by ALG:

cap4

Let’s take a look at that syslog trace again:

We can see the original invite comes from the Skype FE server (10.50.2.46, which is NAT’d to 192.168.70.22), communication goes to the SBC, and then gets sent up to the SIP provider’s CPE SIP router. Everything works great until one of the SDP packets is not translated correctly through ALG. When that happens, the SBC does not know the right address to reach; the packet gets sent to the “Internal” IP, goes nowhere, and 14 seconds later the stream is disconnected with a “BYE” packet due to inactivity.

Non1

Why did this problem arise out of the blue? I’m guessing some firewall firmware versions have not caused any issues, so there have been periods of tranquility, while other firmware versions have “sometimes” caused this issue. The firewall version currently installed is 7.1.5. I’ve tried 7.1.8, 8.0, and 8.0.2, and they are progressively worse (with 8.0 and higher not working at all), so I went back down to 7.1.5 and am awaiting a resolution from the manufacturer.

Another solution is to change the affected network boundaries to be routed instead of NAT’d, so that ALG never comes into the picture. That may not be feasible given your security or infrastructure constraints, but it has solved the issue (temporarily) in this case. It is not a permanent fix, as any SIP conversations going over NAT will continue to be an issue until the vendor resolves it in their firmware.