On Premises vs Cloud – An insight into services uptime and support availability differences

What are you getting by moving to “Cloud services” vs “on-Premise”? Make sure expectations are set with Executive Management as to what they’re gaining, but also losing.

Over the past 30 years I’ve seen the push go from Cloud to On-Prem to Cloud and back countless times. Yes, those specific terms weren’t always used, and technologies evolve, but the pendulum swings back and forth for many reasons. Right now there’s a massive push for “Cloud being the holy grail”: business owners are embarrassed if they’re not there, and strongly feel they’re missing out or doing it wrong.

Over the years, the biggest reasons I’ve seen it swing back to “On-Prem”, staying insourced, or whatever other name is used, are support, speed, service uptime, and reliability!

We all know that “Cloud” is supposed to be so much cheaper when you factor in support costs vs paying full in-house salaries. However, setting expectations is quite important. The saying “you get what you pay for” absolutely applies here.

Let’s take one system as an example: Microsoft Exchange is a complex system, dependent on a very wide range of infrastructure. Yes, to support that service in one’s company an administrator must be well versed in a large variety of systems and technologies, and as a result, that person will be expensive to have on staff.

If you have access to such a resource (on staff, on a retainer, etc.), system availability is high, with rapid fault resolution when events occur.

Amongst many other things, I personally manage the Microsoft Exchange environments for 6 different companies concurrently, 3 of them for over 10 years now. How much of my time does that take up? An average of 60 min a week for all of them combined! (Wait, wha….?? I thought environments like that are a beast to manage? Well, not when they’re configured correctly and maintained.) These are highly available, fully redundant systems, mind you. In those 10+ years, not once has any company been out of email service for over an hour due to systems under my support. (Once an ISP was down for several hours on the US East coast, and that caused a long-lasting service outage for one of the companies.) Have there been issues? Absolutely, but resolution has typically taken under 30 min once I was contacted, with full system availability nearly constant during business hours.

Let’s look at Microsoft 365 Cloud email service in comparison:

I was recently hired by a very large company to migrate their on-premise Exchange service to 365, and in just the first 6 months of doing so, their email outages have already included:

  1. Over 4 hours
  2. Over an entire day
  3. Half a day
  4. Several 1 hour outages
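
To put those outages in perspective, here is a rough availability calculation. It is only a sketch: the list gives lower bounds (“over 4 hours”, “over an entire day”), so the durations below are assumptions, “half a day” is taken as 12 hours, “several” is assumed to mean three, and the six-month window is taken as 182 days.

```python
# Rough availability estimate for the six-month window described above.
# All durations are assumptions: the list only gives approximate lower bounds.
PERIOD_DAYS = 182                     # assumed length of "the first 6 months"
OUTAGE_HOURS = [4, 24, 12, 1, 1, 1]   # 4 h, a full day, half a day, three 1 h outages


def availability_percent(period_hours, outage_hours):
    """Return uptime as a percentage of the total period."""
    downtime = sum(outage_hours)
    return 100.0 * (period_hours - downtime) / period_hours


period_hours = PERIOD_DAYS * 24
downtime_hours = sum(OUTAGE_HOURS)
uptime_pct = availability_percent(period_hours, OUTAGE_HOURS)

print(f"Downtime: {downtime_hours} h of {period_hours} h")
print(f"Availability: {uptime_pct:.2f}%")
```

Under these assumptions that is roughly 43 hours of downtime, or about 99% availability counted around the clock. Since the real outages were longer than the lower bounds used here, the true figure would be somewhat worse.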

If these were systems I was in charge of managing, I would, with very good reason, be out of a job! Everyone knows that “Cloud” is the best though, so we just work around it, and chalk it up to “eh, it’s what management wants….”

Let’s talk about 365 support for a bit:

When you call in for support, mean time to incident resolution ranges from several hours to several days! Unless you spend a very good amount of money on fast support, the only available option is submitting a support request on the portal and waiting for someone to call you back (typically in a couple of hours). Hopefully you’re available to work on the request when they do, but the vast majority of the time you’re not, so realistically that support ticket can span several days! In my experience, close to 90% of the time I get the call back when I have no ability to work on the issue. It’s maddening! And these support requests are not for an entire system being down (for those, you have zero visibility into “why, when will it come back up, etc….”, best of luck…). I’m talking about the wide range of reasons you have to call in support simply because you don’t control or have access to the full infrastructure.

There are loads of reasons to move your infrastructure to the “cloud”, but if you do, make sure expectations are set with Executive Management as to what they’re gaining, but also losing, by doing so. In my experience, service availability and performance are worse, and features may be lost, all for the same cost (uh, cloud is usually higher) as licensing and supporting on-premise solutions.

Here are some links for very large Office 365 outages during September/October. There have been other large ones earlier that a simple web search can bring up:

https://www.bloomberg.com/news/articles/2020-09-28/microsoft-says-office-365-teams-other-online-services-are-down

https://www.forbes.com/sites/daveywinder/2020/09/29/what-caused-the-massive-microsoft-teams-office-365-outage-yesterday-heres-what-we-know

Unable to move failed-over-to-DR databases back to production Site

I recently came across a scenario, where an Exchange environment that had been configured in a Best Practice state had failed over to the DR site due to an extended network outage at the primary production site, and was unable to re-seed back and fail back over.

The environment was configured very similarly to what is described in the Deploy HA documentation by Microsoft, and had its DAG configured across two sites:

[Figure: Stock example showing DR site relationship]

Instead of the “Replication” network shown in the above graphic, the primary site had a secondary network (subnet 192.168.100.x) on which DPM backup services ran. The DR site did not include a secondary network.

Although the Exchange databases were mounted and running on the DR server infrastructure, replication was in a failed state at the primary site. Running the Get-MailboxDatabaseCopyStatus command showed all databases in a status of DisconnectedAndResynchronizing:

[Figure: DisconnectedAndResynchronizing state]

Every step attempted to re-establish synchronization of the databases failed with a different error message. Even deleting the existing database files and trying to re-seed the databases failed, with most messages pointing to network connectivity issues.

Update-MailboxDatabaseCopy vqmbd06\pcfexch006 -DeleteExistingFiles

Confirm
Are you sure you want to perform this action?
Seeding database copy "VQMBD06\PCFEXCH006".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"):
The seeding operation failed. Error: An error occurred while performing the seed operation. Error: An error occurred
while communicating with server 'DCExchange'. Error: A socket operation was attempted to an unreachable network
10.50.3.15:64327 [Database: VQMBD06, Server: pcfexch006.xxxxx.com]
    + CategoryInfo          : InvalidOperation: (:) [Update-MailboxDatabaseCopy], SeedInProgressException
    + FullyQualifiedErrorId : [Server=PCFEXCH006,RequestId=e0740b4a-7b94-42f5-b3ad-7ee42632f9c4]
 [FailureCategory=Cmdlet-SeedInProgressException] 2D10AE04,Microsoft.Exchange.Management.SystemConfigurationTasks.UpdateDatabaseCopy
    + PSComputerName        : pcfexch006.xxxxx.com

Looking carefully at the error message, the error says: A socket operation was attempted to an unreachable network 10.50.3.15:64327

Very strange, because a network test showed no errors connecting to that IP and TCP port.

Test-NetConnection -ComputerName DCExchange -Port 64327


ComputerName     : DCExchange
RemoteAddress    : 10.50.3.15
RemotePort       : 64327
InterfaceAlias   : Ethernet
SourceAddress    : 10.50.2.42
TcpTestSucceeded : True
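
A test like this leaves from whichever local interface routing picks (10.50.2.42 in the output above). On a multi-homed server it can be useful to repeat the check from each local address. The following is a minimal sketch of that idea in Python, not a description of how Exchange seeding chooses its path. The helper name `tcp_reachable` is hypothetical, and the example addresses in the comments are taken from the outputs in this post (192.168.100.217 is the backup-network address listed in the DAG network output further down).

```python
import socket


def tcp_reachable(host, port, source_ip=None, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    source_ip optionally pins the local address the connection leaves from,
    which is how a specific network interface can be tested.
    """
    source = (source_ip, 0) if source_ip else None  # port 0 = any free local port
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=source):
            return True
    except OSError:
        # Covers refused, unreachable network, timeout and name-resolution errors.
        return False


# Example calls (run on the server being diagnosed):
#   tcp_reachable("10.50.3.15", 64327)                               # default route
#   tcp_reachable("10.50.3.15", 64327, source_ip="10.50.2.42")       # MAPI interface
#   tcp_reachable("10.50.3.15", 64327, source_ip="192.168.100.217")  # backup interface
```

If the call succeeds from one source address but fails from another, that would suggest the traffic is trying to leave over an interface with no route to the remote site.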

When Test-ReplicationHealth was run, the ClusterNetwork check came back as failed:

PCFEXCH006      ClusterNetwork             *FAILED*   On server 'PCFEXCH006' there is more than one network interface
                                                      configured for registration in DNS. Only the interface used for
                                                      the MAPI network should be configured for DNS registration.
                                                      Network 'MapiDagNetwork' has more than one network interface for
                                                      server 'pcfexch006'. Correct the physical network configuration
                                                      so that each Mailbox server has exactly one network interface
                                                      for each subnet you intend to use. Then use the
                                                      Set-DatabaseAvailabilityGroup cmdlet with the -DiscoverNetworks
                                                      parameters to reconfigure the database availability group
                                                      networks.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
                                                       Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
                                                      Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
                                                      Current state is 'Misconfigured'.
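
The repeated lines make this output hard to read. As a small sketch, the text can be collapsed into one line per network and subnet with a regular expression. The sample below is a shortened copy of the output above, and the function name `summarize_subnets` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
SUBNET_LINE = re.compile(r"Subnet '([^']+)' on network '([^']+)' is not Up")


def summarize_subnets(report_text):
    """Count how often each (network, subnet) pair is reported as not Up."""
    return Counter((network, subnet)
                   for subnet, network in SUBNET_LINE.findall(report_text))


# Shortened sample of the Test-ReplicationHealth text shown above.
sample = """\
Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
Current state is 'Misconfigured'.
Subnet '10.50.3.0/24' on network 'MapiDagNetwork' is not Up.
Current state is 'Misconfigured'.
Subnet '192.168.100.0/24' on network 'MapiDagNetwork' is not Up.
 Current state is 'Misconfigured'.
Subnet '10.50.2.0/24' on network 'MapiDagNetwork' is not Up.
Current state is 'Misconfigured'.
"""

summary = summarize_subnets(sample)
for (network, subnet), count in sorted(summary.items()):
    print(f"{network}: {subnet} reported {count}x")
```

Summarized this way, one detail in the full output stands out: all three subnets, including the backup subnet 192.168.100.0/24, are reported under 'MapiDagNetwork'.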

The Failover Cluster Manager was checked, but no errors were found, and the networks in question were “Up”, and in green status.

Looking further at the Test-ReplicationHealth output, the current state is “Misconfigured”, so let’s see how the replication traffic is configured. The following shows the output of Get-DatabaseAvailabilityGroupNetwork:

RunspaceId         : 57e140b2-15ad-4822-9f94-3e1b0d34f491
Name               : MapiDagNetwork
Description        :
Subnets            : {{10.50.3.0/24,Up}, {10.50.2.0/24,Up}}
Interfaces         : {{DCExchange,Up,10.50.3.15}, {pcfexch005,Up,10.50.2.36}, {pcfexch006,Up,10.50.2.42}}
MapiAccessEnabled  : True
ReplicationEnabled : True
IgnoreNetwork      : False
Identity           : VarDAG2016\MapiDagNetwork
IsValid            : True
ObjectState        : New

RunspaceId         : 57e140b2-15ad-4822-9f94-3e1b0d34f491
Name               : ReplicationDagNetwork01
Description        :
Subnets            : {{192.168.100.0/24,Up}}
Interfaces         : {{pcfexch005,Up,192.168.100.218}, {pcfexch006,Up,192.168.100.217}}
MapiAccessEnabled  : False
ReplicationEnabled : True
IgnoreNetwork      : False
Identity           : VarDAG2016\ReplicationDagNetwork01
IsValid            : True
ObjectState        : New
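
As a sanity check on this output, the interfaces can be cross-checked against the subnets with a few lines of Python. This is only a sketch: the data is transcribed by hand from the output above, the helper names are hypothetical, and the dictionary layout assumes one interface per server per network.

```python
import ipaddress

# Transcribed from the Get-DatabaseAvailabilityGroupNetwork output above.
DAG_NETWORKS = {
    "MapiDagNetwork": {
        "subnets": ["10.50.3.0/24", "10.50.2.0/24"],
        "interfaces": {
            "DCExchange": "10.50.3.15",
            "pcfexch005": "10.50.2.36",
            "pcfexch006": "10.50.2.42",
        },
    },
    "ReplicationDagNetwork01": {
        "subnets": ["192.168.100.0/24"],
        "interfaces": {
            "pcfexch005": "192.168.100.218",
            "pcfexch006": "192.168.100.217",
        },
    },
}


def stray_interfaces(networks):
    """Return (network, server, ip) for interfaces outside their network's subnets."""
    strays = []
    for name, cfg in networks.items():
        subnets = [ipaddress.ip_network(s) for s in cfg["subnets"]]
        for server, ip in cfg["interfaces"].items():
            addr = ipaddress.ip_address(ip)
            if not any(addr in net for net in subnets):
                strays.append((name, server, ip))
    return strays


def servers_missing_from(networks, network_name):
    """Return DAG members that have no interface on the given network."""
    all_servers = set()
    for cfg in networks.values():
        all_servers.update(cfg["interfaces"])
    return sorted(all_servers - set(networks[network_name]["interfaces"]))


print("Stray interfaces:", stray_interfaces(DAG_NETWORKS))
print("No interface on ReplicationDagNetwork01:",
      servers_missing_from(DAG_NETWORKS, "ReplicationDagNetwork01"))
```

Every interface sits inside one of its network's subnets, so nothing here looks misconfigured on its face. What the check does highlight is that DCExchange (which appears to be the DR server) has no interface on ReplicationDagNetwork01, even though that network has ReplicationEnabled set to True.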

An attempt was made to reset the network state by disabling the automatic configuration and re-enabling it with the following commands:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $true
Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $false

No change, and the seed attempt failed again.

An attempt was then made to remove the Backup network (here named “ReplicationDagNetwork01”) from the replication traffic with the following commands:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $true

Set-DatabaseAvailabilityGroupNetwork -Identity VarDAG2016\ReplicationDagNetwork01 -ReplicationEnabled:$false

No change was seen, and the seed attempt failed.

Looking further at what options the command had, the “IgnoreNetwork” option was tried:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $true

Set-DatabaseAvailabilityGroupNetwork -Identity VarDAG2016\ReplicationDagNetwork01 -ReplicationEnabled:$false -IgnoreNetwork:$true

Still no change, so I set back the autoconfiguration with the command:

Set-DatabaseAvailabilityGroup VarDAG2016 -ManualDagNetworkConfiguration $false

Running Get-DatabaseAvailabilityGroupNetwork | fl showed no visible change, but the Site-to-Site tunnel showed a massive uptick in usage. I ran the Get-MailboxDatabaseCopyStatus command again, and all the databases that had been in a status of DisconnectedAndResynchronizing were now synchronizing! I retried the reseed process, and it worked!

I’m not sure why the Set-DatabaseAvailabilityGroupNetwork command showed no visible changes, but the changes evidently did take effect: replication was disabled over the BackupNet (192.168.100.x) and forced over the correct network.

An insight into a hacked Exchange server

Matthieu Faou at ESET just published a whitepaper detailing how the sophisticated spy network Turla quietly used a backdoor on Microsoft Exchange servers that gave the attackers unprecedented access to the emails of at least three targets over several years! The fascinating whitepaper is located here: ESET LightNeuron Whitepaper

Emails arrive on mobile device but not Outlook client

In a single AD domain with an Exchange 2016 environment hosting multiple email domains, there was a power user with several mailboxes under different email suffixes whose fully patched Outlook 2016 client would sporadically stop receiving inbound email. (The 2013 client behaved exactly the same.)

The Exchange system is a simple 2-server setup. The databases are replicated in a DAG, with several different databases split out by company/department.

[Figure: Exchange DB1]

As you see in the figure, User1 has four different user accounts with four different mailboxes, each with a different suffix, all hosted on the same database. He is from Company1, but needs to receive separate email in different mailboxes (and reply from those unique email addresses), and to authenticate separately.

After several hours of combing through the environment, with Microsoft support services unable to find anything amiss, one of the tests was to create a new Outlook profile, add just one user account, and test. Well, what do you know, it works! When a second mailbox is added to the profile, though, inbound mail to the client stops. (Again, a mobile device receives the inbound mail immediately, but nothing arrives in the desktop Outlook client.)

A hint on how to fix it came when I looked at User2. In this case, User2 also was opening up multiple mailboxes with the same clients, but there were no issues at all. As is evident, even though the mailboxes open from the same Exchange environment, the back end databases are separate.

After creating a new database for “@Othersuffix.com”, and migrating the User1 mailbox over to it, when that additional mailbox was opened in Outlook, mail flow continued!

The Exchange environment pictured has a lot more complexity. To end users it looks completely separate, with seemingly different Auth Domains, DNS, URLs, etc., but in reality it is all the same back end infrastructure for ease of maintenance (hint: KEMP is used to do a bunch of backward and forward URL rewriting), so adding some additional mailbox databases in the back end didn’t really complicate things too much.