Broken ADFS! Service Unavailable – Error 503

Came in this morning to a lovely issue: ADFS-authenticated services (Office 365 archive mailboxes, hosted CRM, etc.) were completely unavailable!

Testing the URL /adfs/services/trust/mex displayed a lovely “Error 503”!

I changed the internal ADFS certs to use the new EKU requirements (Server and Client Authentication), verified NT SERVICE\drs and NT SERVICE\adfssrv had the correct permissions on the private keys, but still no dice for external usage.

After using my trusty bing.com, I came across this lovely Microsoft article about the KeySpec property for the Web Application Proxy server: https://docs.microsoft.com/en-us/windows-server/identity/ad-fs/technical-reference/ad-fs-and-keyspec-property

Checking the server’s certificates with the PowerShell command dir cert:/LocalMachine/My reveals the following problem:

KeySpec = 0
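If you have several servers to check, the KeySpec test is easy to script. Below is a minimal Python sketch, assuming you have captured the certificate listing as text that contains lines of the form `KeySpec = N` (the `find_bad_keyspecs` helper and the sample text are hypothetical). The constants are the standard CryptoAPI values; AT_KEYEXCHANGE (1) is the one the certutil command below asks for.

```python
import re

# Standard CryptoAPI KeySpec constants.
AT_KEYEXCHANGE = 1
AT_SIGNATURE = 2

KEYSPEC_RE = re.compile(r"KeySpec\s*=\s*(\d+)")


def find_bad_keyspecs(text: str) -> list[tuple[int, int]]:
    """Return (line_number, keyspec) for every KeySpec that is not AT_KEYEXCHANGE.

    Hypothetical helper: it only looks for 'KeySpec = N' lines, so it works
    on any captured output that uses that format.
    """
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = KEYSPEC_RE.search(line)
        if match:
            value = int(match.group(1))
            if value != AT_KEYEXCHANGE:
                bad.append((lineno, value))
    return bad


if __name__ == "__main__":
    sample = (
        "Subject: CN=sts.example.com\n"
        "KeySpec = 0\n"
        "Subject: CN=other.example.com\n"
        "KeySpec = 1 -- AT_KEYEXCHANGE\n"
    )
    print(find_bad_keyspecs(sample))  # [(2, 0)]
```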

OK, so the fix is easy, right? Just export the cert to a PFX file and import it with:

certutil -csp "Microsoft RSA SChannel Cryptographic Provider" -importpfx extcert.pfx

or, as the article says:

certutil -importpfx extcert.pfx AT_KEYEXCHANGE

In this case, I got a lovely error message: “-importPFX command FAILED: 0x80090029 The requested operation is not supported.”

After looking around for a while, I remembered the article I wrote back in September 2017, “LS Audio/Video Authentication Server Error 19008 – Private Key not found”. I went through that process (splitting the PFX apart with OpenSSL and merging it back together), and what do you know, it worked!

The URL /adfs/services/trust/mex now works perfectly, and all services that depend on ADFS are back up!

Exchange database contains one or more mailboxes…

What do you do when you have what appears to be an empty mailbox database, but you get the dreaded “This mailbox database contains one or more mailboxes, mailbox plans….” message?

The following are some commands to run:

Get-MailboxStatistics -Database DatabaseToRemove | ForEach { Update-StoreMailboxState -Database $_.Database -Identity $_.MailboxGuid -Confirm:$false }

Get-MailboxStatistics -Database DatabaseToRemove | where {$_.DisconnectReason -eq "SoftDeleted"} | foreach {Remove-StoreMailbox -Database $_.database -Identity $_.mailboxguid -MailboxState SoftDeleted}
Get-Mailbox -Database DatabaseToRemove -Archive
Get-Mailbox -Database DatabaseToRemove -PublicFolder
Get-Mailbox -Database DatabaseToRemove -Arbitration
Get-Mailbox -Database DatabaseToRemove -AuditLog

If all of those come back empty and you still have the issue, try removing the database with the -Verbose parameter, which will show you which mailboxes (if any) still reside on it.

Remove-MailboxDatabase DatabaseToRemove -Verbose

If the removal still fails, one possibility is that the database in question is the archive database for a mailbox residing on a different mailbox database.

The following command lists mailboxes that use a specific database as their archive database:

Get-Mailbox -ResultSize Unlimited | where {$_.ArchiveDatabase -eq "DatabaseToRemove"}
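Why doesn’t the earlier -Database check catch these? As the scenario above implies, a mailbox is found by its home (primary) database, while the archive’s location is tracked separately in ArchiveDatabase, which is what the filter above checks. Here is a toy Python model of that distinction; the `Mailbox` class, its fields, and the names are hypothetical and are not the Exchange object model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mailbox:
    name: str
    database: str                           # primary (home) database
    archive_database: Optional[str] = None  # None when no archive is enabled


def on_database(mailboxes, db):
    """Roughly what a check keyed on the primary database finds."""
    return [m.name for m in mailboxes if m.database == db]


def archives_on_database(mailboxes, db):
    """Mailboxes homed elsewhere whose archive still lives on db."""
    return [m.name for m in mailboxes if m.archive_database == db]


def blockers(mailboxes, db):
    """Everything that, in this model, still keeps db from being removed."""
    return sorted(set(on_database(mailboxes, db)) | set(archives_on_database(mailboxes, db)))


if __name__ == "__main__":
    boxes = [Mailbox("alice", "DB01", "DatabaseToRemove"), Mailbox("bob", "DB01")]
    print(on_database(boxes, "DatabaseToRemove"))           # []
    print(archives_on_database(boxes, "DatabaseToRemove"))  # ['alice']
```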

You can migrate just the archive mailbox to another database like so:

New-MoveRequest username -ArchiveOnly

The database can now be removed after the move is completed!

Net Neutrality, my thoughts

My take on Net Neutrality:

The old adage of free market vs. government boundaries! Let’s take a look at some different mediums. The telephone: heavily regulated. How much innovation have we seen there in the past 20 years? What about broadcast spectrum? Sure, we have HDTV now, but innovation? I worked on Microsoft Mediaroom a while back (it’s called AT&T U-verse here in the States). Did you know that 10 years ago it had the capability to let you choose which car you wanted to ride in during a car race, or which angle of the football field or which team you wanted to watch from during a game? All from your remote control! Why haven’t we seen it yet? Government regulations! Now let’s look at a medium not regulated in the slightest: the Internet! When I got involved in it, the rage was BBS technology, banks of modems where files, bulletin boards, etc. were exchanged. Then e-mail took over and went nutso; GroupWise, Lotus Notes, and Microsoft slugged it out. During that, HTML went nutso, and we know that story. Then we have PKI, VoIP, IoT, and a plethora of other technologies. What’s the latest? Blockchain, and who knows what else? Do you see a contrast?

What’s the big difference here? Free market vs heavily regulated mediums!

I’m no expert, but that’s how I see it.

LS Audio/Video Authentication Server Error 19008 – Private Key not found

After happily running for several years, one of the Skype for Business edge servers in one implementation decided it was not going to start its Audio/Video Authentication and Audio/Video Edge services!

Looking at Event Viewer, two event IDs were being raised: 19008 and 19005. Specifically, 19008: Private key for server certificate not found by the LS A/V Authentication service or the service does not have sufficient permissions to access the certificate.

After verifying that the private key permissions were set correctly (NETWORK SERVICE: Read, etc.) in the Certificates MMC snap-in, I checked what the certs looked like in PowerShell:

PS Cert:\LocalMachine\My\> dir .\5E670E493E5EBAACC5B26E219ACA8A629F9485D4 | fl

HasPrivateKey  : True
PrivateKey     :
PublicKey      : System.Security.Cryptography.X509Certificates.PublicKey

Notice that no PrivateKey provider is defined here, which means the cert broke somehow! Strange, as this environment had two Skype for Business edge servers: one worked perfectly, the other did not.
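The symptom is easy to spot mechanically: HasPrivateKey says True, yet the PrivateKey field is blank. A minimal Python sketch, assuming Windows PowerShell `fl` (Format-List) output shaped like the listing above; both helper names are hypothetical.

```python
def parse_format_list(text: str) -> dict[str, str]:
    """Parse 'Name : Value' lines from Format-List output into a dict."""
    props = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only; values such as PSPath contain colons.
        name, _, value = line.partition(":")
        props[name.strip()] = value.strip()
    return props


def private_key_looks_broken(props: dict[str, str]) -> bool:
    """True when the cert claims a private key but no provider object is shown."""
    return props.get("HasPrivateKey") == "True" and not props.get("PrivateKey")


if __name__ == "__main__":
    broken = (
        "HasPrivateKey  : True\n"
        "PrivateKey     :\n"
        "PublicKey      : System.Security.Cryptography.X509Certificates.PublicKey\n"
    )
    print(private_key_looks_broken(parse_format_list(broken)))  # True
```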

Anyway, the fix was to tear the certificate apart and put it back together, as shown in the TechNet article “Merge certificate public and private key with OpenSSL”.

Specifically:

    1. I got the OpenSSL binaries from: https://indy.fulgan.com/SSL
    2. I extracted the keys using the following commands:
      openssl pkcs12 -in egdev1.pfx -nocerts -out private_key.pem -nodes
      openssl pkcs12 -in egdev1.pfx -nokeys -out public_key.crt
    3. I merged the keys back together using the following command:
      openssl pkcs12 -export -in public_key.crt -inkey private_key.pem -out lync_edge_merged.pfx
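If you have to repeat this on several servers, the three steps can be wrapped in a script. Here is a minimal Python sketch, assuming `openssl` is on PATH; the function name and the `dry_run` switch are hypothetical. OpenSSL still prompts for the import and export passwords itself, so the console is left attached. Note that `-nodes` writes the private key unencrypted, so delete private_key.pem once the merged PFX is imported.

```python
import subprocess
from pathlib import Path


def split_and_merge(pfx: str, out_pfx: str, workdir: str = ".", dry_run: bool = False):
    """Run the three OpenSSL commands from the steps above, in order.

    With dry_run=True the argument lists are only returned, not executed.
    """
    work = Path(workdir)
    key = str(work / "private_key.pem")
    crt = str(work / "public_key.crt")
    commands = [
        # 1. Private key only, unencrypted (-nodes).
        ["openssl", "pkcs12", "-in", pfx, "-nocerts", "-out", key, "-nodes"],
        # 2. Certificate(s) only, no key.
        ["openssl", "pkcs12", "-in", pfx, "-nokeys", "-out", crt],
        # 3. Re-pack key + certificate into a fresh PFX.
        ["openssl", "pkcs12", "-export", "-in", crt, "-inkey", key, "-out", out_pfx],
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands


if __name__ == "__main__":
    for cmd in split_and_merge("egdev1.pfx", "lync_edge_merged.pfx", dry_run=True):
        print(" ".join(cmd))
```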

After importing the certificate and applying it to the services, I checked again what the cert looked like in PowerShell:

PS Cert:\LocalMachine\My\> dir .\5E670E493E5EBAACC5B26E219ACA8A629F9485D4 | fl

HasPrivateKey  : True
PrivateKey     : System.Security.Cryptography.RSACryptoServiceProvider
PublicKey      : System.Security.Cryptography.X509Certificates.PublicKey

Notice that a PrivateKey provider is now defined, and both the Audio/Video Authentication and Audio/Video Edge services started up perfectly!

Skype for Business presentation size limits.

I recently had an implementation where very large PowerPoint presentations were needed. When those .pptx files were pre-uploaded to the meeting, the dreaded “allowable file size exceeded” message appeared:

It got me interested in finding out what the allowable file sizes are for Skype, and after scouring documentation, I discovered the following:

As of September 2017:

  • With Office Web Apps 2013, the max file size is 150 MB
  • With Office Online Server, the max file size is 300 MB

The explicit limits, where applicable, are listed in the table below. However, note that there is a 60-second file download time out that applies to all GetFile operations, and this time out can affect the perceived file size limit. In practice, this time out is rarely hit, since connectivity and bandwidth are typically very good between Office Online and host datacenters. However, hosts should be aware of this limit.

File size limits

| Application | Mode | Limit | Notes |
|---|---|---|---|
| Excel Online | View | 5 MB | |
| Excel Online | Edit | 5 MB | |
| PowerPoint Online | View | See notes | No limit, but subject to the 60-second time out for file downloads described above. |
| PowerPoint Online | Edit | 300 MB | While the upper limit is 300 MB, this is still subject to the overall 60-second time out for file downloads, so it is possible that smaller files will hit that time out. |
| Word Online | View | See notes | No limit, but subject to the 60-second time out for file downloads described above. |
| Word Online | Edit | See notes | The technical limit is 100,000,000 (100 million) characters in the document XML; however, this does not correlate with file size in a meaningful way. For example, a 1,000-page document hundreds of MB in size does not hit this limit. For the vast majority of use cases, this limit is irrelevant. |

Configuring these maximum sizes is fairly simple; it is done in the “Settings_Service.ini” configuration file, whose default location is:

C:\Program Files\Microsoft Office Web Apps\PPTConversionService

At the bottom of the file, just add the following entries:

For Office Web Apps 2013, add:

PowerPointEditServerMaxFileSizeBytes=(System.UInt64)153600000
PowerPointServerMediaEmbeddedMaxSize=(System.UInt64)153600000

For Office Online Server, add:

PowerPointEditServerMaxFileSizeBytes=(System.UInt64)307200000
PowerPointServerMediaEmbeddedMaxSize=(System.UInt64)307200000
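If you maintain several farms, the edit can be scripted so it is safe to run twice. Below is a minimal Python sketch that works on the file’s text (reading and writing the real file, ideally after a backup, is left to you); `append_entries`, `size_entries`, and the `LIMITS` labels are hypothetical. For reference, the values above are 150 × 1,024,000 and 300 × 1,024,000 bytes rather than exact binary megabytes (150 × 1024² would be 157,286,400).

```python
# Byte values used in the post (hypothetical labels for the two products).
LIMITS = {
    "owa2013": 153_600_000,  # Office Web Apps 2013: 150 x 1,024,000
    "oos": 307_200_000,      # Office Online Server: 300 x 1,024,000
}
KEYS = ("PowerPointEditServerMaxFileSizeBytes", "PowerPointServerMediaEmbeddedMaxSize")


def size_entries(limit_bytes: int) -> list[str]:
    """Build the two 'Key=(System.UInt64)value' lines."""
    return [f"{key}=(System.UInt64){limit_bytes}" for key in KEYS]


def append_entries(ini_text: str, limit_bytes: int) -> str:
    """Append the entries at the bottom, skipping any key already present."""
    existing = {line.split("=", 1)[0].strip() for line in ini_text.splitlines() if "=" in line}
    lines = ini_text.splitlines()
    for entry in size_entries(limit_bytes):
        if entry.split("=", 1)[0] not in existing:
            lines.append(entry)
    return "\n".join(lines) + "\n"
```

Because existing keys are skipped, running it a second time leaves the text unchanged.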

Once the changes are saved, restart the Office Web Apps service (WACSM). You can do so in PowerShell with the following command:

Restart-Service WACSM

Please note, these max sizes apply to the entire meeting, not per attachment. Therefore, if your meeting has much larger files, you will have to split them: upload one, go through it, remove it, then upload the next. You can upload several files at a time, but they cannot collectively exceed the total size limit.
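Taking the post’s statement that the limit applies to the meeting as a whole, planning the uploads amounts to cutting the deck list into consecutive batches that each fit. A minimal Python sketch (the function and file names are hypothetical; sizes and limit in bytes, e.g. 307,200,000 from the INI settings above) that keeps presentation order, so you can present a batch, remove it, and upload the next:

```python
def plan_batches(files: list[tuple[str, int]], limit_bytes: int) -> list[list[str]]:
    """Split files (in presentation order) into consecutive upload batches.

    Each batch's total size stays <= limit_bytes; order is preserved.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_total = 0
    for name, size in files:
        if size > limit_bytes:
            raise ValueError(f"{name} alone exceeds the limit")
        # Start a new batch when this file would push the current one over.
        if current and current_total + size > limit_bytes:
            batches.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    decks = [("intro.pptx", 200_000_000), ("demo.pptx", 150_000_000), ("qa.pptx", 100_000_000)]
    print(plan_batches(decks, 307_200_000))  # [['intro.pptx'], ['demo.pptx', 'qa.pptx']]
```

A greedy cut like this does not minimize the number of batches in every case, but reordering decks to pack them tighter would scramble the meeting’s flow.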

Setting the Network Profile in Windows 2012 or higher

I recently had a non-domain-joined Windows Server 2016 edge machine with two separate NICs that needed different Windows Firewall settings. Why? For instance, I wanted to allow RDP on the internal NIC but not on the external one. The problem was that the NICs had the wrong network profiles: the external, public-facing one was set to Private, and the internal one was set to Public, as shown in the following screen capture:

New with Windows Server 2012 and higher: to change the network profile, you need to use PowerShell cmdlets! Those cmdlets are:

Get-NetConnectionProfile
Set-NetConnectionProfile

Here are the results with the “Get” command:

We can see the profiles are reversed: the “Internet” connection has the “Private” designation, so the wrong Windows Firewall profile is applied to it.

To fix that, we run the “Set” command shown at the bottom of the capture above, and the correct firewall profile is assigned!

Note: I named the external-facing NIC “Public” and the internal-facing one “Private”. You can name them whatever you’d like and identify them with the -InterfaceAlias parameter.
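The exact “Set” command appears only in the screen capture, so here is a minimal Python sketch that generates the commands for a given alias-to-category mapping, using this post’s NIC names. Set-NetConnectionProfile and its -InterfaceAlias and -NetworkCategory parameters are real; the `profile_commands` helper is hypothetical. It only accepts Public or Private, because Windows assigns DomainAuthenticated itself when a domain is detected, which matches the domain-join behavior described below.

```python
# DomainAuthenticated is assigned by Windows, not set by hand.
VALID_CATEGORIES = {"Public", "Private"}


def profile_commands(nics: dict[str, str]) -> list[str]:
    """Build one Set-NetConnectionProfile command per interface alias."""
    commands = []
    for alias, category in nics.items():
        if category not in VALID_CATEGORIES:
            raise ValueError(f"{alias}: category must be Public or Private, got {category!r}")
        commands.append(
            f'Set-NetConnectionProfile -InterfaceAlias "{alias}" -NetworkCategory {category}'
        )
    return commands


if __name__ == "__main__":
    for cmd in profile_commands({"Public": "Public", "Private": "Private"}):
        print(cmd)
```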

The default profile in Windows Server 2012+ is Public. It changes automatically when you join the server to a domain. In my case, I was not joining this server to a domain, so I had to set it manually; on top of that, the automatic designation had been assigned incorrectly.

Dropped Calls at 10 min with Skype for Business SIP trunk. (& what is SIP ALG?)

The case of the dropped call at the 10 min mark:

We have a SIP trunk provider that, due to its implementation of RFC requirements, sends down a Re-Invite packet at the 10-minute mark. If you’re just running a single server, there’s no issue: the FE that gets the packet answers it with a 200 OK (“got it”), and communication continues. With multiple servers it can be an issue, because this provider only sends the Re-Invite packet to the first registered IP. If the call originated from one of the other servers, that server never gets the Re-Invite, so it can’t answer it, and the call gets dropped due to a timeout.

To mitigate this, an SBC was put in the middle so that the trunk would only have one IP to communicate with. All was fine and dandy, but, as things do change in an IT environment, a new random call drop issue came up. (Ugh!)

To see what was going on, we looked at the client and server Skype for Business logs for the time the drop occurred. They showed a standard “BYE”, not an error, so on to the next step: infrastructure.

To give some idea of the environment, here is a basic diagram of what it looks like:

Note that there is NAT’ing going on between the networks. The SBC was configured with the IP 192.168.70.44 (it shows up as “Device” in the syslog figures), and the internal FE servers had NAT addresses of 192.168.70.20, .21, and .22. We’ll get back to that in a bit.

Syslog was enabled on the SBC, and I perused through the logs. Here is what I found:

While narrowing down what was going on, we found that in the majority of complaints the calls were getting dropped at the 10-minute mark. Since the Re-Invite gets sent at that time, the SIP provider was in the crosshairs and became our main focus. After much going back and forth with them and with the SBC vendor, and tearing up the servers, it was time to start fresh and look deeper.

Looking at the logs, there are some calls where the Re-Invites are handled correctly. Here are two long ones, one lasting 13 minutes and the other almost 30, both with Re-Invite packets (the second one with three):

Eventually I found some calls that failed, here’s an example of one that wasn’t handled correctly:

Notice the call is only 2 seconds long, but still dropped!

Here’s a screenshot of the Re-Invite hangup problem:

Here we see the Re-Invite packet coming down exactly 10 minutes after the call started (15:08:58). When the internal server answers it with its SDP packet, the address doesn’t get NAT’d, so the SBC tries to send to the internal IP. That communication never arrives, and after a period of no communication (14 seconds in this instance), the server assumes the call has ended and sends a BYE packet to drop it.

So, looking at all the calls in the syslog viewer, I saw that some calls failed sporadically while others succeeded. There didn’t seem to be any rhyme or reason to it, except that the failure always came right after an SDP packet from the inside server.

This now had me looking at ALG (Application Layer Gateway), which in this case is controlled by the firewall. What is ALG? To put it simply, it is NAT for SIP SDP packets. What is SDP? While SIP deals with creating, modifying, and closing down sessions, SDP (Session Description Protocol) deals with the media within those sessions, and what we’re specifically interested in for this scenario is the IP addresses the two endpoints should talk over. Since one of the endpoints is behind a NAT’d device, ALG is the mechanism that changes the data in these packets so they carry IPs the two devices can actually reach!
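To make “NAT for SDP” concrete, here is a toy Python model of the one rewrite that matters in this post: swapping the internal address for the NAT address in the SDP o= (origin) and c= (connection) lines. The addresses are the FE’s internal and NAT addresses from this post; `alg_rewrite` is hypothetical. Real ALGs do considerably more (SIP headers, media port pinholes), which this sketch ignores.

```python
def alg_rewrite(sdp: str, internal_ip: str, nat_ip: str) -> str:
    """Toy ALG: replace internal_ip with nat_ip in SDP o= and c= lines only."""
    rewritten = []
    for line in sdp.splitlines():
        if line.startswith(("o=", "c=")):
            fields = line.split(" ")
            # The address is the last field of both o= and c= lines.
            if fields[-1] == internal_ip:
                fields[-1] = nat_ip
            line = " ".join(fields)
        rewritten.append(line)
    # SDP lines are CRLF-terminated.
    return "\r\n".join(rewritten) + "\r\n"


if __name__ == "__main__":
    inbound = (
        "v=0\r\n"
        "o=- 0 1 IN IP4 10.50.2.46\r\n"
        "s=session\r\n"
        "c=IN IP4 10.50.2.46\r\n"
        "t=0 0\r\n"
        "m=audio 50000 RTP/AVP 0\r\n"
    )
    print(alg_rewrite(inbound, "10.50.2.46", "192.168.70.22"))
```

The sporadic failures correspond to this rewrite simply not happening for some packets.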

Thankfully, this firewall makes it easy to capture inbound and outbound packets separately in the same session, producing two separate pcap files that can be compared side by side! Let’s look at packet #24 of a call that does work:

Here is the inbound packet, notice the O and C attributes have the internal IP, as it’s originating from inside the network:

Once it goes through the firewall, here is the outbound version of that exact same packet. You can see the O and C attributes have been changed to the NAT IP by ALG:

Here’s an example of a complete failure capture. Note that this was from an “updated” firewall firmware version that completely broke NAT, so the capture was easy to catch, but the concept is the same as what occurs with the sporadic failures above:

Here is the inbound packet #4, notice the O and C attributes have the internal IP, as it’s originating from inside the network:

Once it goes through the firewall, here is the outbound version of that packet #4, you can see the O and C attributes have NOT been changed to the NAT IP by ALG:

Let’s take a look at that syslog trace again:

We can see the original INVITE comes from the Skype FE server (10.50.2.46, which is NAT’d to 192.168.70.22), goes to the SBC, and is then sent up to the SIP provider’s CPE SIP router. Everything works great until one of the SDP packets is not translated correctly by ALG. When that happens, the SBC does not know where to send the packet, so it gets sent to the “internal” IP and goes nowhere; 14 seconds later, the stream is disconnected with a “BYE” packet due to inactivity.
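The failure signature in these captures is easy to check mechanically: an outbound (post-firewall) SDP whose o= or c= line still carries an internal address. Note that the NAT side here (192.168.70.x) is also a private range, so “is it RFC 1918?” is not a useful test; the sketch below checks against the internal subnets you pass in. The `10.50.2.0/24` range is an assumption (the post only gives the FE address 10.50.2.46), and `leaked_addresses` is hypothetical. It uses only Python’s standard `ipaddress` module and assumes you have pulled the SDP body out of the pcap as text.

```python
import ipaddress


def leaked_addresses(sdp: str, internal_networks: list[str]) -> list[str]:
    """Return o=/c= addresses that fall inside any of internal_networks.

    Run it on the outbound SDP: a non-empty result means ALG did not
    translate the packet.
    """
    nets = [ipaddress.ip_network(n) for n in internal_networks]
    leaks = []
    for line in sdp.splitlines():
        if line.startswith(("o=", "c=")):
            token = line.split()[-1].split("/")[0]  # drop any /ttl suffix
            try:
                addr = ipaddress.ip_address(token)
            except ValueError:
                continue  # FQDN or something unparseable
            if any(addr in net for net in nets):
                leaks.append(str(addr))
    return leaks


if __name__ == "__main__":
    outbound = "v=0\r\no=- 0 1 IN IP4 10.50.2.46\r\nc=IN IP4 10.50.2.46\r\n"
    print(leaked_addresses(outbound, ["10.50.2.0/24"]))  # ['10.50.2.46', '10.50.2.46']
```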

Why did this problem arise out of the blue? My guess is that some firewall firmware versions have been good and caused no issues, giving us periods of tranquility, while other versions have “sometimes” caused this issue. The firewall version currently installed is 7.1.5. I’ve tried 7.1.8, 8.0, and 8.0.2, and they are progressively worse (8.0 and higher don’t work at all), so I went back down to 7.1.5 and am awaiting a resolution from the manufacturer.

Another solution is to change the architecture of the affected network boundaries to be routed instead of NAT’d, so that ALG does not come into the picture. That might not be feasible due to your security or infrastructure constraints, but it has solved the issue (temporarily) in this case. It is not a permanent fix, as any SIP conversation going over NAT will continue to be an issue until the vendor resolves it in their firmware.