What to do when ExpressRoutes don’t show up in vNet Peering

I had a small problem the other day with a technical environment and, as sometimes happens, working through it let me dig a bit deeper and understand things better. The solution wasn’t big or clever, but the analysis and configuration checking were worthwhile.

I had built a Minimum Viable Product (MVP) for a client following Microsoft’s standard reference architecture. It was a hub and spoke, with the hub holding connectivity, firewalls and shared services. The spokes were for different workloads, in this case split by function: Prod and Non-Prod. Additional spokes can be added in the future with minimal rework of the basic architecture.

The spokes are connected via vNet peering; the global (cross-region) form of it became GA in March 2018. Traditionally, vNets would most likely have been connected via Azure VPN Gateways. vNet peering opened up opportunities to architect things differently, and hub-and-spoke has become a lot more common.

While before you may have had separate vNets connected to on-premises for different functions, you can now have redundant connections into a hub and then hang the spokes off that hub with your vNet peering. Here is an example diagram:

This diagram is pretty close to my actual environment. As part of the MVP, I stood up the hub and spoke using an Azure Resource Manager (ARM) template, modified from one in the Azure Quickstart Templates GitHub repository. Then I created the vNet peering with some PowerShell to create each peer. I stood up some virtual machines (VMs), checked connectivity and peering, and everything was fine.

A week or so later, dual Azure ExpressRoutes were ready to be provisioned to provide connectivity, and the VPNs were ready too. The VPNs were to be used as secondary routes in case of an MPLS outage that took out ExpressRoute. This was pretty unlikely, but the redundancy was required by the client. The VPN gateways were left with a lower priority than ExpressRoute so they would only be used if the primary connectivity was unavailable.

So, I completed the following to enable ExpressRoute:

  • I created the circuits in Azure using PowerShell (giving me a repeatable artifact for both regions).
  • I gave the service key to the provider; they provisioned things on their side, and I could see the Azure Private circuits as provisioned in Azure.
  • I scripted out provisioning a Virtual Network Gateway and connection for ExpressRoute.
  • And I checked connectivity to the hub vNet over the ExpressRoute and everything worked fine.

Connection failure

Here’s the problem. I then tried to connect to a VM on the Spoke vNet to confirm a connection and this failed. I knew the ExpressRoute was fine so it had to be something with the peering. I checked the Peering configuration again just to be sure I hadn’t made a mistake.

On the hub peer I had enabled:

“Allow Gateway Transit” under Configure Gateway Transit Settings. This lets the peered virtual network (the spoke) use the virtual network gateways deployed in the hub.

I then checked the spoke peering, where I had enabled “Use remote gateways” under Configure Remote Gateways Settings. This means the peering will use the gateways of the peered vNet, in this case the ExpressRoute gateway in the hub vNet. If the spoke already has a gateway of its own, this option can’t be used. So everything was correct on both the hub and spoke peerings.
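The two settings above have to agree with each other, so the check can be written down as a small rule set. The following is only a sketch in plain Python, not a call into Azure: the Peering class, the check_gateway_transit function and the peering names are hypothetical, and the rule that a spoke with its own gateway cannot use a remote one is my reading of the peering constraints rather than something taken from this environment.

```python
from dataclasses import dataclass


@dataclass
class Peering:
    """One side of a vNet peering, reduced to its gateway-related flags."""
    name: str
    allow_gateway_transit: bool = False
    use_remote_gateways: bool = False


def check_gateway_transit(hub: Peering, spoke: Peering,
                          hub_has_gateway: bool,
                          spoke_has_gateway: bool) -> list[str]:
    """Return a list of problems; an empty list means the flags look right."""
    problems = []
    if not hub.allow_gateway_transit:
        problems.append(f"{hub.name}: 'Allow gateway transit' is off")
    if not spoke.use_remote_gateways:
        problems.append(f"{spoke.name}: 'Use remote gateways' is off")
    if spoke_has_gateway:
        # Assumption: a vNet with its own gateway cannot borrow the hub's.
        problems.append(f"{spoke.name}: spoke already has its own gateway")
    if not hub_has_gateway:
        problems.append(f"{hub.name}: no gateway deployed in the hub yet")
    return problems


# Hypothetical names for the two sides of one hub/spoke peering.
hub = Peering("hub-to-prod", allow_gateway_transit=True)
spoke = Peering("prod-to-hub", use_remote_gateways=True)
print(check_gateway_transit(hub, spoke, hub_has_gateway=True,
                            spoke_has_gateway=False))
```

A check like this passes for the configuration described here, which is the point: the flags were right, so the fault had to be somewhere else.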

A simple solution

I then went to a VM in the hub and could see all the ExpressRoute routes there in the Effective Routes on the NIC. When I went to the spoke however, the routes weren’t there on the VM NIC I used for testing. The routes were not being passed on to the spoke vNet.
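Comparing the two Effective Routes lists by eye works for a handful of prefixes, but it can also be done mechanically. Here is a rough sketch that assumes each route has been exported as a dictionary with "prefix" and "source" keys; those key names, the missing_prefixes function and the address ranges are all illustrative, not the real export format or the client's addressing.

```python
def missing_prefixes(hub_routes, spoke_routes, source="VirtualNetworkGateway"):
    """Return gateway-learned prefixes seen on the hub NIC but not the spoke NIC."""
    hub = {r["prefix"] for r in hub_routes if r["source"] == source}
    spoke = {r["prefix"] for r in spoke_routes if r["source"] == source}
    return sorted(hub - spoke)


# Illustrative data only: the hub NIC has learned two on-premises ranges
# from the gateway, while the spoke NIC has only its default routes.
hub_routes = [
    {"prefix": "10.0.0.0/16", "source": "Default"},
    {"prefix": "192.168.0.0/16", "source": "VirtualNetworkGateway"},
    {"prefix": "172.16.0.0/12", "source": "VirtualNetworkGateway"},
]
spoke_routes = [
    {"prefix": "10.1.0.0/16", "source": "Default"},
]
print(missing_prefixes(hub_routes, spoke_routes))
```

An empty result means the spoke is seeing everything the hub learned from the gateway; anything else is a list of routes that never made it across the peering.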

I decided to leave it for a while just in case these things take time. I used to say “SCCM teaches patience” because of the length of time SCCM took to deploy things. Usually with Azure I don’t have to invoke this maxim. But, to give a configuration change a bit of time is still a good rule of thumb.

I came back to it the next day and the behavior was the same. Time for some more head scratching.

Before recreating the peering, which would be a fairly quick job because of the scripting completed, I decided to try one last thing. On the spoke peer I unchecked the “Use remote gateways” option and saved. I then checked it again and saved. When I looked at the Effective Routes on a VM in the spoke, voilà, all the routes were there. I then checked connectivity over MPLS across the ExpressRoute and everything was good.
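The fix is an off-then-on toggle with a save after each change, and the order matters. As a minimal sketch of that sequence: the peering is modelled as a dictionary, and save is a hypothetical callback standing in for however the change is actually applied (portal, PowerShell or an SDK). The useRemoteGateways key mirrors the name of the peering property; the function name and the rest are assumptions for illustration.

```python
from typing import Callable


def refresh_remote_gateways(peering: dict,
                            save: Callable[[dict], None]) -> dict:
    """Turn useRemoteGateways off and back on, saving after each change."""
    for value in (False, True):
        # Copy rather than mutate, so every saved state is a distinct snapshot.
        peering = {**peering, "useRemoteGateways": value}
        save(peering)
    return peering


# Record what would have been saved instead of calling Azure.
saved = []
spoke_peering = {"name": "prod-to-hub", "useRemoteGateways": True}
result = refresh_remote_gateways(spoke_peering, saved.append)
print([p["useRemoteGateways"] for p in saved])
```

Saving twice is deliberate: writing the value that is already set would not necessarily trigger the refresh, so the intermediate off state has to be applied before the option is turned back on.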

So, what went wrong? The issue was that I created the vNets and peerings prior to any Virtual Network Gateways being provisioned. When they were deployed, the peering in the spoke for some reason didn’t pick up the ExpressRoute routes and advertise them out to the vNet. I raised a query with Microsoft to find out if this behavior is expected and it is. This is what they said:

“As you have indeed discovered, creating a hub-and-spoke architecture before a gateway is deployed, despite using the remote gateway option, will not allow traffic to flow from the hub to the spoke vnets.
This happens since the connection remains hanging as it doesn’t know what gateway to use, since a gateway doesn’t exist at the moment the peering is created.
This is a known quirk in the system and it can be indeed resolved by simply unchecking the use remote gateway option, saving and then re-enabling the option and saving. This allows the system to refresh the connection and use the newly created Expressroute.”

So, what’s next for this MVP? A next-generation firewall (NGF) is about to be deployed and, accordingly, we will need user-defined routes (UDRs) and route tables added to Azure. When we create these we’ll need to make sure “Virtual network gateway route propagation” is Enabled; otherwise the ExpressRoute routes won’t be propagated to the subnets that use those route tables.
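Once route tables exist, the propagation setting is easy to audit in bulk. The sketch below assumes the route tables have been exported as dictionaries and that the setting appears as a disableBgpRoutePropagation flag, as I understand it does in the ARM resource definition; the function name and the table names are hypothetical.

```python
def tables_blocking_gateway_routes(route_tables):
    """Return names of route tables that have gateway route propagation disabled."""
    return [
        table["name"]
        for table in route_tables
        # A missing flag is treated as propagation enabled.
        if table.get("disableBgpRoutePropagation", False)
    ]


# Hypothetical route tables for the two spokes.
route_tables = [
    {"name": "rt-prod", "disableBgpRoutePropagation": False},
    {"name": "rt-nonprod", "disableBgpRoutePropagation": True},
]
print(tables_blocking_gateway_routes(route_tables))
```

Any table this returns would hide the ExpressRoute routes from its subnets, which would look a lot like the original peering problem from the VM's point of view.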

Problem solved! In this case, it was a relatively innocuous solution to a troubling problem.


Scott Brodie is an Azure architect at DXC Technology with two decades of experience in solving cloud and IT architecture solutions for clients.