Without getting into the nitty-gritty details of the actual configurations that were applied, suffice it to say that spanning-tree loops had caused many bad things to happen, including the peer link being err-disabled because UDLD messages were dropped in the control plane, routing peer relationships dropping, and so on. The sequence of events was:
- spanning-tree loop created
- EIGRP peer relationships failed
- UDLD error disabled peer-link
- console access unavailable to primary vPC peer switch
- looping links physically disconnected
- primary peer switch powered down
Scenarios
All of these tests were performed with NX-OS version 5.1(3) and then again with NX-OS version 5.2(5). The results for 5.2(5) are included below, with notes about the 5.1(3) results where applicable.
The tests focus on how the vPC system behaves during both the failure and the recovery phases of a dual failure. I ran the following scenarios to show the customer what kind of outage to expect during the initial failures and during recovery. All reachability testing was a simple ping from the access-layer Nexus 5000 to the core layer through the vPC environment.
- Test 1: Fail peer keep-alive link, fail peer link, restore peer keep-alive link, restore peer link
- Test 2: Fail peer keep-alive link, fail peer link, restore peer link, restore peer keep-alive link
- Test 3: Fail peer link, fail peer keep-alive link, restore peer keep-alive link, restore peer link
- Test 4: Fail peer link, fail peer keep-alive link, restore peer link, restore peer keep-alive link
- Test 5: Enable auto-recovery, then repeat test 4
The tests performed can be grouped into 3 categories because the results for each test were the same within the groups:
- peer keep-alive first failures - tests 1 & 2
- peer link first failures - tests 3 & 4
- peer link first failures with auto-recovery - test 5
I will provide the results only for tests 1, 3, and 5.
Topology
Validate
First we need to validate that the vPC environment is healthy on both peers before starting.
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary, operational secondary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10
3-104,114-118,120,122,130-133,200
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success success 1,10-11,13-
14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
N7KA-AGG(config-vpc-domain)#
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10
3-104,114-118,120,122,130-133,200
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success success 1,10-11,13-
14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
N7KB-AGG(config-vpc-domain)#
Everything appears to be in order so let's get started with the tests.
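The validation above is done by eye, but the same check can be scripted. The following is a minimal Python sketch, assuming the status output has been captured as text in the "name : value" layout shown above. The function names `parse_vpc_status` and `is_healthy` are hypothetical helpers for illustration, not part of any Cisco tool, and the list of fields treated as "healthy" is simply the set that reads ok/alive/success in the output above.

```python
def parse_vpc_status(text):
    """Collect the 'name : value' header lines of the status output into a dict."""
    status = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # table rows, rulers, and prompts have no ' : ' and are skipped
            status[name.strip()] = value.strip()
    return status


def is_healthy(status):
    """True when the fields inspected in the validation step all look good."""
    return (
        status.get("Peer status") == "peer adjacency formed ok"
        and status.get("vPC keep-alive status") == "peer is alive"
        and status.get("Configuration consistency status") == "success"
        and status.get("Per-vlan consistency status") == "success"
        and status.get("Type-2 consistency status") == "success"
    )


# A few header lines copied from the validation output above.
sample = """\
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary, operational secondary
Auto-recovery status : Disabled
"""

status = parse_vpc_status(sample)
print(status["vPC role"], "| healthy:", is_healthy(status))
```

Running the same check against the output captured after a failure (for example "Peer status : peer link is down") would report the environment as unhealthy.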
Failures
The first failure is the peer keep-alive link. This link's purpose is to detect dual-active scenarios. The failure of this link alone does not impact the vPC environment. The message below is seen on both vPC peers.
N7KB-AGG(config-vpc-domain)# 2012 Oct 3 17:32:17.864 N7KB-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
The peer link is failed next. The output below shows that the peer link and peer keep-alive link are both down, but the vPCs are up on both the primary and the secondary. At this point there is no communication between the two peer switches.
vPC domain id : 1
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success success 1,10-11,13-
14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
N7KA-AGG(config-vpc-domain)#
vPC domain id : 1
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success success 1,10-11,13-
14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
N7KB-AGG(config-vpc-domain)#
Recovery
In test 1 the links were restored in this order: peer keep-alive, then peer link. Restoring the links under version 5.2(5) caused no interruption to traffic, as measured with a simple ping test from a downstream host.
NOTE: Under version 5.1(3) the downstream client experienced an outage of between 25 and 60 seconds.
Impact
Ping test results indicate that no pings were dropped during any of the failures or the restores.
---omitted---
64 bytes from 10.65.1.5: icmp_seq=5982 ttl=253 time=9.597 ms
64 bytes from 10.65.1.5: icmp_seq=5983 ttl=253 time=9.591 ms
64 bytes from 10.65.1.5: icmp_seq=5984 ttl=253 time=9.587 ms
64 bytes from 10.65.1.5: icmp_seq=5985 ttl=253 time=9.592 ms
64 bytes from 10.65.1.5: icmp_seq=5986 ttl=253 time=9.61 ms
64 bytes from 10.65.1.5: icmp_seq=5987 ttl=253 ^C
--- 10.65.1.5 ping statistics ---
6020 packets transmitted, 6020 packets received, 0.00% packet loss
round-trip min/avg/max = 0.685/9.218/71.57 ms
SB-N5K-A#
Test #3 - Fail peer link, fail peer keep-alive, restore peer keep-alive, restore peer link
Validate
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary, operational secondary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10
3-104,114-118,120,122,130-133,200
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success success 1,10-11,13-
14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
Failures
Tests 3 and 4 both fail the peer link first, then the peer keep-alive link. The vPC system handles the two failures differently in this order. In tests 1 and 2, the failure of the peer keep-alive link followed by the peer link resulted in a dual-active (split-brain) scenario where both sides remain active and forward traffic. In tests 3 and 4 the peer link is failed first, which results in the SVIs/vPCs being shut down on the secondary peer switch.
When the peer link is failed the following message is logged:
N7KA-AGG(config-vpc-domain)#
N7KA-AGG(config-vpc-domain)# 2012 Oct 3 21:41:00.672 N7KA-AGG %$ VDC-3 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary
vPC domain id : 1
Peer status : peer link is down
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary, operational secondary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 down success success -
The subsequent failure of the peer keep-alive link doesn't change the state of the vPC environment. In the current configuration, with auto-recovery disabled, the SVIs/vPCs on the secondary are left down even during a dual failure. The downside is that if you then lose the primary switch for any reason, all traffic will be black-holed until the vPC environment is restored.
---omitted---
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
---omitted---
Recovery
The vPCs and SVIs were not restored until both the peer link and the peer keep-alive link were restored.
Impact
Ping results show that only 2 requests were lost across all of the failures and restores.
64 bytes from 10.65.1.5: icmp_seq=4942 ttl=253 time=9.588 ms
64 bytes from 10.65.1.5: icmp_seq=4943 ttl=253 time=9.602 ms
64 bytes from 10.65.1.5: icmp_seq=4944 ttl=253 time=9.576 ms
64 bytes from 10.65.1.5: icmp_seq=4945 ttl=253 time=0.935 ms
64 bytes from 10.65.1.5: icmp_seq=4946 ttl=253 time=0.759 ms
64 bytes from 10.65.1.5: icmp_seq=4947 ttl=253^C
--- 10.65.1.5 ping statistics ---
4997 packets transmitted, 4995 packets received, 0.04% packet loss
round-trip min/avg/max = 0.655/9.22/68.86 ms
SB-N5K-A#
Test #5 - Fail peer link, fail peer keep-alive, restore peer link, restore peer keep-alive (with auto-recovery enabled, available in 5.2(1))
I performed the same steps in test 5 as in tests 3/4 to see how the vPC environment behaves with the auto-recovery feature enabled.
Validate
Here is the configuration, followed by validation that the environment is up and has the feature enabled.
vpc domain 1
  role priority 2000
  peer-keepalive destination 192.168.99.9 source 192.168.99.10 vrf vpc-keepalive
  auto-recovery
interface port-channel1
  vpc peer-link
interface port-channel8
  vpc 8
vPC domain id : 1
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up 1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10
3-104,114-118,120,122,130-133,200
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success success 1,10-11,13-
14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
N7KB-AGG(config-vpc-domain)#
Failures
When the peer link is failed first, the SVIs/vPCs are suspended on the vPC secondary switch and the following message is logged:
N7KA-AGG(config-vpc-domain)# 2012 Oct 3 21:49:35.730 N7KA-AGG %$ VDC-3 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary
The next step is to fail the peer keep-alive link and we see confirmation of this:
N7KA-AGG(config-vpc-domain)# 2012 Oct 3 21:50:34.931 N7KA-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
The main difference between test 4 and test 5 is that, with the auto-recovery feature enabled, a timer is started when the peer keep-alive link fails following the peer link failure. The log messages look like this:
2012 Oct 3 21:54:52.952 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_START: vPC restore, delay interface-vlan bringup timer started
2012 Oct 3 21:55:02.955 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_EXPIRED: vPC restore, delay interface-vlan bringup timer expired, reiniting interface-vlans
2012 Oct 3 21:55:02.958 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_START: vPC restore timer started to reinit vPCs
2012 Oct 3 21:55:32.961 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_EXPIRED: vPC restore timer expired, reiniting vPCs
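The timing of the recovery can be read straight off the log timestamps. The short Python sketch below computes the gaps between the messages quoted in this test. The variable names are my own, and it assumes all timestamps come from the same switch clock and that month names parse in an English locale.

```python
from datetime import datetime


def parse_ts(stamp):
    """Parse a syslog-style timestamp such as '2012 Oct 3 21:54:52.952'."""
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S.%f")


def gap_seconds(start, end):
    """Seconds elapsed between two log timestamps."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


# Timestamps copied from the log messages in this test.
keepalive_failed = "2012 Oct 3 21:50:34.931"   # PEER_KEEP_ALIVE_RECV_FAIL
svi_timer_start = "2012 Oct 3 21:54:52.952"    # VPC_DELAY_SVI_BUP_TIMER_START
svi_timer_expired = "2012 Oct 3 21:55:02.955"  # VPC_DELAY_SVI_BUP_TIMER_EXPIRED
vpc_timer_start = "2012 Oct 3 21:55:02.958"    # VPC_RESTORE_TIMER_START
vpc_timer_expired = "2012 Oct 3 21:55:32.961"  # VPC_RESTORE_TIMER_EXPIRED

wait_before_recovery = gap_seconds(keepalive_failed, svi_timer_start)
svi_delay = gap_seconds(svi_timer_start, svi_timer_expired)
vpc_restore = gap_seconds(vpc_timer_start, vpc_timer_expired)

print(f"keep-alive failure to first timer: {wait_before_recovery:.3f} s")
print(f"SVI bring-up delay timer:          {svi_delay:.3f} s")
print(f"vPC restore timer:                 {vpc_restore:.3f} s")
```

In this capture the SVI delay timer ran for about 10 seconds and the vPC restore timer for about 30 seconds. The first timer started roughly 258 seconds after the keep-alive failure was logged, which is in the same range as the 240-second auto-recovery timeout shown in the status output; reading the extra seconds as failure-detection time is my assumption, not something the logs state.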
Both the SVIs and the vPCs are brought back online on the secondary vPC peer during this dual failure. This allows traffic to keep flowing if you lose the primary switch as well after the dual failure.
The vPC environment on the secondary will be in this state:
vPC domain id : 1
Peer status : peer link is down
(peer-keepalive not operational,
peer never alive)
vPC keep-alive status : peer is not reachable through peer-keepalive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Enabled (timeout = 240 seconds)
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ --------------------------------------------------
1 Po1 up -
vPC status
----------------------------------------------------------------------
id Port Status Consistency Reason Active vlans
-- ---- ------ ----------- ------ ------------
8 Po8 up success Type checks were bypassed 1,10-11,13-
for the vPC 14,65-67,70
,84,86-88,9
2,94,96,98,
100,103-104 ....
N7KA-AGG(config-vpc-domain)#
Recovery
The peer keep-alive and peer link restore order is irrelevant as both are required to restore the system to normal operations.
Impact
Although not severe, recovery of the links with auto-recovery enabled was the most disruptive of the tests, with about 10-15 seconds of service interruption seen from the downstream device.
64 bytes from 10.65.1.5: icmp_seq=6335 ttl=253 time=9.599 ms
64 bytes from 10.65.1.5: icmp_seq=6336 ttl=253 time=9.596 ms
64 bytes from 10.65.1.5: icmp_seq=6337 ttl=253 time=9.591 ms
64 bytes from 10.65.1.5: icmp_seq=6338 ttl=253 time=9.603 ms
64 bytes from 10.65.1.5: icmp_seq=6339 ttl=253 t^C
--- 10.65.1.5 ping statistics ---
6349 packets transmitted, 6336 packets received, 0.20% packet loss
round-trip min/avg/max = 0.643/9.245/73.496 ms
SB-N5K-A#
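To compare the three tests side by side, the ping summary lines can be reduced to a count of lost packets. This is a small Python sketch using the summary lines reported above; `lost_packets` is a hypothetical helper name, and the regular expression assumes the summary wording shown in these captures.

```python
import re

# Matches the summary wording printed at the end of each ping capture above.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) packets received")


def lost_packets(summary_line):
    """Return (sent, lost) for one ping summary line."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError(f"not a ping summary line: {summary_line!r}")
    sent, received = map(int, match.groups())
    return sent, sent - received


# Summary lines copied from the three test results.
results = {
    "test 1": "6020 packets transmitted, 6020 packets received, 0.00% packet loss",
    "test 3": "4997 packets transmitted, 4995 packets received, 0.04% packet loss",
    "test 5": "6349 packets transmitted, 6336 packets received, 0.20% packet loss",
}

for name, line in results.items():
    sent, lost = lost_packets(line)
    print(f"{name}: {lost} of {sent} lost ({100 * lost / sent:.2f}%)")
```

This gives 0, 2, and 13 lost packets for tests 1, 3, and 5. If the pings were sent once per second (an assumption, since the ping interval is not shown in the output), 13 lost packets lines up with the 10-15 seconds of interruption observed in test 5.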
Summary
If the peer link and the peer keep-alive link fail simultaneously, or the peer keep-alive link fails and the peer link fails afterwards, you will have an active/active (dual-active) situation.
If the peer link fails first then the secondary peer switch will shut down the vPCs/SVIs to protect against a split brain situation.
If the peer link fails first, auto-recovery is enabled, and the peer keep-alive link then fails, the vPCs/SVIs will be brought back up following the expiration of the related timers.
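The ordering rules in this summary can be captured in a toy model. The Python sketch below is only a restatement of the results observed in these tests, not the actual NX-OS logic. The event names and the function `secondary_vpc_states` are invented for illustration, auto-recovery is treated as instantaneous (the real feature waits for its timers), and simultaneous failures and the brief interruption seen during recovery are not modelled.

```python
def secondary_vpc_states(events, auto_recovery=False):
    """Return the secondary peer's vPC/SVI state after each event.

    Toy model of the behaviour observed in these tests only.
    """
    peer_link_up = True
    keepalive_up = True
    suspended = False
    states = []
    for event in events:
        if event == "fail_peer_link":
            peer_link_up = False
            # The secondary suspends only while keep-alive still sees the primary.
            if keepalive_up:
                suspended = True
        elif event == "fail_keepalive":
            keepalive_up = False
            # With auto-recovery the suspended vPCs/SVIs come back (timers omitted).
            if suspended and auto_recovery:
                suspended = False
        elif event == "restore_peer_link":
            peer_link_up = True
        elif event == "restore_keepalive":
            keepalive_up = True
        else:
            raise ValueError(f"unknown event: {event}")
        # Normal operation resumes only once both links are back.
        if peer_link_up and keepalive_up:
            suspended = False
        states.append("suspended" if suspended else "forwarding")
    return states


TEST_1 = ["fail_keepalive", "fail_peer_link", "restore_keepalive", "restore_peer_link"]
TEST_3 = ["fail_peer_link", "fail_keepalive", "restore_keepalive", "restore_peer_link"]
TEST_5 = ["fail_peer_link", "fail_keepalive", "restore_peer_link", "restore_keepalive"]

print("test 1:", secondary_vpc_states(TEST_1))
print("test 3:", secondary_vpc_states(TEST_3))
print("test 5:", secondary_vpc_states(TEST_5, auto_recovery=True))
```

In this model test 1 never suspends the secondary (both sides stay active), test 3 keeps the secondary suspended until both links are back, and test 5 returns the secondary to forwarding as soon as the keep-alive link fails after the peer link.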
References and Further Reading
Auto-Recovery feature explanation
https://supportforums.cisco.com/docs/DOC-24939
vPC Design Guide
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/C07-572835-00_NX-OS_vPC_DG.pdf