Thursday, October 4, 2012

Nexus 7000 vPC Dual Failure Testing

I recently worked with a customer who experienced an issue in their data center that raised some questions about vPC failure and recovery scenarios. The root cause of the outage was a self-inflicted configuration issue, but the customer still wanted to know why the environment didn't recover and restore service "properly".

Without getting into the nitty-gritty details of the actual configurations that were applied, suffice it to say that spanning-tree loops caused many bad things to happen, including the peer link being err-disabled because UDLD messages were dropped in the control plane, routing peer relationships dropping, etc.  The sequence of events was like this:

  1. spanning-tree loop created
  2. EIGRP peer relationships failed
  3. UDLD error disabled peer-link
  4. console access unavailable to primary vPC peer switch
  5. looping links physically disconnected
  6. primary peer switch powered down

Scenarios

All of these tests were performed with NXOS version 5.1(3) and then again with NXOS version 5.2(5).  The results for 5.2(5) are included below, with notes about the 5.1(3) results where applicable.

The tests are focused on the behavior of both the failure and recovery of the vPC system for dual failures.  I performed the following failure testing scenarios to provide the customer with some knowledge about what type of outages they can expect during the initial failures and also during the recovery from the outages.  All reachability testing was performed with a simple ping test from the access layer Nexus 5000 to the Core layer through the vPC environment. 
  1. Fail peer keep-alive link, fail peer link, restore peer keep-alive link, restore peer link
  2. Fail peer keep-alive link, fail peer link, restore peer link, restore peer keep-alive link
  3. Fail peer link, fail peer keep-alive link, restore peer keep-alive link, restore peer link
  4. Fail peer link, fail peer keep-alive link, restore peer link, restore peer keep-alive link
  5. Enable auto-recovery, then repeat test 4
The tests performed can be grouped into 3 categories because the results for each test were the same within the groups: 
  1. peer keep-alive first failures - tests 1 & 2
  2. peer link first failures - tests 3 & 4 
  3. peer link first failures with auto-recovery - test 5
I will provide the results only for tests 1, 3, and 5, one from each group.

Topology




Tests

Test #1 - Fail peer keep-alive, fail peer link, restore peer keep-alive, restore peer link

Validate

First we need to validate all is good to start with.



vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary, operational secondary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KA-AGG(config-vpc-domain)# 



vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KB-AGG(config-vpc-domain)# 



Everything appears to be in order so let's get started with the tests.
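
If you want to script this sanity check rather than eyeball it, here is a minimal Python sketch. It assumes you have already captured the status output above as text; the function names parse_vpc_status and looks_healthy are hypothetical, and the three fields checked are just the ones I would look at first, not an official health definition.

```python
# Header block copied from the vPC status output shown above.
SAMPLE = """\
vPC domain id                     : 1
Peer status                       : peer adjacency formed ok
vPC keep-alive status             : peer is alive
Configuration consistency status  : success
Per-vlan consistency status       : success
Type-2 consistency status         : success
vPC role                          : primary, operational secondary
Auto-recovery status              : Disabled
"""


def parse_vpc_status(text):
    """Parse the 'key : value' header lines of the status output into a dict."""
    status = {}
    for line in text.splitlines():
        # The tables that follow the header are not key/value pairs.
        if line.strip().startswith("vPC Peer-link status"):
            break
        key, sep, value = line.partition(" : ")
        if sep and key.strip():
            status[key.strip()] = value.strip()
    return status


def looks_healthy(status):
    """Hypothetical check: peer adjacency, keep-alive and consistency are all good."""
    return (
        status.get("Peer status") == "peer adjacency formed ok"
        and status.get("vPC keep-alive status") == "peer is alive"
        and status.get("Configuration consistency status") == "success"
    )


print(looks_healthy(parse_vpc_status(SAMPLE)))
```

Running the same check against the output captured after the failures should return False, since the peer status and keep-alive status strings change.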

Failures

The first failure is the peer keep-alive link.  This link's purpose is to detect dual-active scenarios.  Failure of this link alone does not impact the vPC environment.  The message below is seen on both vPC peers.

N7KB-AGG(config-vpc-domain)# 2012 Oct  3 17:32:17.864 N7KB-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed

The second failure in this test is the peer link.  When this link fails after the peer keep-alive link has already failed, both peer switches keep their vPCs/SVIs active and continue to forward traffic.  The peer switches cannot distinguish multiple link failures from a complete loss of the peer switch, so to keep traffic moving the vPCs/SVIs are left intact.  This would also be the case for a simultaneous failure of the peer link and the peer keep-alive link.

The output below shows that the peer link and peer keep-alive link are both down, but the vPCs are up on both the primary and the secondary.  At this point there is no communication between the two peer switches.


vPC domain id                     : 1   
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    down   -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KA-AGG(config-vpc-domain)# 



vPC domain id                     : 1   
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    down   -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KB-AGG(config-vpc-domain)# 

Recovery

In test #1 the links were restored in the order peer keep-alive, then peer link.  Restoring the links under version 5.2(5) caused no interruption to traffic, as measured by a simple ping test from a downstream host.

NOTE:  Under version 5.1(3) the downstream client experienced an outage of between 25 and 60 seconds.

Impact

The ping test results indicate that no pings were dropped during any of the failures or during the restores.


---omitted---
64 bytes from 10.65.1.5: icmp_seq=5982 ttl=253 time=9.597 ms
64 bytes from 10.65.1.5: icmp_seq=5983 ttl=253 time=9.591 ms
64 bytes from 10.65.1.5: icmp_seq=5984 ttl=253 time=9.587 ms
64 bytes from 10.65.1.5: icmp_seq=5985 ttl=253 time=9.592 ms
64 bytes from 10.65.1.5: icmp_seq=5986 ttl=253 time=9.61 ms
64 bytes from 10.65.1.5: icmp_seq=5987 ttl=253 ^C
--- 10.65.1.5 ping statistics ---
6020 packets transmitted, 6020 packets received, 0.00% packet loss
round-trip min/avg/max = 0.685/9.218/71.57 ms
SB-N5K-A# 

Test #3 - Fail peer link, fail peer keep-alive, restore peer keep-alive, restore peer link

Validate


vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary, operational secondary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....


Failures

Tests 3 and 4 both fail the peer link first, then the peer keep-alive link.  How the vPC system handles the two failures is different in these cases.

In tests 1 and 2 the failure of the peer keep-alive link followed by the peer link resulted in a dual-active (split-brain) scenario where both sides remain active and forward traffic.  In tests 3 and 4 the peer link is failed first, which results in the SVIs/vPCs being shut down on the secondary peer switch.

When the peer link is failed the following message is logged:


N7KA-AGG(config-vpc-domain)# 
N7KA-AGG(config-vpc-domain)# 2012 Oct  3 21:41:00.672 N7KA-AGG %$ VDC-3 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary



vPC domain id                     : 1   
Peer status                       : peer link is down             
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary, operational secondary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    down   -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    down   success     success                    -        


The subsequent failure of the peer keep-alive link doesn't change the current state of the vPC environment.  In the current configuration, with auto-recovery disabled, the SVIs/vPCs on the secondary are left down even during a dual failure.  The downside is that if you then lose the primary switch for any reason, all traffic will be black-holed until the vPC environment is restored.


---omitted---
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
---omitted---

Recovery 

The vPCs and SVIs were not restored until both the peer link and the peer keep-alive link were restored.

Impact

Ping results show that only 2 requests were lost during all of the failures and restores.

64 bytes from 10.65.1.5: icmp_seq=4942 ttl=253 time=9.588 ms
64 bytes from 10.65.1.5: icmp_seq=4943 ttl=253 time=9.602 ms
64 bytes from 10.65.1.5: icmp_seq=4944 ttl=253 time=9.576 ms
64 bytes from 10.65.1.5: icmp_seq=4945 ttl=253 time=0.935 ms
64 bytes from 10.65.1.5: icmp_seq=4946 ttl=253 time=0.759 ms
64 bytes from 10.65.1.5: icmp_seq=4947 ttl=253^C
--- 10.65.1.5 ping statistics ---
4997 packets transmitted, 4995 packets received, 0.04% packet loss
round-trip min/avg/max = 0.655/9.22/68.86 ms
SB-N5K-A# 

Test #5 - Fail peer link, fail peer keep-alive, restore peer link, restore peer keep-alive (with auto-recovery enabled; the feature is available in 5.2(1))

I performed exactly the same steps in test 5 as in tests 3/4 to see how the vPC environment behaves with the auto-recovery feature enabled.

Validate

Here is the configuration and validation that the environment is up and has the feature enabled.

vpc domain 1
  role priority 2000
  peer-keepalive destination 192.168.99.9 source 192.168.99.10 vrf vpc-keepalive
  auto-recovery

interface port-channel1
  vpc peer-link

interface port-channel8
  vpc 8



vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Enabled (timeout = 240 seconds)

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KB-AGG(config-vpc-domain)# 

Failures

When the peer link is failed first the SVIs/vPCs are suspended on the vPC secondary switch and the following message is logged:

N7KA-AGG(config-vpc-domain)# 2012 Oct  3 21:49:35.730 N7KA-AGG %$ VDC-3 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary

The next step is to fail the peer keep-alive link and we see confirmation of this:


N7KA-AGG(config-vpc-domain)# 2012 Oct  3 21:50:34.931 N7KA-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed

The main difference between test 4 and test 5 is that with the auto-recovery feature enabled a timer is started when the peer keep-alive link fails following the peer link failure.  The log messages look like this:

2012 Oct  3 21:54:52.952 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_START: vPC restore, delay interface-vlan bringup timer started
2012 Oct  3 21:55:02.955 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_EXPIRED: vPC restore, delay interface-vlan bringup timer expired, reiniting interface-vlans
2012 Oct  3 21:55:02.958 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_START: vPC restore timer started to reinit vPCs
2012 Oct  3 21:55:32.961 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_EXPIRED: vPC restore timer expired, reiniting vPCs

Both the SVIs and the vPCs are brought back online on the secondary vPC peer during this dual failure.  This will allow traffic to continue to flow in the event that following the dual failure you lost the primary switch as well.  
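
To see how long each stage took, the timestamps in the log messages above can be subtracted from one another. The Python below is only a sketch: the log lines are copied from this test (with the CLI prompt stripped), while parse_events and gap are hypothetical helper names. It also assumes an English locale so that "Oct" parses.

```python
import re
from datetime import datetime

# Log lines copied from the test 5 output above (prompt removed).
LOG_LINES = [
    "2012 Oct  3 21:50:34.931 N7KA-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed",
    "2012 Oct  3 21:54:52.952 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_START: vPC restore, delay interface-vlan bringup timer started",
    "2012 Oct  3 21:55:02.955 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_EXPIRED: vPC restore, delay interface-vlan bringup timer expired, reiniting interface-vlans",
    "2012 Oct  3 21:55:02.958 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_START: vPC restore timer started to reinit vPCs",
    "2012 Oct  3 21:55:32.961 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_EXPIRED: vPC restore timer expired, reiniting vPCs",
]


def parse_events(lines):
    """Map each %VPC message mnemonic to the timestamp of its log line."""
    events = {}
    for line in lines:
        fields = line.split()
        match = re.search(r"%VPC-\d-(\w+):", line)
        if match is None or len(fields) < 4:
            continue
        # Re-join the date fields with single spaces so "Oct  3" parses cleanly.
        stamp = datetime.strptime(" ".join(fields[:4]), "%Y %b %d %H:%M:%S.%f")
        events[match.group(1)] = stamp
    return events


def gap(events, start, end):
    """Seconds elapsed between two logged events."""
    return (events[end] - events[start]).total_seconds()


events = parse_events(LOG_LINES)
print(gap(events, "VPC_DELAY_SVI_BUP_TIMER_START", "VPC_DELAY_SVI_BUP_TIMER_EXPIRED"))
print(gap(events, "VPC_RESTORE_TIMER_START", "VPC_RESTORE_TIMER_EXPIRED"))
print(gap(events, "PEER_KEEP_ALIVE_RECV_FAIL", "VPC_DELAY_SVI_BUP_TIMER_START"))
```

From these timestamps the SVI bring-up delay works out to about 10 seconds and the vPC restore timer to about 30 seconds. The gap between the keep-alive failure message and the first restore message is about 258 seconds, which is in the same neighborhood as the 240-second auto-recovery timeout shown in the status output, although I did not break that interval down any further.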


The vPC environment on the secondary will be in this state:


vPC domain id                     : 1   
Peer status                       : peer link is down             
                                  (peer-keepalive not operational,        
                                  peer never alive)                       
vPC keep-alive status             : peer is not reachable through peer-keepalive
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Enabled (timeout = 240 seconds)

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     Type checks were bypassed  1,10-11,13-     
                               for the vPC                14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KA-AGG(config-vpc-domain)# 

Recovery

The peer keep-alive and peer link restore order is irrelevant as both are required to restore the system to normal operations.

Impact

Although not extremely bad, the recovery of the links with auto-recovery enabled was the most impactful, with about 10-15 seconds of service interruption seen from the downstream device.


64 bytes from 10.65.1.5: icmp_seq=6335 ttl=253 time=9.599 ms
64 bytes from 10.65.1.5: icmp_seq=6336 ttl=253 time=9.596 ms
64 bytes from 10.65.1.5: icmp_seq=6337 ttl=253 time=9.591 ms
64 bytes from 10.65.1.5: icmp_seq=6338 ttl=253 time=9.603 ms
64 bytes from 10.65.1.5: icmp_seq=6339 ttl=253 t^C
--- 10.65.1.5 ping statistics ---
6349 packets transmitted, 6336 packets received, 0.20% packet loss
round-trip min/avg/max = 0.643/9.245/73.496 ms
SB-N5K-A# 
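
The interruption figures can be cross-checked against the ping summaries from the three tests. This is a rough Python sketch, and it rests on two assumptions that the output does not confirm: that the ping ran at one request per second, and that the lost requests were consecutive. The function names are hypothetical.

```python
import re

# Summary lines copied from the ping output of tests 1, 3 and 5.
SUMMARIES = {
    "test 1": "6020 packets transmitted, 6020 packets received, 0.00% packet loss",
    "test 3": "4997 packets transmitted, 4995 packets received, 0.04% packet loss",
    "test 5": "6349 packets transmitted, 6336 packets received, 0.20% packet loss",
}


def lost_packets(summary_line):
    """Return transmitted minus received from a ping summary line."""
    match = re.search(r"(\d+) packets transmitted, (\d+) packets received", summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    return int(match.group(1)) - int(match.group(2))


def estimated_outage_seconds(summary_line, interval_s=1.0):
    """Estimate the outage, assuming one request per interval_s and contiguous loss."""
    return lost_packets(summary_line) * interval_s


for name, line in SUMMARIES.items():
    print(name, lost_packets(line), estimated_outage_seconds(line))
```

Under those assumptions test 5 lost 13 requests, or roughly 13 seconds, which lines up with the 10-15 seconds observed, while test 3 lost 2 and test 1 lost none.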

Summary

If the peer link and the peer keep-alive link fail simultaneously, or the peer keep-alive link fails first and the peer link fails afterward, you will have an active/active situation.

If the peer link fails first then the secondary peer switch will shut down the vPCs/SVIs to protect against a split brain situation.  

If the peer link fails first, auto-recovery is enabled, then the peer keep-alive link fails, the vPCs/SVIs will be brought back up following the expiration of the related timers.
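
The three rules above can be condensed into a small lookup. The Python below is only a model of the behavior observed in these tests, not of how NX-OS implements it; the function name and the string labels are hypothetical.

```python
def secondary_vpc_state(first_failure, auto_recovery):
    """State of the vPCs/SVIs on the secondary peer once both links are down.

    first_failure: "keepalive", "peer-link" or "simultaneous" (labels are
    illustrative). auto_recovery: True if the feature is configured.
    """
    if first_failure in ("keepalive", "simultaneous"):
        # Tests 1 and 2: both peers keep forwarding (active/active).
        return "up (dual active)"
    if first_failure == "peer-link":
        if auto_recovery:
            # Test 5: brought back up once the restore timers expire.
            return "up after restore timers expire"
        # Tests 3 and 4: suspended until both links are restored.
        return "suspended"
    raise ValueError("unknown failure order: %r" % first_failure)


for order in ("keepalive", "peer-link", "simultaneous"):
    for auto in (False, True):
        print(order, auto, "->", secondary_vpc_state(order, auto))
```

Note that the simultaneous case comes from the expected behavior described earlier in this post rather than from one of the five tests.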

References and Further Reading

Auto-Recovery feature explanation
https://supportforums.cisco.com/docs/DOC-24939

vPC Design Guide
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/C07-572835-00_NX-OS_vPC_DG.pdf