Thursday, October 4, 2012

Nexus 7000 VPC Dual Failure Testing

I recently worked with a customer who experienced an issue in their data center that raised some questions about vPC failure and recovery scenarios. The root cause of the outage was a self-inflicted configuration issue, but the customer still wanted to know why the environment didn't recover and restore service "properly".

Without getting into the nitty-gritty details of the configurations that were applied, suffice it to say that spanning tree loops caused many bad things to happen, including the errdisable of the peer link due to UDLD messages being dropped in the control plane, routing protocol peer relationships being dropped, and so on. The sequence of events was:

  1. Spanning-tree loop created
  2. EIGRP peer relationships failed
  3. UDLD error-disabled the peer link
  4. Console access unavailable to the primary vPC peer switch
  5. Looping links physically disconnected
  6. Primary peer switch powered down

Scenarios

All of these tests were performed with NX-OS version 5.1(3) and then again with NX-OS version 5.2(5).  The results below are for 5.2(5), with notes about the 5.1(3) behavior where it differed.

The tests focus on the behavior of the vPC system during both the failure and the recovery phases of dual failures.  I performed the following failure scenarios to give the customer a sense of what type of outages to expect during the initial failures and during the recovery from them.  All reachability testing was performed with a simple ping from the access-layer Nexus 5000 through the vPC environment to the core layer.
  1. Fail peer keep-alive link, fail peer link, restore peer keep-alive, restore peer link
  2. Fail peer keep-alive link, fail peer link, restore peer link, restore peer keep-alive
  3. Fail peer link, fail peer keep-alive link, restore peer keep-alive link, restore peer link
  4. Fail peer link, fail peer keep-alive link, restore peer link, restore peer keep-alive
  5. Enabled auto-recovery, then repeated #4
The tests can be grouped into three categories because the results within each group were the same: 
  1. peer keep-alive fails first - tests 1 & 2
  2. peer link fails first - tests 3 & 4 
  3. peer link fails first with auto-recovery - test 5
I will provide results only for tests 1, 3, and 5.
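
For reference, the reachability test was nothing more than a continuous ping left running on the access-layer Nexus 5000 for the duration of each test. A minimal version, assuming the same target seen in the result captures below (10.65.1.5) and NX-OS's unlimited count option, would be:

SB-N5K-A# ping 10.65.1.5 count unlimited

Interrupting it with Ctrl-C produces the statistics summaries shown in the Impact sections.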

Topology




Tests

Test #1 - Fail peer keep-alive, fail peer link, restore peer keep-alive, restore peer link

Validate

First we need to validate that everything is healthy before starting. The output below is from show vpc on each of the two peers.



vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary, operational secondary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KA-AGG(config-vpc-domain)# 



vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KB-AGG(config-vpc-domain)# 



Everything appears to be in order so let's get started with the tests.

Failures

The first failure is the peer keep-alive link.  This link's purpose is to provide detection of dual-active scenarios.  The failure of this link alone does not impact the vPC environment.  The message below is seen on both vPC peers.

N7KB-AGG(config-vpc-domain)# 2012 Oct  3 17:32:17.864 N7KB-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed
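
If more detail is needed than the syslog message provides, the keep-alive state can also be checked directly on either peer at any point (output omitted here):

N7KB-AGG# show vpc peer-keepalive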

The second failure in this test is the peer link.  When the peer link fails after the peer keep-alive link has already failed, both peer switches keep their vPCs/SVIs active and continue forwarding.  The peer switches cannot distinguish multiple link failures from a complete loss of the peer switch, so to keep traffic moving the vPCs/SVIs are left intact.  The same behavior applies to a simultaneous failure of the peer link and the peer keep-alive link.

The output below shows that the peer link and peer keep-alive link are both down, but the vPCs are up on both the primary and the secondary.  At this point there is no communication between the two peer switches.


vPC domain id                     : 1   
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    down   -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KA-AGG(config-vpc-domain)# 



vPC domain id                     : 1   
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    down   -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KB-AGG(config-vpc-domain)# 

Recovery

In test 1 the links were restored in the order peer keep-alive, then peer link.  Restoring the links under version 5.2(5) showed no interruption to traffic, as measured by a simple ping test from a downstream host.

NOTE:  Under version 5.1(3) the downstream client experienced an outage of between 25 and 60 seconds.

Impact

The ping test results indicate that no pings were dropped during any of the failures or the restores.


---omitted---
64 bytes from 10.65.1.5: icmp_seq=5982 ttl=253 time=9.597 ms
64 bytes from 10.65.1.5: icmp_seq=5983 ttl=253 time=9.591 ms
64 bytes from 10.65.1.5: icmp_seq=5984 ttl=253 time=9.587 ms
64 bytes from 10.65.1.5: icmp_seq=5985 ttl=253 time=9.592 ms
64 bytes from 10.65.1.5: icmp_seq=5986 ttl=253 time=9.61 ms
64 bytes from 10.65.1.5: icmp_seq=5987 ttl=253 ^C
--- 10.65.1.5 ping statistics ---
6020 packets transmitted, 6020 packets received, 0.00% packet loss
round-trip min/avg/max = 0.685/9.218/71.57 ms
SB-N5K-A# 

Test #3 - Fail peer link, fail peer keep-alive, restore peer keep-alive, restore peer link

Validate


vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary, operational secondary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....


Failures

Tests 3 and 4 both fail the peer link first, then the peer keep-alive link.  The vPC system handles the failures differently in this order.

In tests 1 and 2, failing the peer keep-alive link followed by the peer link resulted in a dual-active (split-brain) scenario where both sides remained active and forwarding traffic.  In tests 3 and 4 the peer link fails first, which results in the SVIs/vPCs being shut down on the secondary peer switch.

When the peer link is failed the following message is logged:


N7KA-AGG(config-vpc-domain)# 
N7KA-AGG(config-vpc-domain)# 2012 Oct  3 21:41:00.672 N7KA-AGG %$ VDC-3 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary



vPC domain id                     : 1   
Peer status                       : peer link is down             
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary, operational secondary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    down   -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    down   success     success                    -        


The subsequent failure of the peer keep-alive link does not change the current state of the vPC environment.  With auto-recovery disabled, as in this configuration, the SVIs/vPCs are left down even through the dual failure.  The downside is that if you then lose the primary switch for any reason, all traffic will be black-holed until the vPC environment is restored.


---omitted---
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
---omitted---
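
Enabling auto-recovery, which is covered in test 5 below, is a single command under the vPC domain and is what protects against this black-hole scenario:

N7KA-AGG(config)# vpc domain 1
N7KA-AGG(config-vpc-domain)# auto-recovery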

Recovery 

The vPCs and SVIs were not restored until both the peer link and the peer keep-alive link were brought back up.

Impact

Ping results show that only 2 requests were lost during all of the failures and restores.

64 bytes from 10.65.1.5: icmp_seq=4942 ttl=253 time=9.588 ms
64 bytes from 10.65.1.5: icmp_seq=4943 ttl=253 time=9.602 ms
64 bytes from 10.65.1.5: icmp_seq=4944 ttl=253 time=9.576 ms
64 bytes from 10.65.1.5: icmp_seq=4945 ttl=253 time=0.935 ms
64 bytes from 10.65.1.5: icmp_seq=4946 ttl=253 time=0.759 ms
64 bytes from 10.65.1.5: icmp_seq=4947 ttl=253^C
--- 10.65.1.5 ping statistics ---
4997 packets transmitted, 4995 packets received, 0.04% packet loss
round-trip min/avg/max = 0.655/9.22/68.86 ms
SB-N5K-A# 

Test #5 - Fail peer link, fail peer keep-alive, restore peer link, restore peer keep-alive (with auto-recovery enabled; the feature is available as of NX-OS 5.2(1))

I performed the exact same steps as in tests 3 and 4 to see how the vPC environment behaves with the auto-recovery feature enabled.

Validate

Here is the configuration and validation that the environment is up and has the feature enabled.

vpc domain 1
  role priority 2000
  peer-keepalive destination 192.168.99.9 source 192.168.99.10 vrf vpc-keepalive
  auto-recovery

interface port-channel1
  vpc peer-link

interface port-channel8
  vpc 8



vPC domain id                     : 1   
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Enabled (timeout = 240 seconds)

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     1,10-11,13-14,65-67,70,84,86-88,92,94,96,98,100,10     
                   3-104,114-118,120,122,130-133,200                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     success                    1,10-11,13-     
                                                          14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KB-AGG(config-vpc-domain)# 

Failures

When the peer link is failed first the SVIs/vPCs are suspended on the vPC secondary switch and the following message is logged:

N7KA-AGG(config-vpc-domain)# 2012 Oct  3 21:49:35.730 N7KA-AGG %$ VDC-3 %$ %VPC-2-VPC_SUSP_ALL_VPC: Peer-link going down, suspending all vPCs on secondary

The next step is to fail the peer keep-alive link and we see confirmation of this:


N7KA-AGG(config-vpc-domain)# 2012 Oct  3 21:50:34.931 N7KA-AGG %$ VDC-3 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 1, VPC peer keep-alive receive has failed

The main difference between test 4 and test 5 is that with auto-recovery enabled, a timer is started when the peer keep-alive link fails following the peer link failure.  The log messages look like this: 

2012 Oct  3 21:54:52.952 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_START: vPC restore, delay interface-vlan bringup timer started
2012 Oct  3 21:55:02.955 N7KA-AGG %VPC-5-VPC_DELAY_SVI_BUP_TIMER_EXPIRED: vPC restore, delay interface-vlan bringup timer expired, reiniting interface-vlans
2012 Oct  3 21:55:02.958 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_START: vPC restore timer started to reinit vPCs
2012 Oct  3 21:55:32.961 N7KA-AGG %VPC-5-VPC_RESTORE_TIMER_EXPIRED: vPC restore timer expired, reiniting vPCs

Both the SVIs and the vPCs are brought back online on the secondary vPC peer during this dual failure.  This allows traffic to continue to flow even if the primary switch were subsequently lost as well.  


The vPC environment on the secondary will be in this state:


vPC domain id                     : 1   
Peer status                       : peer link is down             
                                  (peer-keepalive not operational,        
                                  peer never alive)                       
vPC keep-alive status             : peer is not reachable through peer-keepalive
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Enabled (timeout = 240 seconds)

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans    
--   ----   ------ --------------------------------------------------
1    Po1    up     -                                                      

vPC status
----------------------------------------------------------------------
id   Port   Status Consistency Reason                     Active vlans
--   ----   ------ ----------- ------                     ------------
8    Po8    up     success     Type checks were bypassed  1,10-11,13-     
                               for the vPC                14,65-67,70     
                                                          ,84,86-88,9     
                                                          2,94,96,98,     
                                                          100,103-104 ....
N7KA-AGG(config-vpc-domain)# 
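
As a side note, the 240-second timeout shown next to the auto-recovery status above is the restore delay timer. If needed, it can be tuned under the vPC domain with the reload-delay option (the supported range and default vary by NX-OS release, so verify on your version):

N7KA-AGG(config-vpc-domain)# auto-recovery reload-delay 240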

Recovery

The peer keep-alive and peer link restore order is irrelevant as both are required to restore the system to normal operations.

Impact

Although not severe, the recovery of the links with auto-recovery enabled was the most impactful, with roughly 10-15 seconds of service interruption experienced from the downstream device.


64 bytes from 10.65.1.5: icmp_seq=6335 ttl=253 time=9.599 ms
64 bytes from 10.65.1.5: icmp_seq=6336 ttl=253 time=9.596 ms
64 bytes from 10.65.1.5: icmp_seq=6337 ttl=253 time=9.591 ms
64 bytes from 10.65.1.5: icmp_seq=6338 ttl=253 time=9.603 ms
64 bytes from 10.65.1.5: icmp_seq=6339 ttl=253 t^C
--- 10.65.1.5 ping statistics ---
6349 packets transmitted, 6336 packets received, 0.20% packet loss
round-trip min/avg/max = 0.643/9.245/73.496 ms
SB-N5K-A# 

Summary

If the peer link and the peer keep-alive link fail simultaneously, or the peer keep-alive fails followed by a peer link failure, you will have an active/active situation.  

If the peer link fails first, the secondary peer switch will shut down the vPCs/SVIs to protect against a split-brain situation.  

If the peer link fails first, auto-recovery is enabled, and the peer keep-alive link then fails, the vPCs/SVIs will be brought back up after the related timers expire.

References and Further Reading

Auto-Recovery feature explanation
https://supportforums.cisco.com/docs/DOC-24939

vPC Design Guide
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/C07-572835-00_NX-OS_vPC_DG.pdf

Sunday, September 16, 2012

SNMPv3 Configuration Basics on IOS

I was asked to assemble the SNMPv3 configuration to be applied to our demo center. The requirement was that we use SNMPv3 with the authPriv security level. Configuring SNMP communities (version 2c) is very straightforward and also very well known. The concern I have heard about SNMPv3 is that it is difficult to configure. I am not convinced, so let's take a look.

A basic background...

SNMPv3 is based on the USM (User-based Security Model), meaning all security levels associated with the solution are based on the creation of an SNMP user in one form or another. OK, so what are the security levels and what do they mean?

  1. noAuthNoPriv - No authentication password is exchanged, and the communications between the agent and the server are not encrypted. The SNMP requests are authorized based on a simple username string match.
  2. authNoPriv - Password authentication is used, based on either an MD5 or SHA hash, but no encryption is used for communications between the devices.
  3. authPriv - Authentication is hash-based as in #2, but the communications between the agent and the server are also encrypted. Encrypting the traffic between the two nodes requires a crypto software image on the devices.
That seems pretty simple.
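
For quick reference, here is how the three levels map to the IOS group keyword and the snmpwalk switches used in the examples below:

Security level   IOS group keyword   snmpwalk switches
noAuthNoPriv     noauth              -u
authNoPriv       auth                -u, -A, -l, -a
authPriv         priv                -u, -A, -l, -a, -x, -X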

The SNMP user provides the mechanism for authenticating the session, while the SNMP group is used to control what the user can access.

SNMP views can be created to include or exclude various portions of the MIBs. These views are then associated with SNMP groups, and users are associated with groups.


The noAuthNoPriv security level is actually very similar to the communities we are already used to: it uses a simple string match, so the username is analogous to the community. You can restrict who can access the device with that username via access lists, just like communities, and control what they can access with customized SNMP views.
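
As a hypothetical illustration (the view name, group name, ACL number, and management host address below are made up for the example), restricting a v3 group to the system subtree and a single management station would look something like this:

access-list 99 permit 10.254.254.1
snmp-server view SYS-ONLY system included
snmp-server group restricted v3 noauth read SYS-ONLY access 99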

So how is v3 better? The enhancements are evident in the other security levels (authNoPriv and authPriv), which add a hash-based password exchange and, optionally, encryption of the communications.

The configurations for SNMPv3 are actually very simple and straightforward. I will demonstrate the basic configurations and also how you can easily validate what you have done to ensure everything works as intended.

Environment

I will use a simple single router in GNS3, connected to my local machine via a loopback interface. Validation tests will use a simple SNMP get request via snmpwalk against the OID .1.3.6.1.4.1.9.3.6.5.0, which returns the IOS version information for the router.

Basic Network Configuration

Note: Syntax will vary from platform to platform, but the concepts remain the same.


Let's Go

noAuthNoPriv

Steps:

  1. Create an SNMPv3 Group
  2. Create an SNMPv3 User
  3. Validate noAuthNoPriv
Here are the router commands that I entered:

snmp-server group noAuthNoPriv v3 noauth

snmp-server user noAuthNoPriv noAuthNoPriv v3

The group parameters used specify the name of the group (noAuthNoPriv), the security model (v3) and the security level (noauth).

The user command then specifies the username (noAuthNoPriv), the group name to associate this user with (noAuthNoPriv) and the security model (v3).
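
Before testing from the management station, you can confirm the router accepted the configuration with the standard IOS show commands (output omitted):

show snmp user
show snmp group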

Here is the snmpwalk command used for validation:

snmpwalk -v3 -u noAuthNoPriv 10.254.254.253 .1.3.6.1.4.1.9.3.6.5.0



Command line switches
-u = username

authNoPriv

  1. Create an SNMPv3 Group
  2. Create an SNMPv3 User
  3. Validate authNoPriv
Here are the commands that I entered on the router:

snmp-server group authNoPriv v3 auth
snmp-server user authNoPriv authNoPriv v3 auth md5 test1234


The group command here has the same command-line parameters as the first example, but we have changed from noauth to auth. This means that users associated with this group will be required to authenticate before accessing the permitted MIBs.

The additional options on the user command specify MD5 as the authentication protocol and test1234 as the password. Passwords of fewer than 8 characters will be rejected during the SNMP request, although the command itself would be accepted by the CLI.

Here is the snmpwalk validation command that I will use to test:
snmpwalk -v3 -u authNoPriv -A test1234 -l authNoPriv -a MD5 10.254.254.253 .1.3.6.1.4.1.9.3.6.5.0

Command line switches
-u = username
-A = password
-l = security level
-a = authentication protocol



Success!
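
As an aside, SHA can be substituted for MD5 wherever the image supports it; only the keyword changes on both ends. A hypothetical variant of the same user would be:

snmp-server user authNoPrivSha authNoPriv v3 auth sha test1234

with -a SHA on the snmpwalk command in place of -a MD5.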

authPriv

  1. Create an SNMPv3 Group
  2. Create an SNMPv3 User
  3. Validate authPriv
Here are the router commands that were entered:

snmp-server group authPriv v3 priv

snmp-server user authPriv authPriv v3 auth md5 test1234 priv des56 test1234

This final example uses the most secure security model and level. The group command option, priv, indicates the authPriv security level.

The new options on the user command configure this user for the priv security level with des56 encryption and a passphrase of test1234. This additional configuration is what enables encryption of the messages between the SNMP agent and the management station.

Here is the snmpwalk command used for validation:


snmpwalk -v3 -u authPriv -A test1234 -l authPriv -a MD5 -x DES -X test1234 10.254.254.253 .1.3.6.1.4.1.9.3.6.5.0

Command line switches


-u = username
-A = password
-l = security level
-a = authentication protocol
-x = encryption protocol
-X = privacy passphrase used for encryption



Success!
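
One last variation worth mentioning: on images that support it, AES is a stronger choice than DES for the privacy protocol. A hypothetical equivalent user and validation command would look like this:

snmp-server user authPrivAes authPriv v3 auth sha test1234 priv aes 128 test1234

snmpwalk -v3 -u authPrivAes -A test1234 -l authPriv -a SHA -x AES -X test1234 10.254.254.253 .1.3.6.1.4.1.9.3.6.5.0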

Conclusion

Configuring SNMPv3 instead of SNMPv2c is highly recommended due to the increased security capabilities. Since the configurations are very straightforward, there is no reason not to.