TCAM over-utilization on the core

Started by config t, October 02, 2015, 05:14:20 AM


config t

One of the networks I'm doing O&M on has a pair of 6500-series switches as the core. It doesn't happen all the time, but once or twice a week SNMP traps for TCAM over-utilization will be screaming at us all day. Nobody seems to care since it hasn't had an effect on services (that attitude is prevalent around here), but I want to solve it for my own peace of mind.

The utilization is 99%. Any suggestions on where I should start looking? I'm going to hit the Google machine here in a minute, but I figured I would ping you guys too.
:matrix:

Please don't mistake my experience for intelligence.

wintermute000


Reggle

What's the MAC table size? VRFs? Routing table size? Do you have multicast routing? Although something like IPv6 MLD indeed sounds more likely.

icecream-guy

try this

--snip--


event manager applet cpu_stats

event snmp oid "1.3.6.1.4.1.9.9.109.1.1.1.1.3.1" get-type exact entry-op gt entry-val "70"

exit-op lt exit-val "50" poll-interval 5

action 1.01 syslog msg "------HIGH CPU DETECTED----, CPU:$_snmp_oid_val %"

action 1.02 cli command "enable"

action 1.03 cli command "show clock | append disk0:cpu_stats"

action 1.04 cli command "show proc cpu sort | append disk0:cpu_stats"

action 1.05 cli command "show proc cpu | exc 0.00% | append disk0:cpu_stats"

action 1.06 cli command "show proc cpu history | append disk0:cpu_stats"

action 1.07 cli command "show logging | append disk0:cpu_stats "

action 1.08 cli command "show spanning-tree detail | in ieee|occurr|from|is exec | append disk0:cpu_stats"

action 1.09 cli command "debug netdr cap rx | append disk0:cpu_stats"

action 1.10 cli command "show netdr cap | append disk0:cpu_stats"

action 1.11 cli command "undebug all"

!
** The EEM script will fire when CPU goes above 70% and will not refire until CPU drops back under 50%.

You can tweak entry-val and exit-val to your taste.
:professorcat:

My Moral Fibers have been cut.

icecream-guy

Quote from: ristau5741 on October 02, 2015, 07:15:50 AM
try this --snip-- (EEM applet snipped)

crap, this is for CPU utilization, not TCAM utilization....

You could probably use the SNMP Object Navigator to find the TCAM OID (1.3.6.1.4.1.9.9.97.1.9.1.1.1?), then update the CLI commands to reflect what you want to see.

P.S. You should probably open a TAC case for this, with 'show tech' output captured while the issue is happening.
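Something like this might work as a starting point for the TCAM version. The OID, thresholds, and polling interval here are guesses you'd want to verify against your box (Object Navigator again, and confirm the OID actually returns a percentage), and the show commands are just swapped for TCAM-relevant ones:

```
event manager applet tcam_stats
 event snmp oid "1.3.6.1.4.1.9.9.97.1.9.1.1.1" get-type exact entry-op gt entry-val "90" exit-op lt exit-val "80" poll-interval 60
 action 1.01 syslog msg "------HIGH TCAM DETECTED----, TCAM:$_snmp_oid_val %"
 action 1.02 cli command "enable"
 action 1.03 cli command "show clock | append disk0:tcam_stats"
 action 1.04 cli command "show platform hardware capacity forwarding | append disk0:tcam_stats"
 action 1.05 cli command "show mls cef summary | append disk0:tcam_stats"
```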
:professorcat:

My Moral Fibers have been cut.

config t

Sorry for the late response.

This hasn't happened again since the 2nd, and we are short-handed, so I have been slammed with the normal netops activities. It's hard sometimes being a pillar of networking might.

We aren't running any IPv6, so that is out.. and no multicast that I am aware of. When it happens again I will do as ristau said and open a TAC case so someone smarter than me can look at it :D

Quote from: ristau5741 on October 02, 2015, 07:15:50 AM
try this --snip-- (EEM applet snipped)

This is some network ninja stuff right here.. I'm going to play with this later.
:matrix:

Please don't mistake my experience for intelligence.

wintermute000

Just because you're not running IPv6 doesn't mean a PC or NIC isn't. Google "IPv6 MLD flooding" for a nasty example involving buggy Intel NIC drivers that I have personally seen in the wild.

config t

Quote from: wintermute000 on October 08, 2015, 02:01:18 AM
Just because you're not running IPv6 doesn't mean a PC or NIC isn't. Google "IPv6 MLD flooding" for a nasty example involving buggy Intel NIC drivers that I have personally seen in the wild.

:challenge-accepted:

Quote from: Reggle on October 02, 2015, 07:03:16 AM
Whats the mac stable size? VRFs? Routing table size? Do you have multicast routing? Although something like IPv6 MLD indeed sounds more likely.

Do you mean the max size? Or how many MAC addresses and routes are currently in the tables?
:matrix:

Please don't mistake my experience for intelligence.

routerdork

Quote from: config t on October 08, 2015, 02:46:00 AM
Do you mean the max size? Or how many mac addresses and routes are currently in the tables?
He means this. Those take up TCAM space. If, for example, you take full BGP routes on a 6500/7600, you can run into TCAM issues if your supervisor can't handle that many routes, or if it can but hasn't been adjusted to accept a larger amount. I'll try to find the command to check; can't remember it right off.
"The thing about quotes on the internet is that you cannot confirm their validity." -Abraham Lincoln

routerdork

OK, I remembered it better than I thought I would once I was on the CLI.

6509#show platform hardware capacity forwarding
L2 Forwarding Resources
           MAC Table usage:   Module  Collisions  Total       Used       %Used
                              1                0  98304       1071          1%
                              2                0  98304       1053          1%
                              3                0  98304       1053          1%
                              5                0  65536       1058          2%

             VPN CAM usage:                       Total       Used       %Used
                                                    512          0          0%
L3 Forwarding Resources
             FIB TCAM usage:                     Total        Used       %Used
                  72 bits (IPv4, MPLS, EoM)     196608        6189          3%
                 144 bits (IP mcast, IPv6)       32768          65          1%

                     detail:      Protocol                    Used       %Used
                                  IPv4                        6187          3%
                                  MPLS                           1          1%
                                  EoM                            1          1%

                                  IPv6                           1          1%
                                  IPv4 mcast                    61          1%
                                  IPv6 mcast                     3          1%

            Adjacency usage:                     Total        Used       %Used
                                               1048576         978          1%

     Forwarding engine load:
                     Module       pps   peak-pps                     peak-time
                     1           2482     420560  06:55:17 EDT Sat Jul 18 2015
                     2           6532     333333  16:54:00 EDT Tue Sep 8 2015
                     3            542      37735  12:41:00 EDT Tue Sep 22 2015
                     5           4350     188235  01:30:54 EDT Tue Jun 16 2015

6509#show mod
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE      blahblahblah
  2    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE      blahblahblah
  3    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE      blahblahblah
  5    2  Supervisor Engine 720 (Active)         WS-SUP720-3B       blahblahblah
  9   48  SFM-capable 48 port 10/100/1000mb RJ45 WS-X6548-GE-45AF   blahblahblah
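If the FIB TCAM itself were the one filling up, the knob to repartition it on a Sup720 is below. Sketch only -- the exact keywords and maximums vary by supervisor and IOS version, and the change doesn't take effect until a reload:

```
! check the current FIB TCAM partitioning
6509# show mls cef maximum-routes
! example: reserve more of the TCAM for IPv4 (value is in thousands of routes)
6509(config)# mls cef maximum-routes ip 192
! a reload is required before the new partitioning applies
```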
"The thing about quotes on the internet is that you cannot confirm their validity." -Abraham Lincoln

config t

Quote from: routerdork on October 08, 2015, 08:36:41 AM
Ok I remembered it better than I thought I would once I was on the CLI. --snip-- (command output snipped)


Sweet. Those are good commands to know.. Google wasn't being helpful when I tried to find them earlier today.

On the subject of IPv6 MLD flooding.. I researched it as wintermute suggested and reading this article got me thinking..
http://packetpushers.net/good-nics-bad-things-blast-ipv6-multicast-listener-discovery-queries/

If MLD is the cause, I think the topology of the network that experienced the high utilization may have mitigated the problem to the point where it didn't cause an outage. It is set up to tunnel through the normal production network over line encryptors (TACLANEs), meaning the Layer 2 domains are broken up and isolated to single buildings, with the exception of about 10 switches that have direct connections to the VSS pair. The fact that I have only seen it on rare occasions could be because any new machines put on the network that contributed would eventually have had IPv6 turned off through a Group Policy update.

Does that make any sense? I won't be able to prove it until it happens again but I feel like I'm on to something.
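In the meantime I might just poll the TCAM OIDs myself and flag threshold crossings off-box. A minimal sketch -- the resource names and numbers below are made up for illustration, and in practice the dict would be populated from snmpget or the NMS poller:

```python
# Minimal sketch: flag TCAM utilization readings that cross a threshold.
# Resource names/values here are illustrative; verify the real OIDs for
# your supervisor with Cisco's SNMP Object Navigator before relying on this.

def check_tcam(readings, threshold=90):
    """readings: dict mapping a resource name to its %used (0-100).
    Returns (name, pct) pairs at or above threshold, sorted by name."""
    return [(name, pct) for name, pct in sorted(readings.items())
            if pct >= threshold]

# In practice you'd fill this from snmpget or your NMS; made-up numbers here.
sample = {"fib-tcam-72bit": 99, "fib-tcam-144bit": 12, "acl-tcam": 40}
for name, pct in check_tcam(sample):
    print(f"ALERT {name} at {pct}%")  # prints: ALERT fib-tcam-72bit at 99%
```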
:matrix:

Please don't mistake my experience for intelligence.

routerdork

Quote from: config t on October 08, 2015, 11:49:43 AM
Sweet. Those are good commands to know.. google wasn't being helpful when I tried to find them earlier today.
I tried to keep a blog going for obscure things like this that I wanted to remember, and then found that I don't take the time to blog much, so it's never been put back online after my last upgrade.
"The thing about quotes on the internet is that you cannot confirm their validity." -Abraham Lincoln

SimonV

Quote from: config t on October 08, 2015, 11:49:43 AM
On the subject of IPv6 MLD flooding.. I researched it as wintermute suggested and reading this article got me thinking..
http://packetpushers.net/good-nics-bad-things-blast-ipv6-multicast-listener-discovery-queries/

Funny, I was called out to a business a month or two ago for that exact same issue. Their wireless scanners in production were getting kicked off the network, and when I arrived I saw their access point LEDs going crazy. I suspected a broadcast storm, but it was all multicast, around 100 Mbps, coming from one PC with that Intel I217-LM NIC. Found it by the MAC address; that was an interesting problem :)

config t

Quote from: routerdork on October 08, 2015, 02:24:09 PM
I tried to keep a blog going for obscure things like this that I wanted to remember and then found that I don't take the time to blog much so it's never been put back online after my last upgrade.

We ought to have a sticky in the R&S forum for t-shooting commands.
:matrix:

Please don't mistake my experience for intelligence.