HomeLab: Basic Syslog Configuration on Cisco Catalyst Devices

In my homelab setup I am dumping syslog from all my devices to my Linux desktop. I have not figured out what I am going to do with it yet, but I see myself setting up either Splunk or Graylog in the near future. Note, a while back I wrote a post on how to configure rsyslog on RHEL 6, so if you are interested you can find that post here.

So let's get down to brass tacks and configure some freaking syslog.

In this instance we are configuring syslog redirection on a Cisco 3548xl switch. Note that we are in global configuration mode (configure terminal).

First we must tell our device to insert timestamps on log messages

s-3550-1(config)#service timestamps log datetime

Now we tell the device where to send the syslog messages

s-3550-1(config)#logging 192.168.0.195

Now we tell the device which log levels to send to the syslog server. In this instance I am sending warning level messages and above, meaning severity 4 and anything more severe (levels 0 through 4). This is pretty verbose, but it's a home lab so I am not worried about a slew of log messages pounding my syslog server.

s-3550-1(config)#logging trap warning

For reference I am including the logging levels below.

Emergency: 0

Alert: 1

Critical: 2

Error: 3

Warning: 4

Notice: 5

Informational: 6

Debug: 7
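
The numbers above decide what "logging trap warning" forwards, and they also show up in the <PRI> value at the front of each syslog packet. Here is a minimal Python sketch of both ideas. It assumes the standard syslog encoding (PRI = facility * 8 + severity), and the function names are made up for illustration.

```python
# Severity names and numbers, as listed above (lower number = more severe).
SEVERITIES = {
    "emergency": 0,
    "alert": 1,
    "critical": 2,
    "error": 3,
    "warning": 4,
    "notice": 5,
    "informational": 6,
    "debug": 7,
}


def forwarded(trap_level):
    """Return the severity names sent when the trap level is `trap_level`.

    A message is forwarded when its number is less than or equal to the
    configured level, so "warning" forwards levels 0 through 4.
    """
    limit = SEVERITIES[trap_level]
    return [name for name, num in SEVERITIES.items() if num <= limit]


def decode_pri(pri):
    """Split a syslog PRI value into (facility, severity).

    Assumes the standard encoding: PRI = facility * 8 + severity.
    """
    return divmod(pri, 8)


if __name__ == "__main__":
    print(forwarded("warning"))
    # <188> would be facility 23 (local7) at severity 4 (warning).
    print(decode_pri(188))
```

On the receiving side, decode_pri is handy when eyeballing raw packets with tcpdump before rsyslog has sorted them into files.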

Now let's review what we have done with the show logging command

s-3550-1#show logging
Syslog logging: enabled (0 messages dropped, 0 flushes, 0 overruns)
    Console logging: level debugging, 13 messages logged
    Monitor logging: level debugging, 0 messages logged
    Buffer logging: level debugging, 13 messages logged
    File logging: disabled
    Trap logging: level warnings, 13 message lines logged
        Logging to 192.168.0.195, 0 message lines logged
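
If you end up checking this on more than a couple of devices, the output above is easy to pick apart with a script. The sketch below is only a rough Python example that assumes the output looks exactly like the sample; the function name is hypothetical and other IOS versions may format these lines differently.

```python
import re

# Sample taken from the show logging output above.
SAMPLE = """\
Syslog logging: enabled (0 messages dropped, 0 flushes, 0 overruns)
    Console logging: level debugging, 13 messages logged
    Monitor logging: level debugging, 0 messages logged
    Buffer logging: level debugging, 13 messages logged
    File logging: disabled
    Trap logging: level warnings, 13 message lines logged
        Logging to 192.168.0.195, 0 message lines logged
"""


def summarize(show_logging_text):
    """Return (trap_level, {host: message_lines_logged})."""
    trap = re.search(r"Trap logging: level (\w+)", show_logging_text)
    hosts = {
        host: int(count)
        for host, count in re.findall(
            r"Logging to ([0-9.]+), (\d+) message lines logged",
            show_logging_text,
        )
    }
    return (trap.group(1) if trap else None), hosts


if __name__ == "__main__":
    print(summarize(SAMPLE))
```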

 

Note that this procedure is exactly the same on my Cisco 2621 router.

 


Very High snmpd CPU Usage and 1000s of IP Addresses

For a while now we have been experiencing SNMP timeouts from our central monitoring server to a set of new and recently deployed servers. We at first attributed the issue to the network driver being a bit different than what we have used in the past (latest greatest hardware) and then to the fact that we are completely gutting our production network in situ. Normalization of the driver across this application farm made no difference and the issue began to get worse, spreading to just about all servers in this one specific service tier. As it was not impacting any other servers it wasn’t surprising that we could not find an issue at the network layer so we had to keep digging.

In our environment this tier is unique for two reasons:

  1. Very high IO per server. On the order of 10,000 write ops to disk per second at peak load.
  2. Thousands of private IPs per system (added as ip rules – not virtual addresses on interface devices).

The first issue was in fact an issue, just not for SNMP. The snmpd process was getting lost in the noise generated by the high system load the disk IO was causing (lots of concurrent processes in this app). The team quickly addressed this issue and scaled the IO subsystem substantially to meet the growing workload. After this change the system load dropped from over 1000 to under 10 (1 minute average) during peak utilization on these servers. Unfortunately snmpd was still timing out. However, now with the noise of all those blocked processes out of the way, running top showed the snmpd process stuck at 100 percent CPU usage for very long periods of time. In fact, periods of less than 100 percent CPU were noted for their rarity.

That led us to difference number two. Recently we had been normalizing our system configurations which resulted in the whole tier of afflicted servers having even more IPs assigned. Ok great the team thought – we know where that lives in net-snmp – the IP-MIB. Our assumption was that net-snmp should only refresh this MIB based on client requests and for some reason our monitoring solution was requesting all the IPs from the IP-MIB portion. That would certainly explain the timeouts we thought. Step one was to try the latest net-snmp just in case it was a bug in our old default 5.4 install. No change. So the team’s next troubleshooting step was to essentially block access to the IP-MIB part of the OID tree.

We tried various versions of:

view systemview excluded ip.ipAddressIfIndex.ipv4
view systemview excluded ip.ipAddressType.ipv4
view systemview excluded .1.3.6.1.2.1.4.20.1

and so on to no effect. Next we did the obvious test – block all client access. No change – snmpd was still consuming 100% of a cpu all the time. Now what?

Unfortunately we were not on Solaris so no DTrace. However, strace did provide the clues needed to make progress. The strace of the snmpd process was almost entirely ioctl calls to the interface devices to fetch the IP addresses associated with them. These calls themselves were not so much the issue, as the response time of each ioctl call was in microseconds. A continuous snmpget was executed against a system while the strace was generated and the strace output compared to client experience. Client timeouts lined up exactly with when the ioctl “storm” would start. As the system time consumed by snmpd during this was tiny, the issue had to be in the user-land component of net-snmp.

One thing the strace showed was that snmpd was almost always determining all the IPs on each interface in what looked like a very tight loop. There were perhaps two to three seconds between each storm. In addition, each subsequent ioctl call involved more and more IPs per call, almost displaying O(n^2) behavior, as the list of IPs in the previous call was then listed again with only the addition of another IP. So now we knew snmpd was, for some reason, always rebuilding the .1.3.6.1.2.1.4 tree and doing it in a nasty way.
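
To put a rough number on that pattern: if call k lists all k addresses seen so far, the total work is 1 + 2 + ... + n. The Python sketch below is just that arithmetic, not anything taken from net-snmp, and the sizes are illustrative (7000 is roughly the IP count on our production boxes).

```python
def entries_listed(n_ips):
    """Total addresses returned if call k re-lists all k addresses so far.

    1 + 2 + ... + n = n * (n + 1) / 2, which grows as O(n^2).
    """
    return n_ips * (n_ips + 1) // 2


if __name__ == "__main__":
    for n in (10, 1000, 7000):
        print(n, entries_listed(n))
```

At 7000 addresses that works out to about 24.5 million entries per pass, which lines up with a process that never leaves 100 percent CPU.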

The first thing that jumped to mind was that somewhere the caching we assumed Net-SNMP would do on the IP table was misconfigured. We attacked the nsCacheTable and its ilk from all directions. No luck and no change in behavior. At this point a more thorough web search found two similar postings describing high snmpd CPU usage: one for a system with a very large BGP routing table and another for a system with thousands of VLANs. Neither indicated a solution, though the BGP postings indicated that sorting within snmpd was not very efficient. At this point your author started to change some of the cache constants within the net-snmp source code to see if the polling cycle on the interfaces would change. It did not.

Time to turn on debug:
The snmpd daemon that comes with Net-SNMP has a pretty thorough debug mode. So thorough in fact we went to that step last as in our environment the full debug mode generates about 75MB of log data every five seconds (remember all those ip addresses). We had tried various documented methods to only have it run debug on certain sub-modules like the IP-MIB, but we could not find an actual working command line option. The documented examples on the net-snmp wiki pages did not work at all on our build for some reason. So we were forced to deal with the fire hose of logging everything.

With full debug on (-DALL) snmpd was started. Five seconds or so was all that was needed to generate a trace that would help us track down the issue. The net-snmp agent code is full of logging code like below:

DEBUGMSGTL(("access:ipaddress:container", "processing %d interfaces\n", interfaces));

which will generate a message like:

trace: _netsnmp_ioctl_ipaddress_container_load_v4(): ip-mib/data_access/ipaddress_ioctl.c, 171:
access:ipaddress:container: processing 4 interfaces

This makes it very easy to find where in the code the message was generated and simplifies tracing the application (relatively – there are 100s of modules). So at this point we did the tedious task of simply walking the log file in parallel with walking the code, focusing obviously on the interface and IP portions. Along the way numerous little hooks and debug messages were dropped into the code and snmpd recompiled. This helped us along the way to at least realizing the real problem was much larger than what we wanted to tackle. Essentially, the table access and cache components just will not work when one has 1000s of IP addresses. Best we could tell, snmpd is re-sorting the IP OID tree after each and every IP address it comes across. In the debug output from a small test system you can see how it does a table compare after each and every IP. On our production box with over 7000 IPs snmpd is spending all its time doing index compares. The full debug snippet is available if anyone wants it.
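
Walking 75MB of debug output by eye is painful, so a quick tally of which call sites log the most can narrow things down. This is a rough Python sketch that assumes every trace header looks like the "trace: function(): file, line:" example shown earlier; the function name is made up and this is not the tooling we actually used.

```python
import re
from collections import Counter

# Matches header lines such as:
# trace: _netsnmp_ioctl_ipaddress_container_load_v4(): ip-mib/data_access/ipaddress_ioctl.c, 171:
TRACE_RE = re.compile(r"^trace: (\w+)\(\): ([^,\s]+), (\d+):\s*$")


def tally_traces(lines):
    """Count how often each (function, file, line) call site logged."""
    counts = Counter()
    for line in lines:
        match = TRACE_RE.match(line)
        if match:
            func, path, lineno = match.groups()
            counts[(func, path, int(lineno))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "trace: _netsnmp_ioctl_ipaddress_container_load_v4(): "
        "ip-mib/data_access/ipaddress_ioctl.c, 171:",
        "access:ipaddress:container: processing 4 interfaces",
    ]
    for site, hits in tally_traces(sample).most_common(10):
        print(hits, site)
```

Pointing it at an open log file works the same way, since a file object is just an iterable of lines.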

Realizing there was no way for us to quickly and safely modify either the cache code [which, honestly, we couldn’t figure out: we never understood why the IP table was called as often as it was, and even interface stats (.1.3.6.1.2.1.2.2) were being refreshed at a much higher rate than nsCacheTable would imply] or the table and index code, we decided to cut the monster off at the head: the call into the “access:ipaddress:container” portion of the code. Through a bit of debugging your author determined the head of the monster was the _netsnmp_ioctl_ipaddress_container_load_v4 function in the ip-mib/data_access/ipaddress_ioctl.c source file. This function seemed to lead to the discovery of all the IP addresses associated with an interface and hence all the sorting, indexing, and other madness we were experiencing.

Solution
The solution is neither pretty nor elegant and is a bit of the nuclear fly-swatter type. The function identified above loops over all the physical interfaces with IP addresses assigned. In our case that is the loopback and a set of bonded interfaces on various VLANs. Lucky for us, the 1000s of IPs we have on each server are on a set of dedicated bonds on their own VLANs. These are completely separate from the base server VLAN and other “user experienced” addresses. In addition, there is nothing related to these addresses in SNMP that we have ever used or need to use via snmp. As we determined, our monitoring software didn’t even know these IPs existed nor really ever hit anything in the IP-MIB table. All it ever knew or cared about was the base IP assigned to each bonded interface.

So in our case, the obvious solution was to just not even process these bonded interfaces to determine what IP addresses were on them. We found some references in the code to what looked like a blacklist but we could not figure out how to use it. As such we did the next best thing – modified the loop in _netsnmp_ioctl_ipaddress_container_load_v4 to just skip a set of known bonds – essentially doing something like:


if (strcmp(ifrp->ifr_name,"bond1.16") == 0) {
    DEBUGMSGTL(("access:ipaddress:skipping",
                " interface %d, %s\n", i, ifrp->ifr_name));
    continue;
}

The full diff is below. Note the code was written so it would be easy to generate a patch based on our automation tools (i.e. if new bonds come and go) and to be obvious to folks without much C experience to know what to change. Not super efficient but good enough. After making this change snmpd CPU usage went from a continuous 100 percent to under 0.5 percent. Better yet – no more timeouts! All we lost was access to the portion of the IP table for the excluded interfaces. Something we didn’t use anyways.

We still don’t know why snmpd was determined to rebuild the ip-mib continuously. As our use case is probably pretty rare it is not surprising there are so few reports of this behavior.


--- ipaddress_ioctl.c.new
+++ ipaddress_ioctl.c.orig
@@ -165,11 +165,6 @@

         DEBUGMSGTL(("access:ipaddress:container",
                     " interface %d, %s\n", i, ifrp->ifr_name));
-        /* Ops was here */
-
-        DEBUGMSGTL(("access:ipaddress:containerSPNew",
-                    " interface %d, %s\n", i, ifrp->ifr_name));
-
         if (AF_INET != ifrp->ifr_addr.sa_family) {
             DEBUGMSGTL(("access:ipaddress:container",
                         " skipping %s; non AF_INET family %d\n",
@@ -177,36 +172,6 @@
             continue;
         }

-        /* Working around issue with 6000 IPs on a host
-         * So we exclude known problem interfaces
-         */
-
-        if (strcmp(ifrp->ifr_name,"bond1.16") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        if (strcmp(ifrp->ifr_name,"bond1.20") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        if (strcmp(ifrp->ifr_name,"bond1.24") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        if (strcmp(ifrp->ifr_name,"bond1.28") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        /* End Ops Hack */
-
         /*
          */
         entry = netsnmp_access_ipaddress_entry_create();
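
Because the patch repeats the same block per interface, it is simple to generate from a list of bond names. The Python sketch below shows one way that could look. It is not our actual automation, and the template, regex, and function names are invented for the example.

```python
import re

# One skip block per interface, mirroring the hand-written blocks in the patch.
SKIP_TEMPLATE = """\
        if (strcmp(ifrp->ifr_name,"{name}") == 0) {{
            DEBUGMSGTL(("access:ipaddress:skipping",
                        " interface %d, %s\\n", i, ifrp->ifr_name));
            continue;
        }}
"""

# Refuse names that could break out of the C string literal.
NAME_RE = re.compile(r"^[A-Za-z0-9_.:-]+$")


def skip_blocks(interfaces):
    """Return C source that skips each named interface in the load loop."""
    blocks = []
    for name in interfaces:
        if not NAME_RE.match(name):
            raise ValueError("unexpected interface name: %r" % name)
        blocks.append(SKIP_TEMPLATE.format(name=name))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(skip_blocks(["bond1.16", "bond1.20", "bond1.24", "bond1.28"]))
```

The output can be dropped between the two marker comments before rebuilding snmpd, so adding or retiring a bond is a one-line change to the list.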

MegaRaid Cards Via CLI

MegaRAID® is LSI’s line of SATA/SAS storage controllers.

MegaCLI

MegaCLI is the Linux console based management utility for LSI SAS controllers. Honestly it’s a pretty crummy command when compared to HP’s command line tool, but that’s often what you are stuck with when you buy Dell or Supermicro.

Note that I am running the 64 bit version of MegaCLI, which is installed in /opt/MegaRAID/MegaCli and is called MegaCli64. On 32 bit systems it’s called MegaCli.

The command below will dump out a bunch of info, but if you look for the section labeled “Device Present” you can see failed/degraded drives. In this case I have one failed drive out of 4 total drives.

./MegaCli64 -AdpAllInfo -aALL

 Device Present
================
Virtual Drives    : 2
  Degraded        : 1
  Offline         : 0
Physical Devices  : 4
  Disks           : 3
  Critical Disks  : 0
  Failed Disks    : 0
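
If you want to check this from cron rather than by eye, the section above is easy to turn into numbers. The Python sketch below assumes the section looks exactly like the sample; the function names and the list of "problem" counters are my own choices, not anything MegaCli defines.

```python
# Sample taken from the -AdpAllInfo output above.
SAMPLE = """\
 Device Present
================
Virtual Drives    : 2
  Degraded        : 1
  Offline         : 0
Physical Devices  : 4
  Disks           : 3
  Critical Disks  : 0
  Failed Disks    : 0
"""

# Counters treated as trouble when non-zero (an assumption for this sketch).
PROBLEM_KEYS = ("Degraded", "Offline", "Critical Disks", "Failed Disks")


def parse_device_present(text):
    """Return the 'Device Present' counters as a {name: int} dict."""
    counts = {}
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_section:
            in_section = stripped == "Device Present"
            continue
        if stripped and set(stripped) == {"="}:
            continue  # the ===== underline under the heading
        key, sep, value = stripped.partition(":")
        if not sep or not value.strip().isdigit():
            if counts:
                break  # first non-counter line after the counters ends the section
            continue
        counts[key.strip()] = int(value)
    return counts


def problems(counts):
    """Return the names of problem counters that are greater than zero."""
    return [key for key in PROBLEM_KEYS if counts.get(key, 0) > 0]


if __name__ == "__main__":
    print(problems(parse_device_present(SAMPLE)))
```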

For more specific disk information run the following command.

./MegaCli64 -LDPDInfo -aAll

Using the command above I can see more information on the drive with the failed submirror

Virtual Disk: 1 (Target Id: 1)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:59.125 GB
State: Degraded
Stripe Size: 64 KB
Number Of Drives:2

Megactl

According to SourceForge, Megactl “is a small collection of programs for examining configuration and status of LSI megaraid adapters, especially Dell PERC RAID adapters, and attached storage devices.” Get it here.

In this case I am running megasasctl, which makes it a bit easier to see the failed drive. In the example below I can see two virtual disks (both RAID 1), but only 3 physical disks, which indicates that one of my submirrors has failed.

megactl-0.4.1]# ./megasasctl
a0       LSI MegaRAID SAS 9260-8i encl:1 ldrv:2  batt:FAULT, unknown charge state
a0d0       29GiB RAID 1   1x2  optimal
a0d1       59GiB RAID 1   1x2  DEGRADED
a0e252s0    29GiB  a0d0  online 
a0e252s1    29GiB  a0d0  online 
a0e252s2    59GiB  a0d1  online 
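
The same "count the disks" check can be scripted. The Python sketch below assumes the column layout shown above and reads the "1x2" column as spans times disks per span. That reading is my assumption, and the function name is made up.

```python
import re

# Sample taken from the megasasctl output above.
SAMPLE = """\
a0       LSI MegaRAID SAS 9260-8i encl:1 ldrv:2  batt:FAULT, unknown charge state
a0d0       29GiB RAID 1   1x2  optimal
a0d1       59GiB RAID 1   1x2  DEGRADED
a0e252s0    29GiB  a0d0  online
a0e252s1    29GiB  a0d0  online
a0e252s2    59GiB  a0d1  online
"""

# Logical drive line: name, size, "RAID n", spans x disks, state.
LOGICAL_RE = re.compile(
    r"^(a\d+d\d+)\s+\S+\s+RAID\s+(\d+)\s+(\d+)[x×](\d+)\s+(\S+)"
)
# Physical disk line: name, size, owning logical drive, state.
PHYSICAL_RE = re.compile(r"^(a\d+e\d+s\d+)\s+\S+\s+(a\d+d\d+)\s+(\S+)")


def short_drives(text):
    """Return {logical_drive: (expected_disks, online_disks)} for short drives."""
    expected = {}
    online = {}
    for line in text.splitlines():
        match = LOGICAL_RE.match(line)
        if match:
            name, _level, spans, disks, _state = match.groups()
            expected[name] = int(spans) * int(disks)
            continue
        match = PHYSICAL_RE.match(line)
        if match:
            _disk, owner, state = match.groups()
            if state == "online":
                online[owner] = online.get(owner, 0) + 1
    return {
        name: (want, online.get(name, 0))
        for name, want in expected.items()
        if online.get(name, 0) < want
    }


if __name__ == "__main__":
    print(short_drives(SAMPLE))
```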

How to use SAR on Redhat

Sar is a system monitor command used to display system activity. Sar is installed via the sysstat rpm. Use the command below to install sysstat.

[root@fedora ~]# yum -y install sysstat

Once installed you can check out the sysstat config file (/etc/sysconfig/sysstat) and configure how long sar will keep your log files. On my system the default was 7 days; I changed this to 30 days.

The cron job for sar is located here (/etc/cron.d/sysstat) if you want to modify it as well.

Once installed and configured to your liking you must ensure that it starts and runs at boot time. I accomplished this via the command below.

[root@fedora ~]# chkconfig sysstat on && service sysstat start

Once up and running it will write its logs out to /var/log/sa, and you can read those files with the following command (sar -f <filename>, where filename is the name of the file that you want to read).

Additionally you may run sar interactively; below are a few sample commands. Note that in these examples, 3 is the interval in seconds and 10 is the count.

View disk i/o and transfer rate stats:

sar -b 3 10

View memory and swap space stats:

sar -r 3 10

View swapping stats:

sar -W 3 10

View network stats:

sar -n DEV 3 10

View CPU stats:

sar -P ALL 3 10
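
For reading back a particular day from /var/log/sa, a small helper can build the command for you. This Python sketch assumes the daily files are named sa01 through sa31 after the day of the month, which is how they were named on my system. The helper names are made up.

```python
import datetime
import subprocess

SA_DIR = "/var/log/sa"


def sa_file_for(day):
    """Return the assumed daily data file path for a datetime.date."""
    return "%s/sa%02d" % (SA_DIR, day.day)


def sar_command(day, *flags):
    """Build the argv for reading that day's file, e.g. with "-r" or "-b"."""
    return ["sar", *flags, "-f", sa_file_for(day)]


if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    command = sar_command(yesterday, "-r")
    print(" ".join(command))
    # Uncomment to actually run it on a box with sysstat installed:
    # subprocess.run(command, check=True)
```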

Enabling SNMP in ESXi 4.1 using the Remote CLI

Based on the fact that ESX 4.1 is the last major release of ESX, I decided that I would make myself familiar with managing ESXi hosts.  Since I monitor all my hosts via Zenoss, I figured that I needed to get snmp up and running first.

So I first went out and installed the remote CLI for ESX on my Ubuntu desktop. The rcli can be downloaded here. The remote CLI allows you to run administrative commands against ESX/ESXi systems. It’s available for Windows or Linux.

Configuring on ESXi 4.1, Licensed

First configure your community string, target, and port:

vicfg-snmp --server <ESXi_ip> -c <communityname> -p 161 -t <destination_host>@161/<communityname>

Then enable it using the command below:

vicfg-snmp --server <ESXi_ip> -E

Next verify your settings:

vicfg-snmp --server <ESXi_ip> -s

Now test your settings:

vicfg-snmp --server <ESXi_ip> -T

These settings are written out to /etc/vmware/snmp.xml. Sample file below.

/etc/vmware # cat snmp.xml
<config>
<snmpSettings>
<communities>public</communities>
<enable>true</enable>
<port>161</port>
<targets>10.1.xx.xx@161 public</targets>
</snmpSettings>
</config>

Configuring on ESXi 4.0, Free/Foundation

I had a couple of ESXi 4.0 free hosts to configure, but my attempts to configure them using the cli failed as the snmp settings via cli were read only. So the first thing that you need to do is enable the unsupported console. Instructions can be found here.

Once you are able to ssh to the ESXi box, you need to edit the following file by hand, /etc/vmware/snmp.xml. Use the sample file above as a template and modify your ip, port, and string as needed. I use vi to edit mine.
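
If you have several free hosts to do, you can generate the file instead of typing it into vi each time. The Python sketch below builds the same elements as the sample file above. The function name is made up, the address is a placeholder, and I have only assumed that ESXi accepts the file without extra whitespace between elements.

```python
import xml.etree.ElementTree as ET


def build_snmp_xml(community, target_host, port=161, enable=True):
    """Return snmp.xml content with the same elements as the sample file."""
    config = ET.Element("config")
    settings = ET.SubElement(config, "snmpSettings")
    ET.SubElement(settings, "communities").text = community
    ET.SubElement(settings, "enable").text = "true" if enable else "false"
    ET.SubElement(settings, "port").text = str(port)
    # Target format follows the sample: <host>@<port> <community>
    ET.SubElement(settings, "targets").text = "%s@%d %s" % (
        target_host,
        port,
        community,
    )
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address; substitute your monitoring host.
    print(build_snmp_xml("public", "192.0.2.10"))
```

Copy the result to /etc/vmware/snmp.xml on the host and carry on with the restart step.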

Then run the command below

services.sh restart

You can then verify your settings using the remote CLI by running the command below against your ESXi box.

vicfg-snmp --server <ESXi_ip> -T