Monitor Tripplite UPS on RHEL 7 via NUT


One of the UPSes in my home lab is a Tripp Lite 1500VA LCD. I wanted to be able to monitor/manage the UPS from RHEL/CentOS, however Tripp Lite no longer makes a Linux version of Power Alert Local. Instead I decided to use NUT (Network UPS Tools).

After connecting a USB cable between my RHEL server and the UPS, I needed to install usbutils (which provides lsusb) to verify that the UPS was detected properly.
# yum -y install usbutils

I was then able to verify connectivity

# lsusb | grep -i trip
Bus 003 Device 123: ID 09ae:2012 Tripp Lite

NUT can be found in the EPEL repo, which I needed to install first.

# wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# yum localinstall epel-release-latest-7.noarch.rpm

Then install NUT.

# yum -y install nut.x86_64

I then ran nut-scanner to detect the proper config for NUT.

[nutdev1]
driver = "usbhid-ups"
port = "auto"
vendorid = "09AE"
productid = "2012"
product = "Tripp Lite UPS"
vendor = "Tripp Lite"
bus = "003"

Use the output from the above command to populate /etc/ups/ups.conf. In the example below I only changed the name of the device.

[tripplite]
driver = "usbhid-ups"
port = "auto"
vendorid = "09AE"
productid = "2012"
product = "Tripp Lite UPS"
vendor = "Tripp Lite"
bus = "003"

I then started and enabled the nut-server service.

# systemctl start nut-server
# systemctl enable nut-server
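
Depending on how the packaging sets its defaults, the server may refuse to start until a NUT mode is configured. A minimal sketch of /etc/ups/nut.conf for a single-host setup like this one:

# /etc/ups/nut.conf - run the driver, upsd, and upsmon all on this one box
MODE=standalone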

I was then able to run upsc and query the UPS.

# upsc tripplite
battery.charge: 100
battery.runtime: 620
battery.type: PbAC
battery.voltage: 26.3
battery.voltage.nominal: 24.0
device.mfr: Tripp Lite
device.model: Tripp Lite UPS
device.type: ups
driver.name: usbhid-ups
driver.parameter.bus: 003
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.product: Tripp Lite UPS
driver.parameter.productid: 2012
driver.parameter.vendor: Tripp Lite
driver.parameter.vendorid: 09AE
driver.version: 2.7.2
driver.version.data: TrippLite HID 0.81
driver.version.internal: 0.38
input.frequency: 59.8
input.voltage: 112.2
input.voltage.nominal: 120
output.frequency.nominal: 60
output.voltage: 112.2
output.voltage.nominal: 120
ups.beeper.status: disabled
ups.delay.shutdown: 20
ups.load: 48
ups.mfr: Tripp Lite
ups.model: Tripp Lite UPS
ups.power: 0.0
ups.power.nominal: 1500
ups.productid: 2012
ups.status: OL
ups.test.result: Done and error
ups.timer.reboot: 65535
ups.timer.shutdown: 65535
ups.vendorid: 09ae
ups.watchdog.status: 0

Next I plan to explore nut-monitor, but for now I can at least query the UPS. Judging by the ups.test.result line above, the battery is apparently dead.
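
When I do get to upsmon, the wiring should be fairly small: a local user in /etc/ups/upsd.users and a MONITOR line in /etc/ups/upsmon.conf. A rough sketch, with the user name and password below being placeholders only:

# /etc/ups/upsd.users - hypothetical monitoring user
[monuser]
    password = changeme
    upsmon master

# /etc/ups/upsmon.conf - watch the UPS defined in ups.conf above
MONITOR tripplite@localhost 1 monuser changeme master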

Cockpit for Centos and RHEL 7: Install and Configure


Introduction

I recently purchased 3 Dell servers and set myself to the task of building out a new lab. My old lab was in desperate need of updating, as I had long passed the point where 48GB of memory per node was sufficient. The cost of memory, old or new, was not even close to being in line with the cheap server-grade CPUs that are perfect for lab servers. Today you can buy a used Xeon E7540, a low-power part with 12 threads (Hyper-Threading enabled), for less than $30 (USD) from a reputable retailer. Cram two of these into an 11th-generation Dell and you are in business.

So, three new (to me) Dell rackmounts, deployed as virtualization servers, and I want a simple way to view performance stats in a nice, clean, single pane of glass. I am not in any way, shape, or form looking to build fancy dashboards or set up any sort of historical monitoring. I just want to know where the performance hot spots are when my environment seems to be running slowly.

I had installed Cockpit before on a laptop or two and thought it might fit the bill, especially since you can use one dashboard for multiple nodes.

So here we are going to deploy Cockpit on all three nodes; the steps are the same on each.

Prerequisites

First we must open a firewall port on each node.

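Cockpit listens on TCP port 9090 by default, so on a firewalld-based RHEL 7 or CentOS 7 host the step looks roughly like this (a sketch; adjust the zone to your own setup):

# firewall-cmd --permanent --add-port=9090/tcp
# firewall-cmd --reload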

Centos7/Rhel7 – Collectd Config for Libvirt, Carbon-Cache


The following is the collectd config that I am running on my RHEL 7 KVM hypervisors. This is not meant to be an all-inclusive collectd config; rather, I am looking to gather basic performance metrics on my hypervisors and VMs.

Note that I have disabled SELinux, as I am running these hypervisors in my lab. Do not do the same in your production environments.


# setenforce 0

I also edited /etc/selinux/config as shown below. Again, this is for non-prod/test environments only. Do not disable SELinux in production.


# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of three values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected. 
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted 

Also note that I have already configured the EPEL repo. Again, not production.

The EPEL repo was installed via the command below.


rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

First, a couple of steps to get us up and running. We need to install the packages:


# yum -y install collectd collectd-virt.x86_64 collectd-rrdtool.x86_64 collectd-sensors.x86_64 collectd-smart.x86_64 collectd-netlink.x86_64 collectd-ipmi.x86_64

We also need to write our config.

You can pull my config using the gist below


# Custom collectd config
LoadPlugin "write_graphite"
<Plugin "write_graphite">
  <Node "example">
    Host "10.1.0.202"
    Port "2003"
    Prefix "collectd."
    #Postfix ""
    Protocol "tcp"
    #Protocol "udp"
    #LogSendErrors false
    EscapeCharacter "_"
    SeparateInstances true
    StoreRates false
    AlwaysAppendDS false
  </Node>
</Plugin>

LoadPlugin cpu
LoadPlugin load
LoadPlugin memory
LoadPlugin processes

LoadPlugin disk
<Plugin "disk">
  Disk "sda1"
  Disk "sdb1"
  Disk "sdc1"
  Disk "dm-0"
  Disk "dm-2"
  Disk "dm-3"
  IgnoreSelected true
</Plugin>

<LoadPlugin virt>
  Globals false
</LoadPlugin>
<Plugin "virt">
  Connection "qemu:///system"
  RefreshInterval 60
  Domain "dom0"
  BlockDevice "name:device"
  InterfaceDevice "name:interface"
  IgnoreSelected true
  HostnameFormat "name"
</Plugin>

LoadPlugin interface
<Plugin interface>
  Interface "lo"
  IgnoreSelected true
</Plugin>

Include "/etc/collectd.d"

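Before starting collectd it can be worth confirming that the carbon-cache host named in the write_graphite block (10.1.0.202, port 2003 in my config) is actually reachable over the plain-text carbon protocol. A quick throwaway test from the hypervisor, using a made-up metric name:

# echo "collectd.test.metric 1 $(date +%s)" | nc 10.1.0.202 2003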

We now need to enable and start collectd


# systemctl enable collectd
# systemctl start collectd
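
Once it is running, one quick way to confirm that the write_graphite plugin has opened its connection is to look for an established session to port 2003 (assuming the config above):

# ss -tnp | grep 2003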

 

HomeLab: Basic Syslog Configuration on Cisco Catalyst Devices

In my homelab setup I am dumping syslog from all my devices to my Linux desktop. I have not figured out what I am going to do with it yet, but I see myself setting up either Splunk or Graylog in the near future. Note, a while back I wrote a post on how to configure rsyslog on RHEL 6, so if you are interested you can find that post here.
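
For reference, the Linux box on the receiving end needs rsyslog listening on the network. A minimal sketch using the legacy rsyslog directives, assuming syslog arrives over the default UDP port 514:

# /etc/rsyslog.conf - enable the UDP listener
$ModLoad imudp
$UDPServerRun 514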

So let's get down to brass tacks and configure some freaking syslog.

In this instance we are configuring syslog redirection on a Cisco 3548XL switch. Note that we are in configure terminal mode.

First we must tell our device to timestamp its log messages.

s-3550-1(config)#service timestamps log datetime

Now we tell the device where to send the syslog messages

s-3550-1(config)#logging 192.168.0.195

Now we tell the device which log levels to send to the syslog server. In this instance I am sending warning-level messages and above. This is pretty verbose, but it's a home lab so I am not worried about a slew of log messages pounding my syslog server.

s-3550-1(config)#logging trap warning

For reference I am including the logging levels below.

Emergency: 0
Alert: 1
Critical: 2
Error: 3
Warning: 4
Notice: 5
Informational: 6
Debug: 7

Now let's review what we have done with the show logging command.

s-3550-1#show logging
Syslog logging: enabled (0 messages dropped, 0 flushes, 0 overruns)
    Console logging: level debugging, 13 messages logged
    Monitor logging: level debugging, 0 messages logged
    Buffer logging: level debugging, 13 messages logged
    File logging: disabled
    Trap logging: level warnings, 13 message lines logged
        Logging to 192.168.0.195, 0 message lines logged

 

Note that this procedure is exactly the same on my Cisco 2621 router.
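
If messages never show up on the Linux side, a quick sanity check is to watch for the syslog traffic arriving on the desktop (assuming the default UDP port 514; swap eth0 for whatever the receiving interface actually is):

# tcpdump -ni eth0 udp port 514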

 


Very High snmpd CPU Usage and 1000s of IP Addresses

For a while now we have been experiencing SNMP timeouts from our central monitoring server to a set of new and recently deployed servers. We at first attributed the issue to the network driver being a bit different from what we have used in the past (latest and greatest hardware), and then to the fact that we are completely gutting our production network in situ. Normalization of the driver across this application farm made no difference, and the issue began to get worse, spreading to just about all servers in this one specific service tier. As it was not impacting any other servers, it wasn't surprising that we could not find an issue at the network layer, so we had to keep digging.

In our environment this tier is unique for two reasons:

  1. Very high IO per server. On the order of 10,000 write ops to disk per second at peak load.
  2. Thousands of private IPs per system (added as ip rules – not virtual addresses on interface devices).

The first issue was in fact an issue, just not for SNMP. The snmpd process was getting lost in the noise generated by the high system load the disk IO was causing (lots of concurrent processes in this app). The team quickly addressed this and scaled the IO subsystem substantially to meet the growing workload. After this change the system load dropped from over 1000 to under 10 (1 minute average) during peak utilization on these servers. Unfortunately snmpd was still timing out. However, now with the noise of all those blocked processes out of the way, running top showed the snmpd process stuck at 100 percent CPU usage for very long periods of time. In fact, periods of less than 100 percent CPU were noted for their rarity.

That led us to difference number two. Recently we had been normalizing our system configurations, which resulted in the whole tier of afflicted servers having even more IPs assigned. OK great, the team thought – we know where that lives in net-snmp – the IP-MIB. Our assumption was that net-snmp should only refresh this MIB based on client requests, and for some reason our monitoring solution was requesting all the IPs from the IP-MIB portion. That would certainly explain the timeouts, we thought. Step one was to try the latest net-snmp just in case it was a bug in our old default 5.4 install. No change. So the team's next step in troubleshooting this was to essentially block access to the IP-MIB part of the OID tree.

We tried various versions of:

view systemview excluded ip.ipAddressIfIndex.ipv4
view systemview excluded ip.ipAddressType.ipv4
view systemview excluded .1.3.6.1.2.1.4.20.1

and so on, to no effect. Next we did the obvious test – block all client access. No change – snmpd was still consuming 100% of a CPU all the time. Now what?

Unfortunately we were not on Solaris, so no DTrace. However, strace did provide the clues needed to make progress. The strace of the snmpd process was almost entirely ioctl calls to the interface devices to fetch the IP addresses associated with them. These calls themselves were not so much the issue, as the response time of the ioctl calls was in microseconds. A continuous snmpget was executed against a system while the strace was generated, and the strace output was compared to the client experience. Client timeouts lined up exactly with when the ioctl "storm" would start. As the system time consumed by snmpd during this was tiny, the issue had to be in the user-land component of net-snmp.

One thing the strace showed was that snmpd was almost always determining all the IPs on each interface in what looked like a very tight loop. There were perhaps two to three seconds between each storm. In addition, each subsequent ioctl call involved more and more IPs per call – almost displaying O(n^2) behavior, as the list of IPs from the previous call was being listed again with only the addition of another IP. So now we knew snmpd was, for some reason, always rebuilding the .1.3.6.1.2.1.4 tree and doing it in a nasty way.

The first thing that jumped to mind was that somewhere the caching we assumed net-snmp would do on the IP table was misconfigured. We attacked nsCacheTable and its ilk from all directions. No luck and no change in behavior. At this point a more thorough web search found two similar postings describing high snmpd CPU usage: one for a system with a very large BGP routing table and another for a system with thousands of VLANs. Neither offered a solution, though the BGP posting indicated that sorting within snmpd was not very efficient. At this point your author started to change some of the cache constants within the net-snmp source code to see if the polling cycle on the interfaces would change. It did not.

Time to turn on debug:
The snmpd daemon that comes with net-snmp has a pretty thorough debug mode. So thorough, in fact, that we went to that step last, as in our environment full debug mode generates about 75MB of log data every five seconds (remember all those IP addresses). We had tried various documented methods to only enable debug on certain sub-modules like the IP-MIB, but we could not find an actual working command-line option. The documented examples on the net-snmp wiki pages did not work at all on our build for some reason. So we were forced to deal with the fire hose of logging everything.

With full debug on (-DALL), snmpd was started. Five seconds or so was all that was needed to generate a trace that would help us track down the issue. The net-snmp agent code is full of logging calls like the one below:

DEBUGMSGTL(("access:ipaddress:container", "processing %d interfaces\n", interfaces));

which will generate a message like:

trace: _netsnmp_ioctl_ipaddress_container_load_v4(): ip-mib/data_access/ipaddress_ioctl.c, 171:
access:ipaddress:container: processing 4 interfaces

This makes it very easy to find where in the code the message was generated and simplifies tracing the application (relatively – there are hundreds of modules). So at this point we did the tedious task of simply walking the log file in parallel with walking the code, focusing obviously on the interface and IP portions. Along the way numerous little hooks and debug messages were dropped into the code and snmpd recompiled. This helped us along the way to at least realizing that the real problem was much larger than what we wanted to tackle. Essentially, the table access and cache components just will not work when one has thousands of IP addresses. Best we could tell, snmpd is re-sorting the IP OID tree after each and every IP address it comes across. If you follow the snippets of debug output below you will see how it does a table compare after each and every IP on a small test system. On our production box with over 7000 IPs, snmpd is spending all its time doing index compares. The full debug snippet is available if anyone wants it.

Realizing there was no way for us to quickly and safely modify either the cache code (honestly, we couldn't figure out why the IP table was called as often as it was; even interface stats, .1.3.6.1.2.1.2.2, were being refreshed at a much higher rate than nsCacheTable would imply) or the table and index code, we decided to cut the monster off at the head – the call into the "access:ipaddress:container" portion of the code. Through a bit of debugging your author determined that the head of the monster was the _netsnmp_ioctl_ipaddress_container_load_v4 function in the ip-mib/data_access/ipaddress_ioctl.c source file. This function seemed to lead to the discovery of all the IP addresses associated with an interface, and hence all the sorting, indexing, and other madness we were experiencing.

Solution
The solution is neither pretty nor elegant and is a bit of the nuclear fly-swatter type. The function identified above loops over all the physical interfaces with IP addresses assigned. In our case that is the loopback and a set of bonded interfaces on various VLANs. Lucky for us, the thousands of IPs we have on each server are on a set of dedicated bonds on their own VLANs. These are completely separate from the base server VLAN and other "user experienced" addresses. In addition, there is nothing related to these addresses that we have ever used, or need to use, via SNMP. As we determined, our monitoring software didn't even know these IPs existed, nor did it ever really hit anything in the IP-MIB table. All it ever knew or cared about was the base IP assigned to each bonded interface.

So in our case, the obvious solution was to just not even process these bonded interfaces to determine what IP addresses were on them. We found some references in the code to what is essentially a black list, but we could not figure out how to use it. As such we did the next best thing – modified the loop in _netsnmp_ioctl_ipaddress_container_load_v4 to just skip a set of known bonds – essentially do something like:
[code language="c"]

if (strcmp(ifrp->ifr_name, "bond1.16") == 0) {
    DEBUGMSGTL(("access:ipaddress:skipping",
                " interface %d, %s\n", i, ifrp->ifr_name));
    continue;
}

[/code]

The full diff is below. Note the code was written so that it would be easy to generate a patch from our automation tools (i.e. if new bonds come and go) and to make it obvious to folks without much C experience what to change. Not super efficient, but good enough. After making this change snmpd CPU usage went from a continuous 100 percent to under 0.5 percent. Better yet – no more timeouts! All we lost was access to the portion of the IP table for the excluded interfaces. Something we didn't use anyway.

We still don't know why snmpd was determined to rebuild the IP-MIB continuously. As our use case is probably pretty rare, it is not surprising there are so few reports of this behavior.

[code language="diff"]
--- ipaddress_ioctl.c.new
+++ ipaddress_ioctl.c.orig
@@ -165,11 +165,6 @@
         DEBUGMSGTL(("access:ipaddress:container",
                     " interface %d, %s\n", i, ifrp->ifr_name));
-        /* Ops was here */
-
-        DEBUGMSGTL(("access:ipaddress:containerSPNew",
-                    " interface %d, %s\n", i, ifrp->ifr_name));
-
         if (AF_INET != ifrp->ifr_addr.sa_family) {
             DEBUGMSGTL(("access:ipaddress:container",
                         " skipping %s; non AF_INET family %d\n",
@@ -177,36 +172,6 @@
             continue;
         }
-        /* Working around issue with 6000 IPs on a host
-         * So we exclude known problem interfaces
-         */
-
-        if (strcmp(ifrp->ifr_name,"bond1.16") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        if (strcmp(ifrp->ifr_name,"bond1.20") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        if (strcmp(ifrp->ifr_name,"bond1.24") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        if (strcmp(ifrp->ifr_name,"bond1.28") == 0) {
-            DEBUGMSGTL(("access:ipaddress:skipping",
-                        " interface %d, %s\n", i, ifrp->ifr_name));
-            continue;
-        }
-
-        /* End Ops Hack */
-        /* */
         entry = netsnmp_access_ipaddress_entry_create();
[/code]

MegaRaid Cards Via CLI


MegaRAID® is LSI's line of SATA/SAS storage controllers.

MegaCLI

MegaCLI is the Linux console-based management utility for LSI SAS controllers. Honestly it's a pretty crummy tool when compared to HP's command-line utility, but that's often what you are stuck with when you buy Dell or Supermicro.

Note that I am running the 64-bit version of MegaCLI, which is installed in /opt/MegaRAID/MegaCli and is called MegaCli64. On 32-bit systems it's called MegaCli.

The command below will dump out a bunch of info, but if you look for the section labeled "Device Present" you can see failed/degraded drives. In this case I have one failed drive out of 4 total drives.

./MegaCli64 -AdpAllInfo -aALL

 Device Present
================
Virtual Drives    : 2
  Degraded        : 1
  Offline         : 0
Physical Devices  : 4
  Disks           : 3
  Critical Disks  : 0
  Failed Disks    : 0

For more specific disk information run the following command.

./MegaCli64 -LDPDInfo -aAll

Using the command above I can see more information on the drive with the failed submirror

Virtual Disk: 1 (Target Id: 1)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:59.125 GB
State: Degraded
Stripe Size: 64 KB
Number Of Drives:2
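
To dig into the physical drives themselves, MegaCli can also dump per-drive detail. The command below should list each physical disk along with its firmware state (Online, Failed, and so on); exact output varies by controller and firmware:

./MegaCli64 -PDList -aALL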

Megactl

According to SourceForge, Megactl "is a small collection of programs for examining configuration and status of LSI megaraid adapters, especially Dell PERC RAID adapters, and attached storage devices." Get it here.

In this case I am running megasasctl, which makes it a bit easier to see the failed drive. In the example below I can see two virtual disks (both RAID 1), but only 3 physical disks, which indicates that one of my submirrors has failed.

megactl-0.4.1]# ./megasasctl
a0       LSI MegaRAID SAS 9260-8i encl:1 ldrv:2  batt:FAULT, unknown charge state
a0d0       29GiB RAID 1   1×2  optimal
a0d1       59GiB RAID 1   1×2  DEGRADED
a0e252s0    29GiB  a0d0  online 
a0e252s1    29GiB  a0d0  online 
a0e252s2    59GiB  a0d1  online 

How to use SAR on Redhat

Sar is a system monitoring command used to display system activity. Sar is installed via the sysstat rpm. Use the command below to install sysstat.

[root@fedora ~]# yum -y install sysstat

Once installed, you can check out the sysstat config file (/etc/sysconfig/sysstat) and configure how long sar will keep your log files; on my system the default was 7 days. I changed this to 30 days.
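
In my case that just meant bumping the retention value in /etc/sysconfig/sysstat. The relevant line looks roughly like this (variable names can differ slightly between sysstat versions):

# /etc/sysconfig/sysstat
HISTORY=30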

The cron job for sar is located at /etc/cron.d/sysstat if you want to modify it as well.

Once installed and configured to your liking, you must ensure that it starts and runs at boot time. I accomplished this via the command below.

[root@fedora ~]# chkconfig sysstat on && service sysstat start

Once up and running, it will write its logs out to /var/log/sa, and you can read those files with sar -f <filename>, where filename is the name of the file that you want to read. Note that in the interactive examples below, 3 is the interval and 10 is the count.
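
For example, to review the CPU history recorded on the 10th of the month (assuming the default /var/log/sa layout and file naming):

sar -u -f /var/log/sa/sa10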

Additionally, you may run sar interactively; below are a few sample commands.

View disk i/o and transfer rate stats:

sar -b 3 10

View memory and swap space stats:

sar -r 3 10

View swapping stats:

sar -W 3 10

View network stats:

sar -n DEV 3 10

View CPU stats:

sar -P ALL 3 10