HomeLab: How to Resolve Supermicro X8DTI Fan Revving Issues

Supermicro-Hyper-Speed-6027AX-TRF-Front-Straight

It's winter here in the Atlanta area, and the temperatures have been dropping down close to freezing. Likewise, the temperatures in my basement homelab have been dipping below 66 degrees Fahrenheit, and apparently this makes my Supermicro servers a bit unhappy.

A bit of background. When I set out to build my lab, I decided to transfer my Supermicro X8DTI boards out of their stock rack mount enclosures (like what you see above) and into standard EATX towers. Running my boards in these towers allowed me to use large 120mm and 140mm fans for cooling. Sound-wise, this is a huge improvement over the stock 80mm fans used in the default enclosure, as these larger fans can spin much slower than stock and still keep the system cool.

On several occasions I have been working in my office and have heard the fans in my lab servers revving up and back down again. At first I thought that maybe a fan was failing and that one of my systems was overheating, or that one of my systems had sucked in a bit too much basement dirt and dust. However, neither was the case.

Specifically, what was happening was this… The systems were running cool, so the fans would spin down to a low RPM. The system would then throw a low RPM threshold alert and spin the fan back up. When this occurs, the system switches into some sort of “Critical Cooling Mode” and spins all the fans up to 100% for a few seconds. Rinse and repeat a few dozen times and you hear what almost sounds like an intoxicated neighbor playing with his new weedeater.

Using the ipmitool command from my Linux desktop, I first logged into the IPMI controller on my systems and checked to make sure that the fans were actually working properly. SDR is short for Sensor Data Repository.

# ipmitool -H 10.1.0.104 -U admin -P <password> sdr list
Fan2 | 2176 RPM | ok
Fan3 | 340 RPM | ok
Fan4 | no reading | ns
Fan5 | 544 RPM | ok
Fan6 | 340 RPM | ok
…truncated…

Fan6 above is spinning mighty slow. Slow enough to drop below the Lower Non-Critical Value threshold. Rather than increase the speed of the fan as I have seen others do, I decided to lower the thresholds for this fan with the command below. Since Fan5 is also a bigun’ I decided to preemptively adjust its thresholds as well.

ipmitool -H 10.1.0.104 -U admin -P <password> sensor thresh Fan6 lower 100 200 300
ipmitool -H 10.1.0.104 -U admin -P <password> sensor thresh Fan5 lower 100 200 300
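
The three numbers after `lower` are, in order, the lower non-recoverable, lower critical, and lower non-critical thresholds in RPM. To spot which fans are likely to trip a threshold before changing anything, something like the sketch below can help. It is only an illustration: `low_fans` is a made-up helper name, and the limit you pass in is whatever RPM you consider "too slow" for your own fans.

```shell
# Reads `ipmitool ... sdr list` output on stdin and prints the name and RPM
# of every fan whose reading is below the limit given as $1.
# low_fans is a hypothetical helper, not part of ipmitool.
low_fans() {
    awk -F'|' -v limit="$1" '
        $2 ~ /RPM/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the sensor name
            rpm = $2 + 0                        # " 340 RPM " converts to 340
            if (rpm < limit) print name, rpm
        }'
}
```

Usage would look like `ipmitool -H 10.1.0.104 -U admin -P <password> sdr list | low_fans 400`, where 400 is an arbitrary example limit. Sensors with no reading are skipped.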

How to Add Users via the CLI to a Thinklogical Secure Console Server

Since the Thinklogical Secure Console Server is running a variation of Red Hat Linux, it is very tempting to try to add users via the command line using the useradd command as you normally would on a Linux machine. Please note that you probably do not want to do this for several reasons.

First off your users will be unable to connect to any serial ports and possibly be unable to log into the webUI at all.

So here is the first thing you need to know about adding users or changing passwords on the command line. Do not create any passwords over 10 characters. Do this, and your users will be unable to log into the webUI.

Even worse, do this to your root account, and you have now locked yourself out of the web interface.

The second thing to note is that if you use the useradd command, your new user will not have the correct permissions to connect to any of the managed serial ports.

Instead, Thinklogical has created a wrapper script for useradd, called adduser. This custom command creates your user and adds it to the proper groups and required configuration file. When added properly each user ends up with their own config file in /etc/lsi/conf.

In this example I created a user named “admin” using the adduser command. In order to connect to serial ports you must be a member of the “scsusers” group. To monitor (or view) other users’ active connections, your user will need to be a member of the “monitor” group. Below you can see my new user’s group memberships.

[root@scs config]# id admin

uid=500(admin) gid=701(scsusers) groups=701(scsusers),702(monitor)

Now even though my user is a member of the correct groups, I still need a config file in /etc/lsi/conf. Since my userid is admin, my config file will be /etc/lsi/conf/admin.conf.

ESCAPE_SEQ="\x1bA"
BREAK_SEQ="\x1bB"
ALLOW_CLEAR=1-48
ALLOW_CONNECT=1-48
ALLOW_MONITOR=1-48
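
As a quick sanity check on a user, the two requirements described above (membership in “scsusers” and a per-user config file) can be tested from the shell. This is just a sketch: `scs_user_ok` is a hypothetical helper of my own naming, not a Thinklogical tool, and it only checks what this post describes.

```shell
# Hypothetical helper: succeeds only if the group list contains "scsusers"
# and the per-user config file exists.
# $1 = space-separated group names, e.g. the output of: id -nG admin
# $2 = config file path, e.g. /etc/lsi/conf/admin.conf
scs_user_ok() {
    case " $1 " in
        *" scsusers "*) ;;                        # group found, keep checking
        *) echo "not in scsusers"; return 1 ;;
    esac
    if [ ! -f "$2" ]; then
        echo "missing $2"
        return 1
    fi
    echo "ok"
}
```

On the SCS itself this would be called as `scs_user_ok "$(id -nG admin)" /etc/lsi/conf/admin.conf`.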

Thinklogical Secure Console Server Super Quick Start Guide


Today we are going to dive into how to set up and use Thinklogical’s line of Secure Console Servers. What I like about these devices (available in 8-, 16-, 32-, and 48-port models) is that they are actually running Linux, so the setup and configuration is a breeze via the command line for anyone comfortable on a Red Hat-based system.

Initial Device Setup and Configuration.

There are two pretty simple ways to connect to your SCS once you have unboxed it and powered it up.

The first is via IP. The default IP address of the device is 10.9.8.7. So plug one end of an Ethernet cable into a network port on a laptop or desktop. Plug the other end of this same cable into the first network port on the SCS and configure your workstation so that it has an IP address on the same network as the SCS. In my case I set the IP address of my laptop to 10.9.8.8. No gateway is needed when connected directly. This method enabled me to either ssh directly into the device or connect to it via a web browser.

The second method is via a serial connection to the SCS’s console port. In this case I fired up minicom (HyperTerminal or PuTTY will do as well if you are running Windows) and configured it to use /dev/ttyUSB0, which is the device associated with my USB to serial converter. If you have a serial port on your laptop you can skip the USB to serial adapter and just plug right into the serial port on your workstation. This method allows you to log in directly to the device’s console. In this scenario, I used a Cisco console cable to connect the two devices together.

The initial login and password are root/root. It goes without saying that you need to change this password ASAP.


Fedora/RHEL – Find NVIDIA Video Card Model and Driver Version

So a while back I had the need to buy a new video card for my Fedora 14 desktop. Specifically, I needed a semi-decent HDMI-capable PCI card, as my current workstation at home only has one PCI-E slot and I wanted to use my newly acquired LSI 8888ELP in that slot. A picture of the hardware overkill is below.

 

AmlxfacCIAEVHcT

Sorry for the Blurry Image … I was giggling when I took the pic

Anyway, not that it's pertinent to this article, but the 8888ELP was to be configured with 4 SSDs in RAID10, which to me was much cooler than having a top-of-the-line video card for a box that rarely did anything graphics intensive.

Anyway, back to the story. So I poked around online and found this big mamba-jamba PCI Nvidia card with HDMI out, which, if you look at the picture below, is basically a giant heat sink with a video card attached to its undercarriage. Apparently it's perfect for an HTPC as it's nice and quiet without a fan. So boom, I slap that sucker in and go on with my life.

 

Aml1oL6CMAArO4N

Giant Heatsink with Video Card Attached

Anyway, fast forward a few months and NVIDIA has new drivers available for Linux, which supposedly offer massive performance improvements.  So I figured that I would try them out, but by this time I have forgotten what video card I am using, who made it, and if I am using the vendor's drivers.

So the first thing I do is run the command below which updates the hardware descriptions which lspci spits out

# update-pciids

Next I run lspci and figure out that I have an NVIDIA Card.

06:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 430] (rev a1)

OK, now let's see if I am using NVIDIA's drivers. I accomplish this by attempting to run the command below

# nvidia-settings

Which pops up a nice little GUI, that tells me which driver version I am running.

NVIDIA Driver Version :280.13
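
If you would rather not scroll through the full lspci listing, the model string can be filtered out on the command line. The sketch below is one way to do that; `vga_models` is a helper name I made up for illustration.

```shell
# Reads `lspci` output on stdin and prints only the description of each
# VGA device. vga_models is a hypothetical helper name.
vga_models() {
    sed -n 's/^.*VGA compatible controller: //p'
}

# When NVIDIA's own driver is loaded, its version can also be read without
# the GUI from /proc/driver/nvidia/version (for example with cat).
```

Run it as `lspci | vga_models`.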

Broadcom (bnx2) Network Adapters Dropping Received Packets Under Linux

So a few weeks ago some of our CentOS 5.4 and OEL 5.5 servers started exhibiting strange connectivity problems. Monitoring started alerting that hosts were down when they weren't; some boxes could ping target hosts and some couldn't; some boxes became unresponsive when interfaces were failed over; and strangest of all, some of the boxes would magically "repair" themselves. Like I said, strange.

Over the next week or so we ran into the issue a few more times and were able to see a pattern emerge. All the affected servers were running CentOS 5.4 or Oracle Linux 5.5 and had Broadcom (bnx2) adapters that were on the receiving end of some pretty decent traffic. Most importantly, all had a good number of dropped received packets, a count that was continuously, albeit slowly, increasing.
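
One simple way to watch that counter climb is to read the per-interface statistics that the kernel exposes under /sys/class/net. The sketch below assumes your kernel provides the `statistics/rx_dropped` files there; `rx_dropped` as a function name is my own invention for illustration.

```shell
# Prints "<interface> <rx_dropped count>" for every interface found under
# the given directory (defaults to /sys/class/net).
# Assumes the kernel exposes statistics/rx_dropped for each interface.
rx_dropped() {
    root="${1:-/sys/class/net}"
    for dir in "$root"/*/; do
        f="${dir}statistics/rx_dropped"
        [ -r "$f" ] && echo "$(basename "$dir") $(cat "$f")"
    done
    return 0
}
```

Running `rx_dropped` a few minutes apart and comparing the numbers shows whether drops are still accumulating.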

A bit of Google research led us to this bugzilla, which suggested changing the adapter's coalescence settings…. so a bit on coalescence.

Coalescence

In your network adapter, coalescence is all about interrupts. Traditionally, interrupt coalescence (or IC) is used to reduce the number of interrupts generated by the system by delaying the generation of an interrupt by a very short period of time…think less than a millisecond. In turn, more traffic will be received by the host before the next interrupt fires, so each interrupt covers a larger batch of packets. You can find out more than you would ever want to know about coalescence here

The Fix

So apparently the Broadcom IC settings were not aggressive enough. Packets would come in, fill up the receive queue, and get dropped before they could be sent off for processing via an interrupt. This takes us back to the bugzilla above and the suggested settings below which you set with the ethtool command

 ethtool -C ethX rx-usecs 8 rx-usecs-irq 8 rx-frames 0 rx-frames-irq 0

Note that this was not an issue on any server running CentOS 5.6, any server with Intel adapters, or any server with 10g adapters. As a matter of fact, those servers had IC settings even more aggressive than those above. See the Intel 82599EB 10-gigabit settings below

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0
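
Settings like these can be read back with `ethtool -c` (lowercase c), which prints one `name: value` line per parameter. To pull out a single value, for example to confirm that a change took effect, a small filter like the one below works. `coalesce_value` is a hypothetical helper name and `eth0` in the usage line is just an example interface.

```shell
# Reads `ethtool -c <iface>` output on stdin and prints the value of the
# coalescing parameter named in $1 (exact match on the parameter name).
coalesce_value() {
    awk -F': *' -v key="$1" '$1 == key { print $2 }'
}
```

For example, `ethtool -c eth0 | coalesce_value rx-usecs` prints only the current rx-usecs setting.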

Final Configuration

So now that we know the fix, we need to make it permanent, which is not as easy as editing a config file for the device, as the coalescence config is set at boot and is part of the installed driver for the device. Rather than muck around with trying to modify the driver itself, we decided to configure our devices at boot time with an rc script that checks each network interface on the box and modifies its IC settings if it is using the bnx2 (Broadcom) driver. We dropped the script below into /etc/rc.d and created a symbolic link to it in /etc/rc3.d.

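#!/bin/bash
# Apply the bnx2 interrupt coalescence settings to every Broadcom interface.

case "$1" in
start)
        # Derive interface names (eth0, eth1, ...) from the ifcfg-eth* files,
        # skipping backup copies.
        IFACE=$(ls /etc/sysconfig/network-scripts/ifcfg-eth* | grep -v bak | cut -d - -f 3)

        for ETH in $IFACE
        do
                # ethtool -i reports the driver bound to the interface
                if ethtool -i "$ETH" | grep -qw bnx2
                then
                        echo "Changing settings for $ETH"
                        ethtool -C "$ETH" rx-usecs 8 rx-usecs-irq 8 rx-frames 0 rx-frames-irq 0
                else
                        echo "$ETH is not a Broadcom"
                fi
        done
        exit 0
        ;;

stop)
        echo " hammer time"
        ;;

*)
        echo "usage: $0 (start|stop)"
        ;;
esac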

Poor Man’s eSATA Drive Hot Swap without AHCI or Hotplug Support Under Linux

 

 


Unfortunately, hot-swapping an eSATA drive is a bit more complicated than hot swapping a USB drive.

First off, your BIOS needs to support AHCI (click here for more info on AHCI), and your SATA controller needs to support it as well. Secondly, your OS needs to specifically support hot plug, and in the case of Windows 7, it won't boot if you change to AHCI after the OS has been installed.

So, in my case I need to update firmware on lots of SATA SSDs and want to do so without rebooting and without worrying about changing BIOS settings. In order to keep things simple, I followed the procedure below.

First, you need to detect your drive. So watch dmesg to see what drive letter is assigned to your new disk upon initial connection.

# dmesg

[86527.985994]  sdd: unknown partition table
[86528.012820] sd 8:0:0:0: [sdd] Assuming drive cache: write through
[86528.012823] sd 8:0:0:0: [sdd] Attached SCSI disk
[86528.456281] device label btrfs devid 1 transid 11 /dev/sdd

Then, when it's time to remove the disk, do the following. Substitute your own disk device letters.

# echo 1 > /sys/block/sdd/device/delete

Now you are free to swap your disk. No reboot and no BIOS changes required.
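
If the disk might have mounted filesystems, it is worth checking before detaching it. The sketch below wraps the delete step with a mount check and a sync. Both function names are hypothetical, and the check only looks at plain partitions listed in /proc/mounts, so anything layered on top of the disk (LVM, md, multi-device btrfs) would need its own check.

```shell
# Succeeds if the disk ($1, e.g. sdd) or one of its partitions appears in
# the mounts table ($2, defaults to /proc/mounts).
disk_in_use() {
    grep -q "^/dev/$1[0-9]* " "${2:-/proc/mounts}"
}

# Hypothetical wrapper around the delete step shown above. Needs root.
safe_detach() {
    if disk_in_use "$1"; then
        echo "/dev/$1 is still mounted; unmount it first" >&2
        return 1
    fi
    sync                                      # flush pending writes
    echo 1 > "/sys/block/$1/device/delete"    # detach the device from the kernel
}
```

With this in place, `safe_detach sdd` refuses to remove a disk that is still in use.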

LSI MegaCLI – Check for Failed RAID Controller Battery

There are several tools that you can use to monitor and configure an LSI SAS controller. However, as I have found, some are easier to use than others and some do not always display the correct information.

In my case my controller is a SAS 9260-8i, and when building a server I always make sure that I install the MegaRAID Storage Manager GUI for configuring disks and setting up email alerts. However, I have often found this tool confusing to use for other tasks, so I also make sure that I install MegaCLI (the command line interface). Both utilities can be downloaded directly from LSI here.

MegaRAID Storage Manager installs to /usr/local/MegaRAID Storage Manager, while the CLI installs via rpm to /opt/MegaRAID/MegaCli.

Anyway, to check the battery status, run the following (note that I am running a 64-bit OS)

# ./MegaCli64 -AdpBbuCmd -aAll

Your output will be lengthy – but look for the line below to know if you need to replace your BBU.

Battery Replacement required            : Yes
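
Since the output is long, the check is easy to script. The sketch below looks only for the line shown above; `bbu_needs_replacement` is a hypothetical helper name, and it assumes the output wording matches what my controller prints.

```shell
# Reads MegaCli BBU output on stdin and succeeds if it contains the
# "Battery Replacement required : Yes" line, whatever the column padding.
bbu_needs_replacement() {
    grep -Eq 'Battery Replacement required[[:space:]]*:[[:space:]]*Yes'
}
```

For example: `./MegaCli64 -AdpBbuCmd -aAll | bbu_needs_replacement && echo "replace the BBU"`.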

Two additional useful commands are:

  • megacli -AdpAllInfo -aALL lists all the adapters in the machine
  • megacli -PDList -aALL lists all disks and enclosures

Note that there is an open source CLI called Megactl, and while it's quick and easy to use to see a quick list of your disks and their statuses, it has not shown itself to be accurate when it comes to detecting whether or not a battery has failed. You can get it here

Additional MegaCli commands can be found here: https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskRefPerc