Basic AIX Performance Troubleshooting Commands

Wow, today I logged into my first AIX Server in about 4.5 years. It was a horrible experience. I’ve been working with Redhat/CentOS pretty much exclusively for so long, I was mostly helpless to do anything of importance on the CLI other than create a few users and move some files around. None of the common commands that I am so used to using even exist in AIX.

Figured I would do a bit of homework and figure out how to do some basic troubleshooting before I was in a server down situation with no idea how to troubleshoot.

Checking Free Memory

To check free memory on a box use the svmon command.

svmon -G
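As far as I can tell, svmon -G reports its figures in 4 KB pages rather than megabytes, so the raw number takes a bit of arithmetic. Below is a minimal sketch of that conversion. It assumes the "memory" row lists size, inuse and free in that order (the layout may differ between AIX levels), and the function name svmon_free_mb is my own.

```shell
# Convert the "free" column of the svmon -G "memory" row from 4 KB pages to MB.
# Assumed row layout: memory <size> <inuse> <free> ...
svmon_free_mb() {
    awk '$1 == "memory" { printf "%d\n", $4 * 4 / 1024 }'
}

# Usage on an AIX host:
#   svmon -G | svmon_free_mb
```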

Overall System Status

For this you will probably want to use topas, which is pretty similar to top. Topas gives you a quick and dirty overview of what is going on on a system. Here you can find CPU usage, top processes, and disk utilization.

List Volume Groups

Wow, Linux has really confused me on this one. Anyway, use lsvg. The -o flag limits the list to the active (varied-on) volume groups.

# lsvg -o
rootvg
crsrdb_bin
crsprdb_data
crsprdb_index
crsprdb_arch
crsprdb_rman

List Info About a Volume Group.

# lsvg rootvg

Display Names of all Logical Volumes in a Volume Group.

# lsvg -l rootvg

Display Physical Memory

# lsattr -El sys0 -a realmem
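The realmem attribute is reported in kilobytes, which is not the friendliest unit. A small sketch for turning it into megabytes follows. It assumes the lsattr output line starts with the attribute name followed by its value, and the function name realmem_mb is made up for this example.

```shell
# Read `lsattr -El sys0 -a realmem` output on stdin and print the value in MB.
# Assumed line layout: realmem <kilobytes> <description...>
realmem_mb() {
    awk '$1 == "realmem" { printf "%d\n", $2 / 1024 }'
}

# Usage on an AIX host:
#   lsattr -El sys0 -a realmem | realmem_mb
```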

Finding Disk I/O Issues

Sar appears to be a fine option here, especially since I am looking for percent busy. Iostat also exists on AIX, btw.

# sar -d 1 2

Show Network Throughput

The more I poke around the internet trying to figure out how to actually use AIX the more I keep running into topas. Anyway this one is a good one

# topas -E

I plan to have more of these one liners documented here in the future, but for now this is going to have to do.

RHEL6 – SELinux Troubleshooting II: Electric Boogaloo

So a good while back I posted an article on how to troubleshoot SELinux violations, and after reviewing that article as part of a troubleshooting exercise, I realized that I left out a few details. Needless to say, my original article was not as clear as it should be. Anyway, I wanted to use up a few more bytes of the internet to clarify.

When the package setroubleshoot-server is installed, SELinux violations will be sent to /var/log/messages, which makes it fairly easy to troubleshoot SELinux issues.

So first let's install setroubleshoot and all its parts

# yum install setroubleshoot*

In my case on RHEL6, the following packages were installed

setroubleshoot-plugins-3.0.40-1.el6.noarch
setroubleshoot-server-3.0.47-3.el6_3.x86_64
setroubleshoot-3.0.47-3.el6_3.x86_64

Note that the setroubleshoot-server is the one that you need to troubleshoot via the command line.

Now let's generate a violation. In this case I am creating a file in /root and moving it into /var/www/html, so that it keeps the wrong SELinux context, and then trying to access it.

# touch /root/file3 && mv /root/file3 /var/www/html/file3

Check the context if you must to make sure that it's not correct for httpd content. In this case you can see that it is not.

# ls -lZ /var/www/html/file3
-rwxrwxrwx. root root system_u:object_r:admin_home_t:s0 /var/www/html/file3

Now start Apache and try to access the file via elinks or a browser. You will get a Forbidden error, which I have omitted below.

# elinks -dump http://localhost/file3

Note that you may need to restart auditd if your message does not show up in the messages file.

Aug 11 17:08:39 vfatmin01 setroubleshoot: SELinux is preventing /usr/sbin/httpd from getattr access on the file /var/www/html/file3. For complete SELinux messages. run sealert -l 5a413022-af89-4222-b055-0cc1edc4bbad

Note: You will also find the same error in /var/log/audit/audit.log, albeit in a bit less friendly format.

type=AVC msg=audit(1344719319.890:7196): avc:  denied  { getattr } for  pid=6765 comm="httpd" path="/var/www/html/file3" dev=dm-1 ino=656718 scontext=unconfined_u:system_r:httpd_t:s0 tcontext=unconfined_u:object_r:admin_home_t:s0 tclass=file
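If you do end up reading audit.log directly, the interesting parts of an AVC line are the source context, the target context, and the object class. Here is a rough sketch that pulls just those three out of each denial. The function name avc_fields is my own invention, and it only understands lines shaped like the one above.

```shell
# Print "<scontext> <tcontext> <tclass>" for every AVC denial read on stdin.
avc_fields() {
    awk '/avc: +denied/ {
        s = t = c = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^scontext=/) s = substr($i, 10)   # skip "scontext="
            if ($i ~ /^tcontext=/) t = substr($i, 10)   # skip "tcontext="
            if ($i ~ /^tclass=/)   c = substr($i, 8)    # skip "tclass="
        }
        print s, t, c
    }'
}

# Usage:
#   avc_fields < /var/log/audit/audit.log
```

In the example above, the mismatch between httpd_t and admin_home_t is what gives the problem away.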

Anyway back to the error from the messages file. At the end of the error you are shown the UUID of the error and the sealert command to run to get more information on the error.

# sealert -l 5a413022-af89-4222-b055-0cc1edc4bbad

Output below:

SELinux is preventing /usr/sbin/httpd from getattr access on the file /var/www/html/file3.

*****  Plugin restorecon (99.5 confidence) suggests  *************************

If you want to fix the label.
/var/www/html/file3 default label should be httpd_sys_content_t.
Then you can run restorecon.
Do
# /sbin/restorecon -v /var/www/html/file3

Wow, sealert actually tells you why the file is being blocked and the commands that you should run to fix the problem. Nice!
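Copying UUIDs out of the messages file by hand gets old quickly. As a sketch, the snippet below collects every sealert UUID that setroubleshoot has logged, assuming the log lines end with "run sealert -l <UUID>" as in the entry shown earlier. The function name sealert_ids is hypothetical.

```shell
# Print each unique sealert UUID found in setroubleshoot log lines on stdin.
sealert_ids() {
    sed -n 's/.*sealert -l \([0-9a-f-]\{36\}\).*/\1/p' | sort -u
}

# Usage:
#   sealert_ids < /var/log/messages | while read -r id; do sealert -l "$id"; done
```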

RHEL6 – Restore Grub on MBR

GRUB, which stands for the GRand Unified Bootloader, is the default boot loader in Linux these days (it replaced LILO). When your server boots, the system BIOS transfers control to the Master Boot Record of your first boot device, which is where Grub is installed. If the MBR is removed, damaged, or overwritten, you will not be able to boot, in which case you will need to repair/reinstall grub.

The entire process only takes a few minutes if you already have a Redhat/Centos cd to boot off of. Just slap that sucker into the cd drive (or virtual cd drive) and at the Boot Menu type “linux rescue”.

Then run grub as shown below.

# grub

Next, at the grub> prompt, identify your /boot partition

grub> root (hd0,0)

Then install first stage grub into the MBR

grub> setup (hd0)

Then quit the grub shell.

grub> quit

All you ever wanted to know about Grub can be found below

http://en.wikipedia.org/wiki/GNU_GRUB

Broadcom (bnx2) Network Adapters Dropping Received Packets Under Linux

So a few weeks ago some of our Centos 5.4 and OEL 5.5 servers started exhibiting strange connectivity problems. Monitoring started alerting that hosts were down when they weren't; some boxes could ping target hosts and some couldn't; some boxes became unresponsive when interfaces were failed over, and the strangest of all is that some of the boxes would magically "repair" themselves. Like I said, strange.

Over the next week or so we ran into the issue a few more times and were able to see a pattern emerge. All the affected servers were running Centos 5.4 or Oracle Linux 5.5 and had Broadcom (bnx2) adapters that were on the receiving end of some pretty decent traffic. Most importantly, all had a good number of dropped received packets, and that number was continuously, albeit slowly, increasing.
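A slowly climbing drop counter is easy to miss unless you sample it twice. The sketch below does that using the kernel's per-interface counter under /sys/class/net, which should correspond to the dropped figure that ifconfig prints for RX. The function names and the optional sysfs-root argument are my own additions for illustration, and the driver may also keep its own counters that this does not look at.

```shell
# Print the rx_dropped counter for an interface.
# $1 = interface name, $2 = sysfs net root (defaults to /sys/class/net).
rx_dropped() {
    cat "${2:-/sys/class/net}/$1/statistics/rx_dropped"
}

# Report how much the counter grew over an interval.
# $1 = interface, $2 = seconds to wait (default 10), $3 = optional sysfs root.
rx_dropped_delta() {
    _before=$(rx_dropped "$1" "$3") || return 1
    sleep "${2:-10}"
    _after=$(rx_dropped "$1" "$3") || return 1
    echo $(( _after - _before ))
}

# Usage:
#   rx_dropped_delta eth0 60
```

Anything above zero on a box that is otherwise healthy is worth a second look.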

A bit of google research led us to this bugzilla, which suggested changing the adapter's coalescence settings…. so a bit on coalescence.

Coalescence

In your network adapter, coalescence is all about interrupts. Traditionally, interrupt coalescence (or IC) is used to reduce the number of interrupts generated by the system by delaying the generation of an interrupt by a very short period of time… think less than a millisecond. In turn, more traffic will be received by the host before the next interrupt fires, so each interrupt covers a larger batch of packets. You can find out more than you would ever want to know about coalescence here

The Fix

So apparently the Broadcom IC settings were not aggressive enough. Packets would come in, fill up the receive queue, and get dropped before they could be sent off for processing via an interrupt. This takes us back to the bugzilla above and the suggested settings below which you set with the ethtool command

 ethtool -C ethX rx-usecs 8 rx-usecs-irq 8 rx-frames 0 rx-frames-irq 0

Note that this was not an issue on any Centos 5.6 server, any server with Intel adapters, or any server with 10g adapters. As a matter of fact, those servers had IC settings even more aggressive than those above. See the Intel 82599EB 10-gigabit settings below

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

Final Configuration

So now that we know the fix, we need to make it permanent, which is not as easy as editing a config file for the device, as the coalescence config is set at boot and is part of the installed driver for the device. Rather than muck around with trying to modify the driver itself, we decided to configure our devices at boot time with an rc script that checks each network interface on the box and modifies its IC settings if it is using the bnx2 (Broadcom) driver. We dropped the script below into /etc/rc.d and created a symbolic link to it in /etc/rc3.d.

#!/bin/bash
# Apply interrupt coalescence settings to every bnx2 (Broadcom) interface at boot.

case "$1" in
start)

# Interface names come from the ifcfg-eth* file names (backup copies are skipped)
IFACE=$(ls /etc/sysconfig/network-scripts/ifcfg-eth* | grep -v bak | cut -d - -f 3)

for ETH in $IFACE
        do
                if ( ethtool -i $ETH | grep -qw bnx2 )
                then
                        echo "Changing Settings for $ETH"
                        ethtool -C $ETH rx-usecs 8 rx-usecs-irq 8 rx-frames 0 rx-frames-irq 0
                else
                        echo "$ETH is not a broadcom"

                fi
        done
exit 0
;;

stop)

echo " hammer time"
;;

*)
    echo "usage: $0 (start|stop)"
;;
esac
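One fragile spot in the script is the cut pipeline, which only works because the path to the ifcfg files happens to contain exactly two hyphens before the interface name. As an alternative sketch, shell parameter expansion can strip the directory and the ifcfg- prefix regardless of the path. The function name iface_from_cfg is hypothetical.

```shell
# Print the interface name encoded in an ifcfg file path.
iface_from_cfg() {
    _name=${1##*/}                     # strip the directory part
    printf '%s\n' "${_name#ifcfg-}"    # strip the ifcfg- prefix
}

# Usage:
#   for f in /etc/sysconfig/network-scripts/ifcfg-eth*; do iface_from_cfg "$f"; done
```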

RHEL6 — Troubleshooting SELinux Violations

Dear Reader: Welcome to my third and not final installment on SELinux. The first two can be read here and here. They are exciting reads and are sure to have you on the edge of your seat.

Anyway, the best way to implement SELinux successfully is to know how to troubleshoot when things aren’t going your way. If you panic at the first sign of trouble, you are just going to end up turning off SELinux and not reap the rich rewards that it will bring you in life. Now that I have convinced you to run SELinux, let's get started.

First install the package setroubleshoot, which will send SELinux messages to our messages file.

yum -y install setroubleshoot-server.x86_64

Now you can search the messages file for SELinux violations. Use sealert -l UUID to find information on a specific incident, or sealert -a /var/log/audit/audit.log to search an entire log file for violations.

In this specific example, I created a test file and dropped it in /var/www/html, however I did not set the context to httpd_sys_content_t. Then I attempted to view the file in a browser. Obviously access was denied. The output of sealert shows me the error and then tells me how to fix it.

Summary:

SELinux is preventing /usr/sbin/httpd "getattr" access to /var/www/html/file3.

Detailed Description:

SELinux denied access requested by httpd. /var/www/html/file3 may be a
mislabeled. /var/www/html/file3 default SELinux type is httpd_sys_content_t, but
its current type is admin_home_t. Changing this file back to the default type,
may fix your problem.

…TRUNCATED…

Allowing Access:

You can restore the default system context to this file by executing the
restorecon command. restorecon '/var/www/html/file3', if this file is a
directory, you can recursively restore using restorecon -R
'/var/www/html/file3'.

Fix Command:

/sbin/restorecon '/var/www/html/file3'

Boom goes the dynamite! Problem solved.

RHEL6 – Managing Swap Space

Swap space on a Linux box is an area on disk that is used to hold inactive memory pages. This occurs when the system needs more memory than is currently available, so it swaps these inactive memory pages to disk.

To create additional swap space on the fly you are either going to need a spare disk or free partition on a disk that you can use.

First, using fdisk,  you will need to make sure that the partition type for the disk (or partition) is set to 82.

Then setup the swap area using mkswap. In this example I am using /dev/sdb2, but your setup is bound to be different.

>mkswap /dev/sdb2

Then determine the UUID of the new swap space.

>blkid /dev/sdb2

Then add an entry to /etc/fstab, so that the swap space is activated at boot time. The UUID shown in the example below comes from the output of the blkid command above.

>UUID=7b05f0a9-18d5-42e5-b259-78ba3a8cc1b7 swap                    swap    defaults        0 0
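To avoid typos when copying the UUID around, the fstab entry can be built straight from the blkid output. This is only a sketch: it assumes blkid prints a UUID="..." field on the device's line, and the function name swap_fstab_line is my own.

```shell
# Read `blkid <device>` output on stdin and print a matching swap line for /etc/fstab.
swap_fstab_line() {
    sed -n 's/.* UUID="\([^"]*\)".*/UUID=\1 swap swap defaults 0 0/p'
}

# Usage (review the line before appending it to /etc/fstab):
#   blkid /dev/sdb2 | swap_fstab_line
```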

Then activate your new swap space

>swapon -a

Then check to make sure everything worked by checking for your new swap partition in the output of the command below.

>swapon -s

/dev/sdb2                             partition       4193276 7480    0
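With several swap areas configured, it helps to total the columns instead of eyeballing them. A minimal sketch, assuming the usual swapon -s layout of a header line followed by Filename, Type, Size, Used and Priority columns, with sizes in kilobytes. The function name swap_totals_kb is hypothetical.

```shell
# Read `swapon -s` output on stdin and print "<total size KB> <total used KB>".
swap_totals_kb() {
    awk 'NR > 1 { size += $3; used += $4 } END { printf "%d %d\n", size, used }'
}

# Usage:
#   swapon -s | swap_totals_kb
```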

Troubleshooting NTP on Linux

Using NTP to set the time on a Linux server is not hard, however it can have a trick or two up its sleeve. In this example I was troubleshooting NTP on a RedHat 8 server (yes I know it's old).

Before we get started, the basics on NTP can be found here. A primer on the ntp.conf file can be found here. For most people this is all you will need to get ntp up and running. I unfortunately was not one of those people.

Below is the error message that I was receiving when I attempted to start ntp via ‘service ntpd start’.

ntpdate[4999]: no server suitable for synchronization found

What is this??? Unfortunately the server that I am attempting to sync to is behind a firewall and is not pingable, so doing a simple ping test to verify that I can connect to the box is out of the question. So I ask a network guy to check the firewall and he tells me that he sees the request coming from the box in question, but it's not going to the box that I specified in the ntp.conf. The answer can be found in the /etc/ntp/step-tickers file.

The step-tickers file is meant to hold an initial hostname or IP address to sync with upon startup of ntp. In RedHat, at least, the server runs an ntpdate against it. The entry in my step-tickers was an external host that was no longer accessible so I removed it and added one of my ntp hosts.

However the sync still failed. This time I take a close look at the current time on the client box, and sure enough the date is way off. By default, ntpd will not sync if there is more than a 1000 second difference between the host and the server. So I fix this using the date command and try again.
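The 1000 second check is simple enough to reproduce by hand when you want to know whether a clock is too far gone. A small sketch follows, taking two epoch timestamps in seconds. The function name within_panic_threshold is hypothetical, and how you obtain the server's time is left up to you.

```shell
# Succeed if two epoch timestamps (seconds) are no more than 1000 seconds apart.
# $1 = local clock, $2 = reference clock.
within_panic_threshold() {
    _diff=$(( $1 - $2 ))
    if [ "$_diff" -lt 0 ]; then
        _diff=$(( 0 - $_diff ))    # absolute value
    fi
    [ "$_diff" -le 1000 ]
}

# Usage (server_epoch is a placeholder for a time taken from a trusted source):
#   within_panic_threshold "$(date +%s)" "$server_epoch" || echo "set the clock by hand first"
```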

Again it fails..

So I run the ntpdate -bd command below to get some more info. The transmit section shows that I am not getting a response. This is not news to me, but it's good to verify.

ntpdate -bd <NTP_SERVER>
 ntpdate[5023]: ntpdate 4.2.0a@1.1199-r Thu May  4 11:01:34 EDT 2006 (1)
Looking for host <NTP_SERVER> and service ntp
host found :<NTP_SERVER>
transmit<NTP_SERVER>
transmit<NTP_SERVER>
transmit<NTP_SERVER>
transmit<NTP_SERVER>
transmit<NTP_SERVER>
<NTP_SERVER> Server dropped: no data
server <NTP_SERVER> port 123
stratum 0, precision 0, leap 00, trust 000
refid [10.253.82.1], delay 0.00000, dispersion 64.00000
transmitted 4, in filter 4
reference time:    00000000.00000000  Thu, Feb  7 2036  1:28:16.000
originate timestamp: 00000000.00000000  Thu, Feb  7 2036  1:28:16.000
transmit timestamp:  ccf1430f.64e5a35d  Mon, Dec 15 2008 15:56:47.394
filter delay:  0.00000  0.00000  0.00000  0.00000
         0.00000  0.00000  0.00000  0.00000
filter offset: 0.000000 0.000000 0.000000 0.000000
         0.000000 0.000000 0.000000 0.000000
delay 0.00000, dispersion 64.00000
offset 0.000000

Ok so onto the NTP host where I run the following command to sniff traffic on UDP port 123.

tcpdump dst port 123

There I can see the client communicating with the host

5:52:49.831677 IP <HIDDEN IP> .ntp > <HIDDEN IP>: NTPv4, Client, length 48

So another call goes out to the Admin of the ntp server to have him verify that ntp is set up properly and is running. I provided the information above to him. Turns out he had iptables running and was blocking NTP. The other Admin makes a change and I am off and running.