Ubuntu Linux: Locate Failed DIMMs without Pain

So I recently made the switch to Kubuntu 12.10 on my new desktop. Basically I am building a monster workstation and I ran into issues running Fedora 18. Since I wanted Steam support too, I decided not to move to an earlier Fedora version and chose to give Ubuntu another try instead (it's been years since I have run Ubuntu).

Anyway, I am building this hoss of a workstation that has 12 DIMM slots, which I fully populated with 4GB DIMMs. However, when I booted my new monster, I found that I was 8GB short in the memory department.

So, how do I figure out which two DIMMs are bad? I certainly don't want to have to pull all of them out and boot the machine to test each DIMM one by one.

This is where lshw comes to the rescue, a tool I blogged about back in 2010 here.

Anyway, here is how you find the empty slots.

# lshw -short -C memory

This produced the output you see below.

/0/14                           memory      System Memory
/0/14/0                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/14/1                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/14/2                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/14/3                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/14/4                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/14/5                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/22                           memory      System Memory
/0/22/0                         memory      DIMM DDR3 [empty]
/0/22/1                         memory      DIMM DDR3 [empty]
/0/22/2                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/22/3                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/22/4                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/22/5                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)

Since every slot is physically populated, the two [empty] entries are telling me that the first two DIMMs on the second CPU are dead, and those are the ones that need to be replaced.
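With twelve slots the empty ones are easy to spot by eye, but on bigger boxes a small filter helps. This is only a sketch: it assumes lshw marks undetected modules with the literal text [empty], as in the output above, and the function name empty_dimm_slots is my own invention, not part of lshw.

```shell
#!/bin/sh
# Hypothetical helper (not part of lshw): read `lshw -short -C memory`
# output on stdin and print the hardware path of every slot marked [empty].
empty_dimm_slots() {
    awk '/\[empty\]/ { print $1 }'
}

# Demo on three lines copied from the output above.
empty_dimm_slots <<'EOF'
/0/22/0                         memory      DIMM DDR3 [empty]
/0/22/1                         memory      DIMM DDR3 [empty]
/0/22/2                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
EOF
```

On a live machine the equivalent would be lshw -short -C memory | empty_dimm_slots, run as root.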

Now all I have to do is power down, then pull and replace two DIMMs, which will save my fingers from much discomfort.
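After the swap it is worth confirming that the total is back where it should be (12 x 4GiB = 48GiB in my case). Here is a rough sketch that adds up the DIMM sizes lshw reports. It assumes sizes are printed in whole GiB as in the output above, and detected_gib is a name I made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: sum the sizes of all detected DIMMs, in GiB, from
# `lshw -short -C memory` output on stdin. Only fields that look like
# "<number>GiB" on DIMM lines are counted, so [empty] slots add nothing.
detected_gib() {
    awk '/GiB DIMM/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^[0-9]+GiB$/) { sub(/GiB$/, "", $i); total += $i }
         }
         END { print total + 0 }'
}

# Demo on three lines copied from the output above.
detected_gib <<'EOF'
/0/22/1                         memory      DIMM DDR3 [empty]
/0/22/2                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
/0/22/3                         memory      4GiB DIMM DDR3 1066 MHz (0.9 ns)
EOF
```

On the real machine, lshw -short -C memory | detected_gib should print the full installed amount once the bad DIMMs are replaced.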

LSI MegaCLI: Check for Failed RAID Controller Battery

There are several tools that you can use to monitor and configure an LSI SAS controller. However, as I have found, some are easier to use than others, and some do not always display the correct information.

In my case the controller is a SAS 9260-8i, and when building a server I always make sure that I install the MegaRAID Storage Manager GUI for configuring disks and setting up email alerts. However, I have often found this tool confusing to use for other tasks, so I also make sure that I install MegaCLI (the command line interface). Both utilities can be downloaded directly from LSI here.

MegaRAID Storage Manager installs to /usr/local/MegaRAID Storage Manager, while the CLI installs via rpm to /opt/MegaRAID/MegaCli.

Anyway, to check the battery status, run the following from /opt/MegaRAID/MegaCli (note that I am running a 64-bit OS, hence MegaCli64).

# ./MegaCli64 -AdpBbuCmd -aAll

The output will be lengthy, but look for the line below to know whether you need to replace your BBU.

Battery Replacement required            : Yes
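For a cron job or a monitoring check, that one line can be tested for directly instead of reading the whole dump. A minimal sketch, assuming the line looks exactly like the one above and reads something other than Yes on a healthy battery; bbu_needs_replacement is a hypothetical name of my own, not a MegaCli feature.

```shell
#!/bin/sh
# Hypothetical helper: read `MegaCli64 -AdpBbuCmd -aAll` output on stdin and
# succeed (exit status 0) only if a "Battery Replacement required : Yes"
# line is present. The amount of whitespace around the colon is not fixed.
bbu_needs_replacement() {
    grep -Eq 'Battery Replacement required[[:space:]]*:[[:space:]]*Yes'
}

# Demo on the line quoted above.
if printf '%s\n' 'Battery Replacement required            : Yes' | bbu_needs_replacement; then
    echo "BBU needs replacing"
fi
```

Against the real controller this would be ./MegaCli64 -AdpBbuCmd -aAll | bbu_needs_replacement, with the exit status driving whatever alert you prefer.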

Two additional useful commands are:

  • megacli -AdpAllInfo -aALL shows detailed information for every adapter in the machine
  • megacli -PDList -aALL lists all disks and enclosures
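The -PDList output is also long, so a quick per-disk summary can help. The sketch below is an assumption-heavy example: it relies on each disk's section containing a "Slot Number:" line followed by a "Firmware state:" line, which is how I remember the output looking, so check it against your own controller first. The pd_states name is made up.

```shell
#!/bin/sh
# Hypothetical helper: condense `megacli -PDList -aALL` output (on stdin)
# to one line per disk. ASSUMPTION: each disk section has a "Slot Number:"
# line before its "Firmware state:" line; verify on your own controller.
pd_states() {
    awk -F': *' '
        /^Slot Number/    { slot = $2 }
        /^Firmware state/ { print "slot " slot ": " $2 }
    '
}

# Usage on a live system (commented out so the sketch runs anywhere):
# megacli -PDList -aALL | pd_states
```

Any disk whose state is not what you expect then stands out without scrolling through the full listing.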

Note that there is an open source CLI called megactl. While it's quick and easy to use for a quick list of your disks and their statuses, it has not shown itself to be accurate when it comes to detecting whether or not a battery has failed. You can get it here.

Additional MegaCli commands can be found here: https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskRefPerc