Using IPMI Tools to Monitor System Hardware

  • Created: May 2, 2011 1:47 PM

IPMI, the Intelligent Platform Management Interface, is an open standard that hardware manufacturers integrate into most systems. IPMI can provide real-time hardware information, including system temperatures, voltages, fan speeds and hardware status, and it can identify the types of hardware installed in the system.

Using IPMI is simple. Once ipmitool is installed for your distribution, you can pass command-line options to the ipmitool executable to obtain system information. One of the most useful features of IPMI is the ability to view error messages that the hardware has logged. If you have ever used Dell servers, IPMI can tell you in detail what caused the flashing orange light on your server. Some of the more useful commands are:

$ ipmitool sel list

This will provide a system event log similar to the following:

   1 | 01/15/2008 | 11:37:43 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | Pre-Init Time-stamp   | Power Supply #0x65 | Failure detected | Asserted
   3 | Pre-Init Time-stamp   | Power Supply #0x65 | Power Supply AC lost | Asserted
   4 | 01/22/2010 | 14:27:29 | Physical Security #0x73 | General Chassis intrusion | Asserted
   5 | 01/22/2010 | 14:31:18 | Physical Security #0x73 | General Chassis intrusion | Deasserted

In the above output you can see that one power supply lost AC power and that, at some point, the cover was removed from the server. These logs also contain other useful information, such as errors on a specific DIMM:

  5e | 04/06/2011 | 17:23:21 | Memory #0x1b | Transition to Non-critical from OK
  5f | 04/06/2011 | 17:24:55 | Memory #0x1b | Transition to Critical from less severe

System temperature issues will show up as well:

  5c | Pre-Init Time-stamp   | Temperature #0x76 | State Asserted
  5d | 01/07/2011 | 15:13:19 | Temperature #0x76 | State Asserted
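
On a busy server the event log can be long, so it may help to pull out only the entries for one kind of sensor. The following is a minimal sketch, not part of ipmitool itself: `sel_by_sensor` is a hypothetical helper name, and it assumes the sensor field looks like the samples above (a name followed by `#0x..`). Because "Pre-Init Time-stamp" entries have one fewer column, it looks for the sensor field rather than relying on a fixed column position.

```shell
#!/bin/sh
# sel_by_sensor NAME
# Hypothetical helper: reads `ipmitool sel list` output on stdin and prints
# only the entries whose sensor field (the one containing "#0x") includes NAME.
sel_by_sensor() {
    awk -F'|' -v want="$1" '
        {
            for (i = 1; i <= NF; i++) {
                # The sensor field is identified by its "#0x.." sensor number.
                if ($i ~ /#0x/ && index($i, want) > 0) {
                    print
                    next
                }
            }
        }'
}

# Usage (ipmitool usually needs root privileges):
#   ipmitool sel list | sel_by_sensor "Memory"
```

Piping the log through the helper with "Memory", "Temperature" or "Power Supply" should leave only the matching entries.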

Over time this list can become very large. If you fix an error while the system is running, you may also have to clear the system event log manually to make the flashing orange light go away. You can do this by running:

$ ipmitool sel clear

It will report the following when cleared:

Clearing SEL.  Please allow a few seconds to erase.
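
Clearing the log discards its history, so you may want to keep a copy first. The sketch below is one possible way to do that, using only the `sel list` and `sel clear` commands shown above. The function name `archive_and_clear_sel` and the `sel-<timestamp>.log` file naming are assumptions made for illustration.

```shell
#!/bin/sh
# archive_and_clear_sel [DIR]
# Hypothetical helper: saves the current SEL to a timestamped file in DIR
# (default: current directory) and clears the SEL only if the save succeeded.
archive_and_clear_sel() {
    dir=${1:-.}
    file="$dir/sel-$(date +%Y%m%d-%H%M%S).log"
    if ipmitool sel list > "$file"; then
        echo "Saved SEL to $file"
        ipmitool sel clear
    else
        echo "Could not read the SEL; not clearing it" >&2
        return 1
    fi
}

# Usage (ipmitool usually needs root privileges):
#   archive_and_clear_sel /var/log
```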

IPMI can also provide real-time system information by running:

$ ipmitool sdr

A list similar to the following will be provided:

Temp             | -55 degrees C     | ok
Temp             | -58 degrees C     | ok
Temp             | 40 degrees C      | ok
Temp             | 40 degrees C      | ok
Ambient Temp     | 26 degrees C      | ok
CMOS Battery     | 0x00              | ok
ROMB Battery     | 0x00              | ok
VCORE            | 0x01              | ok
VCORE            | 0x01              | ok
CPU VTT          | 0x01              | ok
1.5V PG          | 0x01              | ok
1.8V PG          | 0x01              | ok
3.3V PG          | 0x01              | ok
5V PG            | 0x01              | ok
1.5V PXH PG      | 0x01              | ok
5V Riser PG      | 0x01              | ok
Backplane PG     | 0x01              | ok
Linear PG        | 0x01              | ok
0.9V PG          | 0x01              | ok
0.9V Over Volt   | 0x01              | ok
CPU Power Fault  | 0x01              | ok
FAN MOD 1A RPM   | 7575 RPM          | ok
FAN MOD 1B RPM   | 7650 RPM          | ok
…
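
Since the sensor list is long, a quick way to spot trouble is to print only the sensors whose status column is something other than `ok`. This is a rough sketch that assumes the three-column `name | reading | status` layout shown above. `sdr_not_ok` is a hypothetical helper name. Sensors that report no reading at all may also show up, because their status is not `ok` either.

```shell
#!/bin/sh
# sdr_not_ok
# Hypothetical helper: reads `ipmitool sdr` output on stdin and prints every
# sensor line whose third column (the status) is not exactly "ok".
sdr_not_ok() {
    awk -F'|' '
        NF >= 3 {
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim surrounding whitespace
            if (status != "ok") print
        }'
}

# Usage (ipmitool usually needs root privileges):
#   ipmitool sdr | sdr_not_ok
```

No output means every sensor reported `ok`.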

If there are any specific hardware failures, you will be able to see them in the sdr output. A simpler summary of the hardware status is also available with:

$ ipmitool chassis status

System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Sleep Button Disable : not allowed
Diag Button Disable  : allowed
Reset Button Disable : not allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : true
Reset Button Disabled: false
Power Button Disabled: false
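
For a quick health check, the chassis status can be reduced to just the fault lines that are not `false`. The sketch below assumes the `name : value` layout and the field names shown above (those containing "Fault" or "Overload"). `chassis_faults` is a hypothetical helper name.

```shell
#!/bin/sh
# chassis_faults
# Hypothetical helper: reads `ipmitool chassis status` output on stdin and
# prints the "Fault" and "Overload" lines whose value is not "false".
chassis_faults() {
    awk -F':' '
        $1 ~ /Fault|Overload/ {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding whitespace
            if (value != "false") print
        }'
}

# Usage (ipmitool usually needs root privileges):
#   ipmitool chassis status | chassis_faults
```

An empty result means no chassis-level fault was reported.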

  • Steve Crye

    Brad, we have two new identical Dell PE R710 servers running CentOS 5.5 x64. One is fine; the other’s fans scream like banshees. We have spent two weeks with Dell tech support trying everything to fix it. They replaced the motherboard, CPUs, heatsinks and two fans. We used the USC utility to update the iDRAC6 and BIOS to the latest versions. Nothing has worked. I’d like to at least be able to see the fan speeds in Linux. I’m not a guru. I am desperate: the whine, even through a soundproof wall between me and the datacenter, is driving me crazy! Can you help me get IPMI installed? Thanks!

    Steve

  • http://www.nexcess.net/ Brad

    Steve,

    If you are running CentOS 5.5 and are using the CentOS repos, you can
    just run ‘yum install OpenIPMI’. It should be the only thing you need
    to install. After installing, start ipmi with ‘service ipmi start’ and
    the ipmitool command should then work.

    Thanks!

    -Brad
