spod.cx

Sun Microsystems Support on a Sun Fire X2100

Introduction

I've ranted previously about problems ordering hardware from Sun, but I feel in the interests of fairness that I should even things up by talking about how consistently excellent Sun's hardware support is.

Support levels

Sun support broadly falls into three categories - Warranty, SunSpectrum and Hardware Service Plans.

Warranty is the standard "if the hardware breaks within the warranty period we'll fix it, but you're otherwise on your own" level of support. SunSpectrum covers Sun hardware and Solaris support, so you end up with support for the entire platform with a guaranteed level of service (response within X hours, 24/7 or business hours, depending on how much you pay). Hardware Service Plans cover the hardware with a guaranteed level of service.

Most of our hardware is covered under SunSpectrum Silver, which gives 8-8 Monday to Friday coverage with a 4-hour response time, but we leave some less important machines on warranty support only.

The Problem

The X2100 is one of the new "entry level" Opteron X64 servers, with support for a dual-core CPU and a couple of SATA disks. Ours is used as half of an LVS load-balancing cluster, so it's not running Solaris, and its services are covered by another machine in the event of any problems. As a result, we hadn't bothered paying for any extra level of support.

One morning I came into work and our monitoring service was complaining that the machine had stopped responding shortly after I'd left home. The other node of our load balancer had happily picked up its services, so I wandered into the machine room to have a look. The machine was indeed dead, but after power-cycling it came up OK. I left it alone for a couple of days to see if the problem recurred, and a couple of days later it started generating disk read errors:

Apr 24 10:17:09 andimok kernel: ata2: translated ATA stat/err 0x51/84 to SCSI SK/ASC/ASCQ 0xb/47/00
Apr 24 10:17:09 andimok kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Apr 24 10:17:09 andimok kernel: ata2: error=0x84 { DriveStatusError BadCRC }

Apr 24 10:17:09 andimok kernel: sd 1:0:0:0: SCSI error: return code = 0x8000002
Apr 24 10:17:09 andimok kernel: sdb: Current: sense key: Aborted Command
Apr 24 10:17:09 andimok kernel:     Additional sense: Scsi parity error
Apr 24 10:17:09 andimok kernel: end_request: I/O error, dev sdb, sector 7807590

After running some tests it seemed clear that one of the disks was generating errors, but unlike a normal disk failure, some types of I/O would lock the machine up and require power-cycling. We had a "spare" X2100 that wasn't in service, so I swapped one of its disks in for the faulty one. Curiously enough, the replacement generated some of the same errors, suggesting it wasn't just a disk fault.
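For anyone wanting to check which device errors like these are piling up against, here is a minimal Python sketch that tallies "end_request: I/O error" lines per device. It is not what I actually ran at the time; the function name count_io_errors and the sample list are made up for illustration, and the pattern assumes the kernel message format shown in the log excerpt above.

```python
import re
from collections import Counter

# Matches kernel lines of the form shown in the excerpt above, e.g.
#   "... kernel: end_request: I/O error, dev sdb, sector 7807590"
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the excerpt above; only the second is an I/O error.
sample = [
    "Apr 24 10:17:09 andimok kernel: ata2: error=0x84 { DriveStatusError BadCRC }",
    "Apr 24 10:17:09 andimok kernel: end_request: I/O error, dev sdb, sector 7807590",
]

for device, total in count_io_errors(sample).items():
    print(device, total)
```

To run it against a real log, pass an open file object instead of the sample list. The path to the kernel log varies between distributions, so none is hard-coded here.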

Support Call

I phoned Sun Support on the Monday morning and logged the fault. The front-line support person I spoke to took the details (and was fine with it running Debian Linux) and told me that as a warranty call I'd have an engineer call back by the end of the working day. They actually got back to me within a couple of hours, and asked me to send some logs and a report of what had happened. I did this, including my suspicion that it was more than a simple disk failure, and got an email later in the afternoon asking for complete logs and the output of some commands that are Solaris-specific (iostat -En and metastat). I sent the closest equivalent output, and the relevant log files.

The next thing we heard was a call from Sun dispatch on Tuesday, which was taken by one of my colleagues as I had the afternoon off. They said they would send out replacement parts, and that when they arrived we should contact them for an engineer. On Wednesday morning a box containing a replacement motherboard arrived, and I duly phoned for an engineer. I was asked if an engineer on Thursday would be OK, which I accepted, and then about 10 minutes later got another phone call saying that as they already had an engineer due on site with our CS department, would it be OK if he came to us afterwards. The promise of an even quicker fix (and an engineer who we've had before and is very good) seemed reasonable, so I accepted this.

Our engineer came, as expected, on Wednesday afternoon, and since the motherboard swap this machine has been behaving fine.

Pretty good for a warranty replacement, I feel - especially as it wasn't even running Solaris, or one of Sun's "supported" Linux distros. Whenever we deal with Sun support I'm always impressed because, unlike other vendors, their support is as good as it should be, and their "response within X hours" guarantee is really worth something.


Contact: site@spod.cx