I've been suffering from a lot of SCSI Sense errors under FreeBSD 9.0-STABLE with a recent ZFS-based SAN build, and they have been driving me mad.
They mostly show up when ZFS does a wide scan of the available drives - via a 'zpool import' or 'zpool scan' or other such operation.
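(For reference, this is roughly how I catch them - nothing fancy, and the log path assumes a stock syslog.conf: watch the messages log in one shell and kick off a scan in another.)

# In one shell: watch for sense errors as they hit the log
tail -f /var/log/messages | grep -i 'SCSI sense'
# In another shell: force a wide scan of the attached devices for importable pools
zpool import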
My drives are Seagate ST1000DM003 SATA3 1TB drives, contained in a SuperMicro SC847 chassis (LSI SAS2x36 SAS2 expander, so the whole thing can run at 6 Gbps).
These errors also show up after a few hours of running a custom script to saturate writes to my ZFS pool: SCSI Sense errors build until we end up in a flurry of errors on the console and a hang of the SAS expanders, sometimes taking ZFS down with them.
Mar 7 13:50:06 Test kernel: (da33:mps1:0:27:0): CAM status: SCSI Status Error
Mar 7 13:50:06 Test kernel: (da33:mps1:0:27:0): SCSI status: Check Condition
Mar 7 13:50:06 Test kernel: (da33:mps1:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Mar 7 13:50:11 Test kernel: (da1:mps1:0:11:0): WRITE(10). CDB: 2a 0 0 6e 45 bf 0 1 0 0 length 131072 SMID 303 terminated ioc 804b s
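The saturation script itself isn't anything clever - at its core it's just a pile of parallel sequential writers hammering the pool. A minimal sketch (the pool name "tank" and the sizes here are placeholders, not my actual setup):

#!/bin/sh
# Crude write-saturation sketch: 16 parallel dd streams into the pool.
# Assumes the dataset/directory /tank/stress already exists.
for i in $(jot 16); do
    dd if=/dev/zero of=/tank/stress/file$i bs=1m count=16384 &
done
wait    # let all 16 writers finish before exiting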
Doing a bit of research, I found that I'm not the only one with SCSI Sense error problems, and naturally I started suspecting the recent mps driver committed by LSI, or some sort of firmware interaction issue with it (see my recent post on the ixgbe driver issue with LACP).
I've used the exact same hardware in the past for FreeBSD 9.0-BETA builds without this trouble, so I was sure it was software- or expander-related.
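(A quick way to see which firmware the mps driver is actually talking to is the boot log - the driver reports both its own version and the controller firmware. The exact wording of the message varies a bit between driver revisions.)

# Driver and controller firmware versions as reported at boot
grep -i mps /var/run/dmesg.boot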
My HBAs are mostly Supermicro AOC-USAS2-L8i internal SAS2 cards for my ZFS builds. It's a good-value card that gives me what I want at a decent price.
The firmware on these cards was v7.21, and I knew Supermicro had newer BIOS and firmware, so I flashed them up to BIOS v11 and FW v7.23. The documentation with the flash files lists a lot of FW and BIOS fixes, so it seemed worth a try - maybe the new mps driver made better use of the card and needed newer firmware.
However, after the upgrade my problems continued, although the system did seem to boot a little faster.
I then applied the LSI firmware (v7.23) and BIOS (v12) that you can find on the LSI support site for the 9211-8i card (the same controller as Supermicro's). The flashing process was the same, although the Supermicro download was a bit more automated with its batch file.
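For anyone repeating this, the flash boils down to a couple of sas2flash invocations along these lines (the image file names here are the usual ones from the 9211-8i IT-mode package - check your own download, and note that -o puts the tool in advanced mode):

# Show attached controllers and their current FW/BIOS versions
sas2flash -listall
# Flash controller 0 with new firmware and option ROM (IT-mode image shown)
sas2flash -o -c 0 -f 2118it.bin -b mptsas2.rom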
Same problems.
I then started testing one item at a time, ending up with replacing my brand-new Amphenol SAS2 external Mini-SAS cables with some older ones I had in use on a different server.
And it fixed it!
Damn! I then confirmed, by swapping back and forth between the new and old cables, that the two new cables I had purchased were causing all of this. Very frustrating, as they are high-quality, expensive Amphenol cables, not no-name thin OEM thingies.
What was more puzzling was that a nearly identical hardware array was using cables from the same order as the bad ones, and it wasn't having issues.
However, that in-use array is running older 1.5 Gbps SATA drives, not 6 Gbps SATA3. When I checked its logs carefully, I saw we were running into a few SCSI Sense errors there too, but nothing was really wrong with the servers, and ZFS was generally happy.
I'm fairly confident that if I put that array under the strain of my saturation script, it would throw errors as well, though perhaps not as quickly as the 6 Gbps array does.
The Amphenol supplier has been most kind and is offering to RMA the four cables for me. I'll get a new batch of four and will check them out carefully under heavy load before I deploy them.
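My plan for checking the replacement batch is nothing fancy: run the same saturation load for a few hours, then see what, if anything, leaked into the logs (the awk field below matches the message format shown above).

# Tally sense errors per device after a few hours of load
grep -i 'SCSI sense' /var/log/messages | awk '{print $6}' | sort | uniq -c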
It makes me wonder what other cables I have that will fail under heavy load. Data speeds are increasing quickly, and I'm sure it puts a strain on cable manufacturers to keep up with the latest cable specs, connectors, and manufacturing techniques and to ensure that what they ship is reliable at full speed - not to mention having to buy new cable-testing gear.
Thanks for this input. I have the same problem at the moment with my ZFS setup in FreeNAS; now I know where to look. I'm using 3 x 3ware multilane cables for this setup - that might be the problem.
I've run into this problem in a few other situations now, each OS/Backplane/HBA handles it differently.
I think as 6G becomes the common SAS speed, this will disappear.. but not before it wastes a lot of IT time. :-)
Glad I could help a few people.