"An ongoing study on datasets of several Petabytes have shown that there can be 'silent data corruption' at rates much larger than one might naively expect from the expected error rates in RAID arrays and the expected probability of single bit uncorrected errors in hard disks," began a recent query on the Linux kernel mailing list asking where the errors might be introduced. Alan Cox replied, "its almost entirely device specific at every level." He then continued on with some general information, tracing the path of the data from the drive, through the cable and bus, into main memory and the CPU cache, as well as over the network, "once its crossing the PCI bus and main memory and CPU cache its entirely down to the system you are running what is protected and how much. Note that a lot of systems won't report ECC errors unless you ask." Alan continued:
"The next usual mess is network transfers. The TCP checksum strength is questionable for such workloads but the ethernet one is pretty good. Unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness."
Regarding the specific study in question, Alan noted, "for drivers/ide there are *lots* of problems with error handling so that might be implicated (would want to do old [versus] new ide tests on the same h/w which would be very intriguing)."
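Alan's remark about the TCP checksum can be illustrated with a small sketch. It compares the 16-bit ones'-complement sum described in RFC 1071 (the algorithm TCP uses) with CRC-32, the kind of check Ethernet frames carry. The function name and sample bytes are made up for illustration, and the real TCP checksum also covers a pseudo-header that is left out here.

```python
import struct
import zlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

original = b"\x12\x34\x56\x78\x9a\xbc\xde\xf0"
# Simulated fault: the first two 16-bit words arrive in swapped order.
swapped = original[2:4] + original[0:2] + original[4:]

# Addition is commutative, so the ones'-complement sum cannot see the swap.
same_tcp = internet_checksum(original) == internet_checksum(swapped)
# CRC-32 is order sensitive and flags the damaged payload.
same_crc = zlib.crc32(original) == zlib.crc32(swapped)
```

Here `same_tcp` ends up true and `same_crc` false: the reordered payload passes the simple sum but not the CRC.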
From: Bruce Allen [email blocked]
Subject: ECC and DMA to/from disk controllers
Date: Mon, 10 Sep 2007 07:19:35 -0500 (CDT)

Dear LKML,

Apologies in advance for potential mis-use of LKML, but I don't know where else to ask.

An ongoing study on datasets of several Petabytes have shown that there can be 'silent data corruption' at rates much larger than one might naively expect from the expected error rates in RAID arrays and the expected probability of single bit uncorrected errors in hard disks. The origin of this data corruption is still unknown. See for example

http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf

In thinking about this, I began to wonder about the following. Suppose that a (possibly RAID) disk controller correctly reads data from disk and has correct data in the controller memory and buffers. However when that data is DMA'd into system memory some errors occur (cosmic rays, electrical noise, etc). Am I correct that these errors would NOT be detected, even on a 'reliable' server with ECC memory? In other words the ECC bits would be calculated in server memory based on incorrect data from the disk.

The alternative is that disk controllers (or at least ones that are meant to be reliable) DMA both the data AND the ECC byte into system memory. So that if an error occurs in this transfer, then it would most likely be picked up and corrected by the ECC mechanism. But I don't think that 'this is how it works'. Could someone knowledgable please confirm or contradict?

Cheers,
Bruce
From: Alan Cox [email blocked]
Subject: Re: ECC and DMA to/from disk controllers
Date: Mon, 10 Sep 2007 14:54:15 +0100

> In thinking about this, I began to wonder about the following. Suppose
> that a (possibly RAID) disk controller correctly reads data from disk and
> has correct data in the controller memory and buffers. However when that
> data is DMA'd into system memory some errors occur (cosmic rays,
> electrical noise, etc). Am I correct that these errors would NOT be
> detected, even on a 'reliable' server with ECC memory? In other words the
> ECC bits would be calculated in server memory based on incorrect data from
> the disk.

Architecture specific.

> The alternative is that disk controllers (or at least ones that are meant
> to be reliable) DMA both the data AND the ECC byte into system memory.
> So that if an error occurs in this transfer, then it would most likely be
> picked up and corrected by the ECC mechanism. But I don't think that
> 'this is how it works'. Could someone knowledgable please confirm or
> contradict?

Its almost entirely device specific at every level. Some general information and comment however

- Drives normally do error correction and shouldn't be fooled very often by bad bits.
- The ECC level on the drive processors and memory cache vary by vendor. Good luck getting any information on this although maybe if you are Cern sized they will talk

After the drive we cross the cable. For SATA this is pretty good, and UDMA data transfer is CRC protected. For PATA the data is but not the command block so on PATA there is a minute chance you send the CRC protected block to the wrong place

Once its crossing the PCI bus and main memory and CPU cache its entirely down to the system you are running what is protected and how much. Note that a lot of systems won't report ECC errors unless you ask.

If you have hardware RAID controllers its all vendor specific including CPU cache etc on the card etc.

The next usual mess is network transfers. The TCP checksum strength is questionable for such workloads but the ethernet one is pretty good. Unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.

From the paper type II sounds like slab might be a candidate kernel side but also CPU bugs as near OOM we will be paging hard and any L2 cache page out/page table race from software or hardware would fit what it describes, especially the transient nature

Type III wrong block on PATA fits with the fact the block number isn't protected and also the limits on the cache quality of drives/drive firmware bugs.

For drivers/ide there are *lots* of problems with error handling so that might be implicated (would want to do old v new ide tests on the same h/w which would be very intriguing).

Stale data from disk cache I've seen reported, also offsets from FIFO hardware bugs (The LOTR render farm hit the latter and had to avoid UDMA to avoid a hardware bug)

Chunks of zero sounds like caches again, would be interesting to know what hardware changes occurred at the point they began to pop up and what software.

We also see chipset bugs under high contention some of which are explained and worked around (VIA ones in the past), others we see are clear correlations - eg between Nvidia chipsets and Silicon Image SATA controllers.
Scrubbing
I have worked on a controller embedded in a satellite. It used memory with error-correcting codes, but also scrubbing.
In the document, only type I looks like it could be a disk error. Then again, the authors say nothing about how the disk system behaves when a sector read is detected as bad: maybe zeros are returned, maybe the actual (wrong) data that was read?
We calculated a bit error rate per unit of time, so errors accumulate by themselves if there is no read-check-write cycle.
What about scrubbing a hard disk? Would something like "dd if=/dev/hda of=/dev/null bs=1M" be of any use?
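As a rough illustration of what such a read-only pass does, here is a Python sketch of the same idea as the dd command above: read the whole device in 1 MiB blocks and note where the kernel reports an I/O error. The function name `read_scrub` is hypothetical. Note that a pass like this only gives the drive's own error detection a chance to look at every sector; it rewrites nothing and cannot catch corruption that the drive does not report.

```python
import os

def read_scrub(path, block_size=1 << 20):
    """Read every block of `path`; return (bytes_read, offsets_with_errors)."""
    bytes_read = 0
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                chunk = os.pread(fd, want, offset)
            except OSError:
                # The kernel reported a read error: remember where, move on.
                bad_offsets.append(offset)
                offset += want
                continue
            if not chunk:
                break                      # unexpected end of data
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Calling `read_scrub("/dev/hda")` (as root) would be the equivalent of the dd invocation, with the added benefit of a list of failing offsets.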
ZFS (for Linux via FUSE)
ZFS (for Linux via FUSE) provides both resilvering and scrubbing.
md does too
The md driver provides both resilvering and scrubbing.
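For reference, a minimal sketch of how an md scrub is usually triggered: by writing "check" to the array's `sync_action` file in sysfs and later reading `mismatch_cnt`. The array name `md0` and both function names are assumptions for illustration; writing to sysfs requires root, and the `sysfs_root` parameter exists only so the sketch can be tried against a scratch directory.

```python
from pathlib import Path

def start_md_check(array="md0", sysfs_root="/sys/block"):
    """Ask the md driver to start a read-and-compare pass over the array."""
    md_dir = Path(sysfs_root) / array / "md"
    (md_dir / "sync_action").write_text("check\n")

def read_mismatch_count(array="md0", sysfs_root="/sys/block"):
    """Return the mismatch counter the md driver exposes for the array."""
    md_dir = Path(sysfs_root) / array / "md"
    return int((md_dir / "mismatch_cnt").read_text().strip())
```

A non-zero count after the check finishes means the redundant copies disagreed somewhere; by itself it does not say which copy was right.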
So, does it mean the ZFS
So, does it mean the ZFS guys were right? No kidding.
;-)
Actually, yes, although a
Actually, yes, although a hardware RAID controller has ECC memory. Like I said above ZFS does resilvering and scrubbing. Scrubbing costs a lot of CPU and I/O though. ECC memory is also recommended. But Alan Cox points out there are many more bottlenecks. I normally put txcsum/rxcsum off as it lowers the speed by 50% on gbit ethernet, while the stability is also bad with that on.
old story: byte shifting under heavy load
In the early days of ATA, I bought a harddrive and a cheap IDE ISA card, plugged everything in, and started reinstalling all of my software.
Somewhere in the middle of the process, I started to get strange errors and at first thought it was corruption of my backups.
More extensive triage (and some custom test script coding) revealed that under heavy loads the data blocks being received from the harddrive were sometimes being shifted by 1-2 bytes, such that the first byte or so of the disk block was dropped and random garbage was appended onto the end of the segment.
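The original test script is not shown, so the following is only a guess at how such a check might look: fill every block with a self-describing pattern, then compare what comes back against the pattern at small byte offsets. The block size, record layout, and function names are all assumptions.

```python
import struct

BLOCK_SIZE = 512

def make_block(block_no: int) -> bytes:
    """Fill a block with 8-byte records: (block number, record index)."""
    return b"".join(
        struct.pack("<II", block_no, i) for i in range(BLOCK_SIZE // 8)
    )

def find_shift(block_no: int, data: bytes, max_shift: int = 8):
    """Return 0 if intact, k if data is the block shifted left by k bytes,
    or None if the damage does not look like a simple shift."""
    expected = make_block(block_no)
    if data == expected:
        return 0
    for k in range(1, max_shift + 1):
        # A k-byte shift drops the first k bytes and leaves k bytes
        # of garbage at the end, which are ignored in the comparison.
        if data[: BLOCK_SIZE - k] == expected[k:]:
            return k
    return None

# Simulate the fault described above: two bytes lost at the front,
# two bytes of garbage appended at the end.
shifted = make_block(7)[2:] + b"\xaa\x55"
shift = find_shift(7, shifted)
```

Because every record carries its block number and index, the pattern never repeats inside a block, so a shifted block cannot be mistaken for an intact one.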
I took the card back to the shop, bought a new one that was $10 more expensive, and reran the tests with a more successful outcome.
true story: shift happens.
comments with a punchline
comments with a punchline ftw ;)
Drive Firmware
The early Quantum Bigfoot drives had a firmware bug where repeated non-continuous single-sector reads from the same sector would return the wrong data. It was quite possible to never notice if your system only did multiple-sector reads.
When I contacted Quantum about it they confirmed the problem and gave me a link to download a firmware update program.
I'm behind the times
Do any modern hard drives offer a sector size that's not a power of 2? I'd think it would be extremely useful to have something like a 512 byte + 32 byte, or 4096 byte + 256 byte sector size. (I'm picking the additional segment's size out of thin air without deep analysis.) For example, depending on the mode in which you write a CD-ROM, you have either 2048 or 2304 bytes per sector, or something like that.
The idea would be to put a strong checksum and/or ECC in that segment, as well as some form of sector identifier and perhaps some additional (theoretically redundant) metadata. This would catch two kinds of errors: Data corruption between the drive and the CPU, and command/routing corruption. It also gives you a side channel for reconstructing metadata during a fsck, if things really go wrong.
You really want to store this sort of information near the sector itself so that you don't pay a seek penalty to bring it in. Once it's in the computer, though, the sideband can be separated from the data and stored in separate buffers so that the upper I/O layers don't have to deal with it.
--
Program Intellivision and play Space Patrol!
SAS Drives that support Block Protection
There is a feature in some SAS drives and controllers called Block Protection. Basically, this involves increasing the sector size the drive uses from 512 bytes to 520 bytes. The extra 8 bytes contain a CRC value calculated for that particular 512 byte block of data. These extra 8 bytes are tacked on at the end of the block and are considered part of the data that is transferred.
Ideally, this CRC value accompanies the data from end-to-end, from the drive to the host along each path the data may take. In this way, at each point along the path, the CRC for that block of data can be re-calculated by each piece of hardware that touches it and compared to the CRC value stored in the block. This reduces the chances of silent data corruption, but only if the hardware and software that touch the data are aware of the Block Protection, and are capable of calculating the CRC of the data block and comparing it to the CRC dword that is transferred as part of that block.
Currently, I know that there are Hitachi SAS drives out there that support this, and the Intel IOP34x SAS/SATA RAID processors support this. I'm not sure what other hardware out there supports this.
SAS Drives that support Block Protection
This is also called T10 DIF or Block Guard. The 8 bytes are more than just a CRC. 8 bytes of CRC on a 512 byte block would be a big waste. 2 bytes is plenty. DIF is 2 bytes of guard (CRC), 2 bytes for an application tag, and 4 bytes for a reference tag (least significant LBA bits).
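To make the layout concrete, here is a hedged Python sketch of a 520-byte sector built as described: 512 data bytes followed by a 2-byte guard CRC, a 2-byte application tag, and a 4-byte reference tag holding the low 32 bits of the LBA. The polynomial 0x8BB7 is, to my understanding, the one T10 DIF uses; the big-endian packing and all function names are assumptions for illustration, not taken from the specification text.

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def protect(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append the 8-byte tuple (guard, application tag, reference tag)."""
    assert len(data) == 512
    guard = crc16_t10dif(data)
    return data + struct.pack(">HHI", guard, app_tag, lba & 0xFFFFFFFF)

def verify(sector: bytes, expected_lba: int) -> str:
    """Return 'ok', 'guard mismatch' or 'reference mismatch'."""
    data, tail = sector[:512], sector[512:]
    guard, _app_tag, ref = struct.unpack(">HHI", tail)
    if guard != crc16_t10dif(data):
        return "guard mismatch"        # the data bytes were altered
    if ref != expected_lba & 0xFFFFFFFF:
        return "reference mismatch"    # intact data, but from the wrong block
    return "ok"

sector = protect(bytes(range(256)) * 2, lba=1000)
flipped = bytes([sector[0] ^ 0x01]) + sector[1:]
```

The reference tag is what catches the "right data, wrong place" case: `verify(sector, 1001)` fails even though the guard CRC is fine, while `verify(flipped, 1000)` fails on the guard.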
There are some stale data scenarios that the current version of DIF doesn't help with...
Emulex has support with their 8 gig Saturn fibre channel HBAs.
Yea, tell me about it...
"clear correlations - eg between Nvidia chipsets and Silicon Image SATA
controllers."
I have an Asus MB with onboard SiI3112a SATA chipset, talking to a 451GB gmirror filesystem. What a piece of crap. I can feed music off it OK, but if I try a big copy from the RAIDed SCSI drives the SATA controller barfs every time; locks up tight. The 3112a was an early market entry but that is no excuse for such consistently terrible performance.
Shit happens
Six years ago I owned an x86 PC with a VIA Apollo Pro MB chipset. God knows why but I couldn't reliably copy a file bigger than 500MB - every single time such a transfer resulted in one or two bits randomly inverted. To solve this problem I changed HDD cables, memory modules and power supply ... all in vain.
In the end I got rid of that PC and its new owner has never complained about this issue - he was a novice user and he never had any sensitive or important data - so probably he has never hit this problem seriously. (I warned him in advance about that - don't blame me!).
with larger and denser disks
With larger and denser disks (and their aggregates like RAID arrays), the probability of an undetected read error is quite high.
But how many people actually DO ask about general 'data integrity' features? None!
First price, then capacity, then speed.
The drive manufacturers have done a good job of making people believe
that their data is safe on the platters.
The reality is that as long as you don't want to pay for it,
the manufacturers don't care about it.
Safer data on disk comes at the cost of capacity.
To most people, a filesystem is a part of a computer
which never loses/corrupts/modifies data ...
It is hard to explain to anyone that bugs and glitches are not only
present in applications, but also in filesystems and disk drives (!).
But:
- you can only detect corruption, if you check for it.
(and you must do that right of course)
- detecting corruption doesn't mean you can fix it.
- being able to fix corruption means - adding overhead
(which implicitly reduces speed as you must write
additional redundancy data ...)
- as long as this isn't happening automatically in the OS,
the dumb mass-market will not notice it.
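The difference between detecting and fixing can be shown with a toy example. The Python sketch below keeps a CRC-32 per block (detection) plus one XOR parity block across three data blocks (redundancy); the block contents and names are invented for illustration and do not represent any particular filesystem or RAID layout.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
checksums = [zlib.crc32(block) for block in data]   # detection overhead
parity = xor_blocks(data)                           # repair overhead

stored = list(data)
stored[1] = b"BBBBxBBB"                             # simulated silent corruption

# Detection: only works because we actually check.
bad = [i for i, block in enumerate(stored) if zlib.crc32(block) != checksums[i]]

# Repair: needs the redundancy, and is only possible here
# when exactly one block is known to be bad.
if len(bad) == 1:
    others = [block for i, block in enumerate(stored) if i != bad[0]]
    stored[bad[0]] = xor_blocks(others + [parity])
```

The checksums alone would have told us that block 1 is wrong but not what it should contain; the parity alone would have shown an inconsistency but not which block to distrust. Both cost extra writes, which is the overhead mentioned in the list above.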
I heard some time ago that drive vendors want to move to 4k sectors, as it would allow them to implement better error correction at a lower capacity overhead than for 512 byte sectors ... let's see - it's time IMHO.
That's why Linux needs a FS
That's why Linux needs a FS (like ZFS) which allows resilvering. Data integrity is usually worth a slight performance loss. Maybe Ext4 will have a feature like that, who knows.
or just use md
The md driver provides both resilvering and scrubbing.