"These patches allow data integrity information (checksum and more) to be attached to I/Os at the block/filesystem layers and transferred through the entire I/O stack all the way to the physical storage device," began Martin Petersen. He went on to explain, "the integrity metadata can be generated in close proximity to the original data. Capable host adapters, RAID arrays and physical disks can verify the data integrity and abort I/Os in case of a mismatch." He noted that support currently only exists for SCSI disks, but that work is underway to also add support for SATA drives and SCSI tapes, "with a few minor nits due to protocol limitations the proposed SATA format is identical to the SCSI". Explaining how this works, Martin continued:
"SCSI drives can usually be reformatted to 520-byte sectors, yielding 8 extra bytes per sector. These 8 bytes have traditionally been used by RAID controllers to store internal protection information. DIF (Data Integrity Field) is an extension to the SCSI Block Commands that standardizes the format of the 8 extra bytes and defines ways to interact with the contents at the protocol level. [...] When writing, the HBA (Host Bus Adapter) will DMA 512-byte sectors from host memory, generate the matching integrity metadata and send out 520-byte sectors on the wire. The disk will verify the integrity of the data before committing it to stable storage. When reading, the drive will send 520-byte sectors to the HBA. The HBA will verify the data integrity and DMA 512-byte sectors to host memory."
From: Martin K. Petersen <martin.petersen@...>
Subject: [PATCH 0 of 7] Block/SCSI Data Integrity Support
Date: Jun 7, 12:55 am 2008

Another post of my block I/O data integrity patches. This kit goes on top of the scsi_data_buffer and sd.h cleanups I posted earlier today.

There has been no changes to the block layer code since my last submission.

Within SCSI, the changes are cleanups based on comments from Christoph as well as working support for Type 3 and 4KB sectors.

What's This All About?
----------------------

These patches allow data integrity information (checksum and more) to be attached to I/Os at the block/filesystem layers and transferred through the entire I/O stack all the way to the physical storage device.

The integrity metadata can be generated in close proximity to the original data. Capable host adapters, RAID arrays and physical disks can verify the data integrity and abort I/Os in case of a mismatch.

Right now this is SCSI disk only but similar efforts are in progress for SATA and SCSI tape. With a few minor nits due to protocol limitations the proposed SATA format is identical to the SCSI ditto for easy interoperability.

T10 DIF
-------

SCSI drives can usually be reformatted to 520-byte sectors, yielding 8 extra bytes per sector. These 8 bytes have traditionally been used by RAID controllers to store internal protection information.

DIF (Data Integrity Field) is an extension to the SCSI Block Commands that standardizes the format of the 8 extra bytes and defines ways to interact with the contents at the protocol level.

Each 8-byte DIF tuple is split into three chunks:

 - a 16-bit guard tag containing a CRC of the data portion of the sector.

 - a 16-bit application tag which is up for grabs.

 - a 32-bit reference tag which contains an incrementing counter for each sector. For DIF Type 1 it also needs to match the physical LBA on the drive.

There are three types of DIF defined: Type 1, Type 2, and Type 3. These patches support Type 1 and Type 3. Type 2 depends on 32-byte CDBs and is work in progress.

Since the DIF tuple format is standardized, both initiators and targets (as well as potentially transport switches in-between) are able to verify the integrity of the data going over the bus.

When writing, the HBA will DMA 512-byte sectors from host memory, generate the matching integrity metadata and send out 520-byte sectors on the wire. The disk will verify the integrity of the data before committing it to stable storage.

When reading, the drive will send 520-byte sectors to the HBA. The HBA will verify the data integrity and DMA 512-byte sectors to host memory.

IOW, DIF provides means for added integrity protection between HBA and disk.

Data Integrity Extensions
-------------------------

In order to provide true end-to-end data integrity we need to be able to get access to the integrity metadata from the OS. Dealing with 520-byte sectors is quite inconvenient so we have worked with HBA manufacturers to separate the data buffer scatter-gather from the integrity metadata scatter-gather.

Also, the CRC16 is somewhat expensive to calculate in software. So we have also allowed alternate checksums to be used. Currently we support the IP checksum which is fast and cheap to calculate.

When writing, the HBA will DMA two scatterlists from host memory: One containing the data as usual, and one containing the integrity metadata. The HBA will verify that the two are in agreement and interleave them before sending them out on the wire as 520-byte sectors.

When reading, the disk will return 520-byte sectors, the HBA will verify the integrity, split the integrity metadata from the data, and DMA to the two separate scatterlists in host memory.

SCSI Layer Changes
------------------

At the SCSI level, there are a few changes required to support this:

 - an extra scatterlist for the integrity metadata

 - tweaks to sd.c to detect and handle disks formatted with DIF

 - sd.c must issue the right READ/WRITE commands when a disk is formatted with DIF

 - extra fields in scsi_host to signal the HBA driver's DIF capabilities

Block Layer Changes
-------------------

The main idea of DIF/DIX is to allow integrity metadata to be generated as close to the original data as possible. So in the long run we'd like this to happen in userland. Given mmap(), direct I/O, etc. this obviously poses some challenges. *cough*

For now the integrity metadata is generated at the block layer when an I/O is submitted by the filesystem. There are also functions that allow filesystems to generate the integrity metadata earlier, and to use the application tag to mark sectors for recovery purposes.

struct bio has been extended with a pointer to a struct bip which in turn contains the integrity metadata. The bip is essentially a trimmed down bio with a bio_vec and some housekeeping.

There are a few hooks inserted in fs/bio.c and block/blk-* to allow integrity metadata to be handled correctly when splitting, cloning and merging. Aside from that, the integrity stuff is completely opaque.

Because we don't want the block layer, filesystems, etc. to know about DIF and tuple formats, all the functions that interact with the integrity metadata reside in the SCSI layer and are registered via a callback handler template.

The block layer changes have been made so that the upcoming standards for data integrity on SATA (T13 External Path Protection) and SCSI tape will fit right in and can register their own handlers.

I have included a more in-depth description of the block layer changes in Documentation/block/data-integrity.txt.

Comments and suggestions welcome.

-- 
Martin K. Petersen Oracle Linux Engineering
--
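The "Data Integrity Extensions" part of the message above can be sketched in user-space C as follows. This is only an illustration, under several assumptions: the two scatterlists are modelled as flat buffers, the function names are hypothetical, the IP checksum is computed RFC 1071 style over each 512-byte sector, and the guard is assumed to occupy the first two bytes of each tuple in big-endian order. What a real HBA does to the guard tag before the sector reaches the disk is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512 /* data bytes per sector */
#define TUPLE_SIZE  8   /* integrity bytes per sector */
#define WIRE_SIZE   (SECTOR_SIZE + TUPLE_SIZE)

/* 16-bit one's-complement checksum in the style of RFC 1071. */
uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
        sum = (sum & 0xffff) + (sum >> 16); /* fold the carry back in */
    }
    if (len & 1) {
        sum += (uint32_t)buf[len - 1] << 8; /* pad the odd byte with zero */
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/* Write path: check that each tuple's guard agrees with its data sector,
 * then interleave both buffers into 520-byte wire sectors.
 * Returns 0 on success, -1 at the first mismatch. */
int dix_write(const uint8_t *data, const uint8_t *prot, size_t nsect,
              uint8_t *wire)
{
    assert(data != NULL && prot != NULL && wire != NULL);

    for (size_t i = 0; i < nsect; i++) {
        const uint8_t *d = data + i * SECTOR_SIZE;
        const uint8_t *p = prot + i * TUPLE_SIZE;
        uint8_t *w = wire + i * WIRE_SIZE;
        uint16_t guard = (uint16_t)((p[0] << 8) | p[1]);

        if (guard != ip_checksum(d, SECTOR_SIZE))
            return -1;
        memcpy(w, d, SECTOR_SIZE);
        memcpy(w + SECTOR_SIZE, p, TUPLE_SIZE);
    }
    return 0;
}

/* Read path: verify each 520-byte wire sector, then split it back into
 * the separate data and integrity buffers.
 * Returns 0 on success, -1 at the first mismatch. */
int dix_read(const uint8_t *wire, size_t nsect, uint8_t *data, uint8_t *prot)
{
    assert(wire != NULL && data != NULL && prot != NULL);

    for (size_t i = 0; i < nsect; i++) {
        const uint8_t *w = wire + i * WIRE_SIZE;
        const uint8_t *t = w + SECTOR_SIZE;
        uint16_t guard = (uint16_t)((t[0] << 8) | t[1]);

        if (guard != ip_checksum(w, SECTOR_SIZE))
            return -1;
        memcpy(data + i * SECTOR_SIZE, w, SECTOR_SIZE);
        memcpy(prot + i * TUPLE_SIZE, t, TUPLE_SIZE);
    }
    return 0;
}
```

Keeping the data and the tuples in separate buffers means the operating system never has to handle 520-byte sectors, which is the inconvenience the message says the split scatterlists are meant to avoid.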
From: Jeff Moyer <jmoyer@...>
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support
Date: Jun 10, 10:41 am 2008

"Martin K. Petersen" <martin.petersen@oracle.com> writes:

> Another post of my block I/O data integrity patches. This kit goes on
> top of the scsi_data_buffer and sd.h cleanups I posted earlier today.

Pointers to archives would have been appreciated. I can't, for the life of me, find these.

> There has been no changes to the block layer code since my last
> submission.
>
> Within SCSI, the changes are cleanups based on comments from Christoph
> as well as working support for Type 3 and 4KB sectors.

Thanks for all of the great documentation. It would be good to include some instructions on how one would test this, and what testing you performed.

Also, please use the '-p' switch to diff, as it makes reviewing patches much easier.

I set out to try your changes, but ran into some problems. First, this patch set didn't apply cleanly to a git checkout. So, I grabbed your mercurial repository, but got a build failure:

block/blk-core.c: In function 'generic_make_request':
include/linux/bio.h:469: sorry, unimplemented: inlining failed in call to 'bio_integrity_enabled': function body not available
block/blk-core.c:1388: sorry, unimplemented: called from here
make[1]: *** [block/blk-core.o] Error 1
make: *** [block] Error 2

Cheers,
Jeff
--
From: Martin K. Petersen <martin.petersen@...>
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support
Date: Jun 10, 11:28 am 2008

>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes:

Jeff> "Martin K. Petersen" <martin.petersen@oracle.com> writes:
>> Another post of my block I/O data integrity patches. This kit goes
>> on top of the scsi_data_buffer and sd.h cleanups I posted earlier
>> today.

Jeff> Pointers to archives would have been appreciated. I can't, for
Jeff> the life of me, find these.

http://marc.info/?l=linux-scsi&m=121272302931588&w=2
http://marc.info/?l=linux-scsi&m=121278031605941&w=2
http://marc.info/?l=linux-scsi&m=121302438515260&w=2
http://marc.info/?l=linux-scsi&m=121278067906564&w=2

Jeff> Thanks for all of the great documentation. It would be good to
Jeff> include some instructions on how one would test this, and what
Jeff> testing you performed.

modprobe scsi_debug dix=199 dif=1 guard=1 dev_size_mb=1024 num_parts=1

I'm testing with XFS and btrfs. Generally doing kernel builds, etc. ext2/3 are still problematic because they modify pages in flight.

Jeff> I set out to try your changes, but ran into some problems.
Jeff> First, this patch set didn't apply cleanly to a git checkout.

I generally track Linus closely so it must be because of the patches you were missing. You can grab my patch stack here. It's always in sync with the hg repo:

http://oss.oracle.com/~mkp/patches/

Jeff> block/blk-core.c: In function 'generic_make_request':
Jeff> include/linux/bio.h:469: sorry, unimplemented: inlining failed
Jeff> in call to 'bio_integrity_enabled': function body not available
Jeff> block/blk-core.c:1388: sorry, unimplemented: called from here
Jeff> make[1]: *** [block/blk-core.o] Error 1 make: *** [block] Error
Jeff> 2

Odd. Which compiler are you using? Compiles just fine for me on both EL5 and FC9.

Judging from the error I'm guessing it's objecting to the inlining. Tried to work around it. Please pull, update and let me know whether that did the trick.

-- 
Martin K. Petersen Oracle Linux Engineering
--
From: Jeff Moyer <jmoyer@...>
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support
Date: Jun 10, 2:49 pm 2008

"Martin K. Petersen" <martin.petersen@oracle.com> writes:

>>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes:
> Jeff> Thanks for all of the great documentation. It would be good to
> Jeff> include some instructions on how one would test this, and what
> Jeff> testing you performed.
>
> modprobe scsi_debug dix=199 dif=1 guard=1 dev_size_mb=1024 num_parts=1
>
> I'm testing with XFS and btrfs. Generally doing kernel builds, etc.
> ext2/3 are still problematic because they modify pages in flight.

So, is it safe to say that the library routines for integrity-aware file systems have not been tested at all? Specifically, I'm talking about:

bio_integrity_tag_size
bio_integrity_set_tag
bio_integrity_get_tag

> Jeff> block/blk-core.c: In function 'generic_make_request':
> Jeff> include/linux/bio.h:469: sorry, unimplemented: inlining failed
> Jeff> in call to 'bio_integrity_enabled': function body not available
> Jeff> block/blk-core.c:1388: sorry, unimplemented: called from here
> Jeff> make[1]: *** [block/blk-core.o] Error 1 make: *** [block] Error
> Jeff> 2
>
> Odd. Which compiler are you using? Compiles just fine for me on both
> EL5 and FC9.

gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-41)

> Judging from the error I'm guessing it's objecting to the inlining.
> Tried to work around it. Please pull, update and let me know whether
> that did the trick.

I did a new clone (just to be sure I got your change) and I get the same problem. I also can't see the changeset in the log, so are you sure you pushed it?

I got rid of the inline in the definition in bio.h. The .c file didn't define the function as inline, so I didn't have to change it. It seems to be building now.

Cheers,
Jeff
--
From: Martin K. Petersen <martin.petersen@...>
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support
Date: Jun 10, 4:47 pm 2008

>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes:

Jeff> So, is it safe to say that the library routines for
Jeff> integrity-aware file systems have not been tested at all?
Jeff> Specifically, I'm talking about: bio_integrity_tag_size
Jeff> bio_integrity_set_tag bio_integrity_get_tag

I have not tried using them from within a filesystem, if that's what you mean. But I have attached random strings to bios and read them back later.

Jeff> gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-41)

Ok, I'm trying to chase down a 5.2 box to figure out what the problem is. Maybe I'll just move that function to the header file.

Jeff> I did a new clone (just to be sure I got your change) and I get
Jeff> the same problem. I also can't see the changeset in the log, so
Jeff> are you sure you pushed it?

Yup, it's there.

Jeff> I got rid of the inline in the definition in bio.h. The .c file
Jeff> didn't define the function as inline, so I didn't have to change
Jeff> it. It seems to be building now.

The problem is that your gcc is unhappy about the fact that the inlined function is defined elsewhere. The gcc info page said only declare it inline in the header and not the declaration. The change I pushed removed inline from the .c file. But that didn't help.

-- 
Martin K. Petersen Oracle Linux Engineering
--
From: Jeff Moyer <jmoyer@...>
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support
Date: Jun 10, 4:53 pm 2008

"Martin K. Petersen" <martin.petersen@oracle.com> writes:

>>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes:
>
> Jeff> So, is it safe to say that the library routines for
> Jeff> integrity-aware file systems have not been tested at all?
> Jeff> Specifically, I'm talking about: bio_integrity_tag_size
> Jeff> bio_integrity_set_tag bio_integrity_get_tag
>
> I have not tried using them from within a filesystem, if that's what
> you mean. But I have attached random strings to bios and read them
> back later.

I was checking to see if you had exercised the code at all, and you have. Great!

Cheers,
Jeff
--
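The build failure discussed in the thread comes down to where an inline function's body lives. The sketch below illustrates the option Martin floats ("move that function to the header file"). It assumes the kernel build attaches GCC's `always_inline` attribute to functions marked `inline`, which would explain the "function body not available" wording. `struct demo_bio`, `DEMO_BIO_INTEGRITY` and `demo_integrity_enabled()` are hypothetical stand-ins, not the real `struct bio` or `bio_integrity_enabled()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio; not the kernel's definition. */
struct demo_bio {
    unsigned int flags;
};
#define DEMO_BIO_INTEGRITY 0x1u

/*
 * Pattern that can trigger the error (kept in a comment on purpose):
 *
 *   header:  inline __attribute__((always_inline))
 *            bool demo_integrity_enabled(const struct demo_bio *bio);
 *   .c file: bool demo_integrity_enabled(const struct demo_bio *bio) { ... }
 *
 * A caller in another .c file sees only the declaration, so the compiler
 * is asked to always inline a function whose body it cannot see.
 */

/*
 * Alternative: put the whole body in the header as static inline, so
 * every translation unit that includes the header can inline it.
 * __attribute__((always_inline)) is a GCC/Clang extension.
 */
static inline __attribute__((always_inline))
bool demo_integrity_enabled(const struct demo_bio *bio)
{
    assert(bio != NULL);
    return (bio->flags & DEMO_BIO_INTEGRITY) != 0;
}
```

Jeff's workaround, dropping `inline` from the header declaration, goes the other way: the function stays out of line in the .c file and callers make an ordinary call.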

Thanks for the good read.
I run only SCSI myself and have 5 drives: Fujitsu, Maxtor & Seagate.
None of them has ever failed, unlike my old IDE/ATA drives.
I think with the large-capacity drives these days, cramming as much space as possible onto each platter, the room for error is very high.
Data Integrity is very important.
thanks
thanks for good text
good info
like zfs checksums?
How might this relate to the software implementation of zfs checksums? I believe zfs calculates a checksum for every write and maybe even on read? Might this be more robust?
RAID-Z
KernelTrap had an article about this earlier:
See the section of that article titled "End-to-end data integrity" for more details.