Jens Axboe posted a series of ten patches that add support for large IO commands. He began by defining the problem:
"Some people complain that Linux doesn't support really large IO commands. The main reason why we do not support infinitely sized IO is that we need to allocate a scatterlist to fill these elements into for dma mapping. The Linux scatterlist is an array of scatterlist elements, so we need to allocate a contiguous piece of memory to hold them all. On i386, we can at most fit 256 scatterlist elements into a page, and on x86-64 we are stuck with 128. So that puts us somewhere between 512kb and 1024kb for a single IO."
Jens went on to explain his solution: "To get around that limitation, this patchset introduces an sg chaining concept. The way it works is that the last element of an sg table can point to a new sgtable, thus extending the size of the total IO scatterlist greatly." Regarding the current status, he noted, "It works for me, but you can't enable large commands on anything but i386 right now. I still need to go over the x86-64 iommu bits to enable it there as well."
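The chaining idea can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct sg_elem`, `sg_elem_next()` and `sg_total_bytes()` are hypothetical names invented for illustration, and the layout is far simpler than the kernel's real `struct scatterlist`. The last slot of a table holds a link to the next table instead of data, and a walker follows that link whenever it lands on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified element; NOT the kernel's struct scatterlist. */
struct sg_elem {
    unsigned int length;     /* bytes covered by this data entry */
    struct sg_elem *chain;   /* non-NULL: this slot is a link to the next table */
};

/*
 * Step to the next data entry. If the following slot is a chain link,
 * jump to the first entry of the table it points to.
 * The caller must know that another data entry exists.
 */
static struct sg_elem *sg_elem_next(struct sg_elem *sg)
{
    sg++;
    if (sg->chain)
        sg = sg->chain;
    return sg;
}

/* Sum the lengths of nents data entries, crossing chained tables. */
static size_t sg_total_bytes(struct sg_elem *sg, size_t nents)
{
    size_t total = 0;

    assert(sg != NULL || nents == 0);
    for (size_t i = 0; i < nents; i++) {
        total += sg->length;
        if (i + 1 < nents)
            sg = sg_elem_next(sg);
    }
    return total;
}
```

Because the walker is the only code that knows about links, callers that iterate through a helper like this never need to care whether the list lives in one allocation or several, which is presumably why the patch set first converts users to sg browsing helpers.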
From: Jens Axboe [email blocked]
To: linux-kernel
Subject: [PATCH 0/10] Chaining sg lists for big IO commands v2
Date: Wed, 9 May 2007 09:59:14 +0200

Hi,

Ok, got this cleaned up and split a bit. Should be more reviewable now.

A rough overview of what this does:

Some people complain that Linux doesn't support really large IO
commands. The main reason why we do not support infinitely sized IO is
that we need to allocate a scatterlist to fill these elements into for
dma mapping. The Linux scatterlist is an array of scatterlist elements,
so we need to allocate a contig piece of memory to hold them all. On
i386, we can at most fit 256 scatterlist elements into a page, and on
x86-64 we are stuck with 128. So that puts us somewhere between 512kb
and 1024kb for a single IO.

To get around that limitation, this patchset introduces an sg chaining
concept. The way it works is that the last element of an sg table can
point to a new sgtable, thus extending the size of the total IO
scatterlist greatly.

The first parts of the patch are preparatory stuff, abstracting out sg
browsing/lookup and converting libata/SCSI/block to using those. The
latter part is enabling sg chaining on i386 and SCSI (and thus libata
as well).

The patch set defaults to being safe and doesn't enable large commands,
you must actively do so yourself. If you want to test eg sda with large
commands, you would do:

# cd /sys/block/sda/queue
# echo 1024 > max_segments
# cat max_hw_sectors_kb > max_sectors_kb

which would limit you to 1024 segments (effectively 8 scatterlists
chained), and should give you IO's of at least 4mb. You can go larger
than 1024, there's no real limit.

Changes since last time:

- Hopefully get the libata atapi/pio bits fixed.
- Clear __GFP_WAIT on second (and on) rounds of scatterlist allocation.
- Cleanups/fixes/etc.

It works for me, but you can't enable large commands on anything but
i386 right now. I still need to go over the x86-64 iommu bits to enable
it there as well.
 block/ll_rw_blk.c              |   41 +++++-
 crypto/digest.c                |    2
 crypto/scatterwalk.c           |    2
 crypto/scatterwalk.h           |    2
 drivers/ata/libata-core.c      |   30 ++--
 drivers/scsi/scsi_lib.c        |  212 ++++++++++++++++++++++++---------
 drivers/scsi/scsi_tgt_lib.c    |    4
 include/asm-i386/dma-mapping.h |   13 +-
 include/asm-i386/scatterlist.h |    4
 include/linux/libata.h         |   16 +-
 include/linux/scatterlist.h    |   40 ++++++
 include/scsi/scsi.h            |    7 -
 include/scsi/scsi_cmnd.h       |    3
 13 files changed, 275 insertions(+), 101 deletions(-)
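The figures in the sysfs example above (1024 segments, 8 chained scatterlists, at least 4mb) follow from simple arithmetic. The sketch below reproduces them under two assumptions that the message does not state outright: each table holds 128 entries (which is what 1024 segments spread over 8 tables implies) and each segment covers at least one 4kb page. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of tables needed for the given segment count (rounded up).
 * Ignores the slot a chain link occupies, as "effectively 8" suggests. */
static size_t sg_tables_needed(size_t segments, size_t per_table)
{
    assert(per_table > 0);
    return (segments + per_table - 1) / per_table;
}

/* Lower bound on IO size in kb when every segment maps exactly one page. */
static size_t sg_min_io_kb(size_t segments, size_t page_kb)
{
    return segments * page_kb;
}
```

With these assumptions, `sg_tables_needed(1024, 128)` gives 8 and `sg_min_io_kb(1024, 4)` gives 4096, i.e. 4mb. The result is a lower bound because a segment can cover more than one page when the pages happen to be physically contiguous.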
From: Justin Piszcz [email blocked]
To: axboe
Subject: Chaining sg lists for big I/O commands: Question
Date: Wed, 9 May 2007 09:22:08 -0400 (EDT)

http://kerneltrap.org/node/8176

I am a mdadm/disk/hard drive fanatic, I was curious:

> On i386, we can at most fit 256 scatterlist elements into a page,
> and on x86-64 we are stuck with 128. So that puts us somewhere
> between 512kb and 1024kb for a single IO.

How come 32bit is 256 and 64 is only 128?

I am sure it is something very fundamental/simple but I was curious, I
would think x86_64 would fit/support more scatterlists in a page.

Also, when this patch is implemented for x86_64 and if merged into
mainline, what does this mean for performance?

I have an mdadm raid5 of 10 raptors and get 434MB/s write and 622MB/s
read, would I see an increase in performance with this patch?

Justin.
From: Jens Axboe [email blocked]
Subject: Re: Chaining sg lists for big I/O commands: Question
Date: Wed, 9 May 2007 15:38:30 +0200

On Wed, May 09 2007, Justin Piszcz wrote:
> http://kerneltrap.org/node/8176

Oh

> I am a mdadm/disk/hard drive fanatic, I was curious:
>
> >On i386, we can at most fit 256 scatterlist elements into a page,
> >and on x86-64 we are stuck with 128. So that puts us somewhere
> >between 512kb and 1024kb for a single IO.
>
> How come 32bit is 256 and 64 is only 128?
>
> I am sure it is something very fundamental/simple but I was curious, I
> would think x86_64 would fit/support more scatterlists in a page.

Because of the size of the scatterlist structure. As pointers are bigger
on 64-bit archs, the scatterlist structure ends up being bigger. The
page size on x86-64 is 4kb, hence the number of structures you can fit
in a page is smaller.

> Also, when this patch is implemented for x86_64 and if merged into
> mainline, what does this mean for performance?

The sglist branch of block repo has x86-64 support now. I'll post a new
patchset tomorrow. Performance wise, it's meant to help higher end
hardware that need 2-4mb (or bigger) commands to get good performance.
That also includes things like tapes that have big block sizes, getting
a command of the right size there is the difference between good and
abysmal performance.

> I have an mdadm raid5 of 10 raptors and get 434MB/s write and 622MB/s
> read, would I see an increase in performance with this patch?

Perhaps, depends on a lot of factors.

--
Jens Axboe
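Jens's answer comes down to one division: page size over element size. The sketch below works through it. The element sizes of 16 and 32 bytes are assumptions, inferred from the quoted 256 and 128 figures together with a 4kb page, and are not taken from the kernel headers; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* How many fixed-size scatterlist elements fit into a single page. */
static size_t sg_elems_per_page(size_t page_size, size_t elem_size)
{
    assert(elem_size > 0);
    return page_size / elem_size;
}

/* Largest single IO in kb if every element maps exactly one page. */
static size_t sg_max_io_kb(size_t page_size, size_t elem_size)
{
    return sg_elems_per_page(page_size, elem_size) * page_size / 1024;
}
```

Under those assumptions a 16-byte element gives 4096 / 16 = 256 entries and 1024kb per IO, while a 32-byte element gives 4096 / 32 = 128 entries and 512kb, matching the range quoted in the original announcement.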