Message ID | 1432318723-18829-2-git-send-email-mlin@kernel.org (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Mike Snitzer |
On Fri, May 22 2015 at 2:18pm -0400, Ming Lin <mlin@kernel.org> wrote: > From: Kent Overstreet <kent.overstreet@gmail.com> > > The way the block layer is currently written, it goes to great lengths > to avoid having to split bios; upper layer code (such as bio_add_page()) > checks what the underlying device can handle and tries to always create > bios that don't need to be split. > > But this approach becomes unwieldy and eventually breaks down with > stacked devices and devices with dynamic limits, and it adds a lot of > complexity. If the block layer could split bios as needed, we could > eliminate a lot of complexity elsewhere - particularly in stacked > drivers. Code that creates bios can then create whatever size bios are > convenient, and more importantly stacked drivers don't have to deal with > both their own bio size limitations and the limitations of the > (potentially multiple) devices underneath them. In the future this will > let us delete merge_bvec_fn and a bunch of other code. This series doesn't take any steps to train upper layers (e.g. filesystems) to size their bios larger (which is defined as "whatever size bios are convenient" above). bio_add_page(), and merge_bvec_fn, served as the means for upper layers (and direct IO) to build up optimally sized bios. Without a replacement (that I can see anyway), how is this patchset making forward progress (getting Acks, etc)!? I like the idea of the reduced complexity associated with these late bio splitting changes; I'm just not seeing how this is ready given that there are no upper layer changes that speak to building larger bios. What am I missing? Please advise, thanks! Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
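For context on the mechanism Mike is describing: under the current scheme, bio_add_page() consults the queue limits and, for stacked devices, the queue's merge_bvec_fn, and refuses to grow a bio past what the device stack will accept; that is how upper layers end up with "optimally sized" bios without ever splitting. A simplified sketch of that gatekeeping, assuming the pre-4.3-era interfaces (struct bvec_merge_data and q->merge_bvec_fn); this is an illustration, not the actual bio_add_page() code:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Illustrative only: how adding a page to a bio gets capped by the queue. */
static int sketch_can_add_page(struct request_queue *q, struct bio *bio,
			       struct page *page, unsigned int len,
			       unsigned int offset)
{
	/* Hard queue limit: never exceed max_sectors. */
	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_sectors(q))
		return 0;

	/* Stacked drivers (DM, MD, ...) can veto via their merge_bvec_fn. */
	if (q->merge_bvec_fn) {
		struct bvec_merge_data bvm = {
			.bi_bdev   = bio->bi_bdev,
			.bi_sector = bio->bi_iter.bi_sector,
			.bi_size   = bio->bi_iter.bi_size,
			.bi_rw     = bio->bi_rw,
		};
		struct bio_vec bv = {
			.bv_page   = page,
			.bv_len    = len,
			.bv_offset = offset,
		};

		/* The callback reports how many bytes it will accept. */
		if (q->merge_bvec_fn(q, &bvm, &bv) < len)
			return 0;	/* caller must start a new bio */
	}
	return len;
}
```

The series under discussion removes exactly this second check, which is why Mike asks what replaces it.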
On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote: > On Fri, May 22 2015 at 2:18pm -0400, > Ming Lin <mlin@kernel.org> wrote: > >> From: Kent Overstreet <kent.overstreet@gmail.com> >> >> The way the block layer is currently written, it goes to great lengths >> to avoid having to split bios; upper layer code (such as bio_add_page()) >> checks what the underlying device can handle and tries to always create >> bios that don't need to be split. >> >> But this approach becomes unwieldy and eventually breaks down with >> stacked devices and devices with dynamic limits, and it adds a lot of >> complexity. If the block layer could split bios as needed, we could >> eliminate a lot of complexity elsewhere - particularly in stacked >> drivers. Code that creates bios can then create whatever size bios are >> convenient, and more importantly stacked drivers don't have to deal with >> both their own bio size limitations and the limitations of the >> (potentially multiple) devices underneath them. In the future this will >> let us delete merge_bvec_fn and a bunch of other code. > > This series doesn't take any steps to train upper layers > (e.g. filesystems) to size their bios larger (which is defined as > "whatever size bios are convenient" above). > > bio_add_page(), and merge_bvec_fn, served as the means for upper layers > (and direct IO) to build up optimally sized bios. Without a replacement > (that I can see anyway) how is this patchset making forward progress > (getting Acks, etc)!? > > I like the idea of reduced complexity associated with these late bio > splitting changes I'm just not seeing how this is ready given there are > no upper layer changes that speak to building larger bios.. > > What am I missing? See: [PATCH v4 02/11] block: simplify bio_add_page() https://lkml.org/lkml/2015/5/22/754 Now bio_add_page() can build larger bios. And blk_queue_split() can split the bios in ->make_request() if needed. Thanks. > > Please advise, thanks! > Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
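To make Ming's point concrete: with the series applied, a driver (or stacked driver) calls the new blk_queue_split() helper at the top of its request entry point and from then on only sees bios it can handle. A minimal sketch of such an entry point, assuming the helper takes the queue, a bio pointer and a bio_set, and that the series adds a per-queue bio_set (q->bio_split) for this purpose; the final interface may differ from the posted patches:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical driver entry point; names are illustrative only. */
static void example_make_request(struct request_queue *q, struct bio *bio)
{
	/*
	 * If 'bio' exceeds the queue limits, split it: on return 'bio'
	 * points at a front piece the driver can handle directly and
	 * the remainder has been resubmitted for later processing.
	 */
	blk_queue_split(q, &bio, q->bio_split);

	/* ... normal per-bio handling of the (possibly split) bio ... */
}
```

The point of contention in the rest of the thread is whether doing this late, rather than never building an oversized bio in the first place, costs anything measurable.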
On Tue, May 26, 2015 at 08:02:08AM -0700, Ming Lin wrote: > Now bio_add_page() can build lager bios. > And blk_queue_split() can split the bios in ->make_request() if needed. But why not try to make the bio the right size in the first place so you don't have to incur the performance impact of splitting? What performance testing have you yet done to demonstrate the *actual* impact of this patchset in situations where merge_bvec_fn is currently a net benefit? Alasdair -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, May 26 2015 at 11:02am -0400, Ming Lin <mlin@kernel.org> wrote: > On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote: > > On Fri, May 22 2015 at 2:18pm -0400, > > Ming Lin <mlin@kernel.org> wrote: > > > >> From: Kent Overstreet <kent.overstreet@gmail.com> > >> > >> The way the block layer is currently written, it goes to great lengths > >> to avoid having to split bios; upper layer code (such as bio_add_page()) > >> checks what the underlying device can handle and tries to always create > >> bios that don't need to be split. > >> > >> But this approach becomes unwieldy and eventually breaks down with > >> stacked devices and devices with dynamic limits, and it adds a lot of > >> complexity. If the block layer could split bios as needed, we could > >> eliminate a lot of complexity elsewhere - particularly in stacked > >> drivers. Code that creates bios can then create whatever size bios are > >> convenient, and more importantly stacked drivers don't have to deal with > >> both their own bio size limitations and the limitations of the > >> (potentially multiple) devices underneath them. In the future this will > >> let us delete merge_bvec_fn and a bunch of other code. > > > > This series doesn't take any steps to train upper layers > > (e.g. filesystems) to size their bios larger (which is defined as > > "whatever size bios are convenient" above). > > > > bio_add_page(), and merge_bvec_fn, served as the means for upper layers > > (and direct IO) to build up optimally sized bios. Without a replacement > > (that I can see anyway) how is this patchset making forward progress > > (getting Acks, etc)!? > > > > I like the idea of reduced complexity associated with these late bio > > splitting changes I'm just not seeing how this is ready given there are > > no upper layer changes that speak to building larger bios.. > > > > What am I missing? > > See: [PATCH v4 02/11] block: simplify bio_add_page() > https://lkml.org/lkml/2015/5/22/754 > > Now bio_add_page() can build lager bios. > And blk_queue_split() can split the bios in ->make_request() if needed. That'll result in quite large bios and always needing splitting. As Alasdair asked: please provide some performance data that justifies these changes. E.g. use a setup like: XFS on a DM striped target. We can iterate on more complex setups once we have established some basic tests. If you're just punting to reviewers to do the testing for you, that isn't going to instill _any_ confidence in me for this patchset as a suitable replacement relative to performance. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, May 26, 2015 at 9:04 AM, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, May 26 2015 at 11:02am -0400, > Ming Lin <mlin@kernel.org> wrote: > >> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote: >> > On Fri, May 22 2015 at 2:18pm -0400, >> > Ming Lin <mlin@kernel.org> wrote: >> > >> >> From: Kent Overstreet <kent.overstreet@gmail.com> >> >> >> >> The way the block layer is currently written, it goes to great lengths >> >> to avoid having to split bios; upper layer code (such as bio_add_page()) >> >> checks what the underlying device can handle and tries to always create >> >> bios that don't need to be split. >> >> >> >> But this approach becomes unwieldy and eventually breaks down with >> >> stacked devices and devices with dynamic limits, and it adds a lot of >> >> complexity. If the block layer could split bios as needed, we could >> >> eliminate a lot of complexity elsewhere - particularly in stacked >> >> drivers. Code that creates bios can then create whatever size bios are >> >> convenient, and more importantly stacked drivers don't have to deal with >> >> both their own bio size limitations and the limitations of the >> >> (potentially multiple) devices underneath them. In the future this will >> >> let us delete merge_bvec_fn and a bunch of other code. >> > >> > This series doesn't take any steps to train upper layers >> > (e.g. filesystems) to size their bios larger (which is defined as >> > "whatever size bios are convenient" above). >> > >> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers >> > (and direct IO) to build up optimally sized bios. Without a replacement >> > (that I can see anyway) how is this patchset making forward progress >> > (getting Acks, etc)!? >> > >> > I like the idea of reduced complexity associated with these late bio >> > splitting changes I'm just not seeing how this is ready given there are >> > no upper layer changes that speak to building larger bios.. >> > >> > What am I missing? >> >> See: [PATCH v4 02/11] block: simplify bio_add_page() >> https://lkml.org/lkml/2015/5/22/754 >> >> Now bio_add_page() can build lager bios. >> And blk_queue_split() can split the bios in ->make_request() if needed. > > That'll result in quite large bios and always needing splitting. > > As Alasdair asked: please provide some performance data that justifies > these changes. E.g use a setup like: XFS on a DM striped target. We > can iterate on more complex setups once we have established some basic > tests. I'll test XFS on DM and also what Christoph suggested: https://lkml.org/lkml/2015/5/25/226 > > If you're just punting to reviewers to do the testing for you that isn't > going to instill _any_ confidence in me for this patchset as a suitabe > replacement relative to performance. Kent's Direct IO rewrite patch depends on this series. https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-dio-rewrite I did test the dio patch on a 2 sockets(48 logical CPUs) server and saw 40% improvement with 48 null_blks. Here is the fio data of 4k read. 4.1-rc2 ---------- Test 1: bw=50509MB/s, iops=12930K Test 2: bw=49745MB/s, iops=12735K Test 3: bw=50297MB/s, iops=12876K, Average: bw=50183MB/s, iops=12847K 4.1-rc2-dio-rewrite ------------------------ Test 1: bw=70269MB/s, iops=17989K Test 2: bw=70097MB/s, iops=17945K Test 3: bw=70907MB/s, iops=18152K Average: bw=70424MB/s, iops=18028K -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, 26 May 2015 16:34:14 +0100 Alasdair G Kergon <agk@redhat.com> wrote: > On Tue, May 26, 2015 at 08:02:08AM -0700, Ming Lin wrote: > > Now bio_add_page() can build lager bios. > > And blk_queue_split() can split the bios in ->make_request() if needed. > > But why not try to make the bio the right size in the first place so you > don't have to incur the performance impact of splitting? Because we don't know what the "right" size is. And the "right" size can change when array reconfiguration happens. Splitting has to happen somewhere, if only in bio_addpage where it decides to create a new bio rather than add another page to the current one. So moving the split to a different level of the stack shouldn't necessarily change the performance profile. Obviously testing is important to confirm that. NeilBrown > > What performance testing have you yet done to demonstrate the *actual* impact > of this patchset in situations where merge_bvec_fn is currently a net benefit? > > Alasdair > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
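Neil's observation is that, even today, an upper layer effectively "splits" at build time the moment bio_add_page() refuses to grow the bio any further. The usual pattern looks roughly like the sketch below (simplified: real callers also set a completion callback, handle allocation failure, and use the two-argument submit_bio() of that era):

```c
#include <linux/kernel.h>
#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch of the classic upper-layer pattern: keep adding pages until
 * bio_add_page() refuses, then submit what we have and start a new bio.
 * The "split point" is wherever the device limits make bio_add_page()
 * return less than the requested length.
 */
static void sketch_submit_pages(struct block_device *bdev, sector_t sector,
				struct page **pages, unsigned int nr_pages)
{
	struct bio *bio = NULL;
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
retry:
		if (!bio) {
			bio = bio_alloc(GFP_NOIO,
					min_t(unsigned int, nr_pages - i,
					      BIO_MAX_PAGES));
			bio->bi_bdev = bdev;
			bio->bi_iter.bi_sector =
				sector + (i << (PAGE_SHIFT - 9));
		}
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE) {
			/* Device limit hit: submit and start a fresh bio. */
			submit_bio(READ, bio);
			bio = NULL;
			goto retry;
		}
	}
	if (bio)
		submit_bio(READ, bio);
}
```

Moving the split into the block layer keeps this loop but lets bio_add_page() say yes far more often; the cost of the split (if any) is then paid later, which is what the performance questions in this thread are about.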
On Wed, May 27, 2015 at 09:06:40AM +1000, Neil Brown wrote: > Because we don't know what the "right" size is. And the "right" size can > change when array reconfiguration happens. In certain configurations today, device-mapper does report back a sensible maximum bio size smaller than would otherwise be used and thereby avoids retrospective splitting. (In tests, the overhead of the duplicate calculation was found to be negligible so we never restructured the code to optimise it away.) > Splitting has to happen somewhere, if only in bio_addpage where it decides to > create a new bio rather than add another page to the current one. So moving > the split to a different level of the stack shouldn't necessarily change the > performance profile. It does sometimes make a significant difference to device-mapper stacks. DM only uses it for performance reasons - it can already split bios when it needs to. I tried to remove merge_bvec_fn from DM several years ago but couldn't because of the adverse performance impact of lots of splitting activity. The overall cost of splitting ought to be less in many (but not necessarily all) cases now as a result of all these patches, so exactly where the best balance lies now needs to be reassessed empirically. It is hard to reach conclusions theoretically because of the complex interplay between the various factors at different levels. Alasdair -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
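Alasdair's point is that DM uses merge_bvec_fn purely as a performance hint: the target reports how many more bytes a bio may grow before it would cross a boundary the target would otherwise have to split on. A simplified, illustrative callback for a striped device is sketched below; this is not the real dm-stripe/dm_merge_bvec code, and it assumes the pre-4.3 merge_bvec_fn prototype and a power-of-two chunk size:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

#define EXAMPLE_CHUNK_SECTORS 128	/* 64K chunks, illustrative only */

/*
 * Allow adding 'bvec' only while the bio still fits inside the chunk that
 * bvm->bi_sector falls into.  Returning less than bvec->bv_len tells
 * bio_add_page() to stop growing this bio, so it never needs splitting.
 */
static int example_stripe_merge_bvec(struct request_queue *q,
				     struct bvec_merge_data *bvm,
				     struct bio_vec *bvec)
{
	sector_t offset_in_chunk = bvm->bi_sector & (EXAMPLE_CHUNK_SECTORS - 1);
	unsigned int bytes_left_in_chunk =
		(EXAMPLE_CHUNK_SECTORS - offset_in_chunk) << 9;

	if (bvm->bi_size + bvec->bv_len > bytes_left_in_chunk)
		return bytes_left_in_chunk > bvm->bi_size ?
			bytes_left_in_chunk - bvm->bi_size : 0;

	return bvec->bv_len;
}
```

The open question Alasdair raises is whether, with cheap late splitting, keeping bios inside such boundaries up front still matters in practice.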
On Wed, May 27, 2015 at 01:40:22AM +0100, Alasdair G Kergon wrote: > It does sometimes make a significant difference to device-mapper stacks. > DM only uses it for performance reasons - it can already split bios when > it needs to. I tried to remove merge_bvec_fn from DM several years ago but > couldn't because of the adverse performance impact of lots of splitting activity. Does it still? Since the move to immutable biovecs the bio splits are pretty cheap now, but I'd really like to see this verified by benchmarks. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
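Christoph's point about immutable biovecs is that a split no longer copies pages or biovecs; it clones the bio and adjusts the iterators, so the remainder can be chained to the original and resubmitted. Roughly the pattern the converted drivers use, sketched against the 4.x-era bio_split()/bio_chain() helpers (details may differ in the posted series):

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Split off the first 'sectors' sectors of 'bio'.  The returned bio shares
 * the original's pages and biovec; only the iterators differ, which is why
 * post-immutable-biovec splits are cheap.
 */
static struct bio *sketch_split_front(struct bio *bio, unsigned int sectors,
				      struct bio_set *bs)
{
	struct bio *split;

	if (sectors >= bio_sectors(bio))
		return bio;			/* nothing to split */

	split = bio_split(bio, sectors, GFP_NOIO, bs);
	bio_chain(split, bio);			/* both pieces complete the I/O */
	generic_make_request(bio);		/* re-queue the remainder */
	return split;				/* caller handles the front piece */
}
```

Whether "cheap" is cheap enough on big striped configurations is exactly what the benchmarks requested in this thread are meant to show.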
On Tue, May 26, 2015 at 9:04 AM, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, May 26 2015 at 11:02am -0400, > Ming Lin <mlin@kernel.org> wrote: > >> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote: >> > On Fri, May 22 2015 at 2:18pm -0400, >> > Ming Lin <mlin@kernel.org> wrote: >> > >> >> From: Kent Overstreet <kent.overstreet@gmail.com> >> >> >> >> The way the block layer is currently written, it goes to great lengths >> >> to avoid having to split bios; upper layer code (such as bio_add_page()) >> >> checks what the underlying device can handle and tries to always create >> >> bios that don't need to be split. >> >> >> >> But this approach becomes unwieldy and eventually breaks down with >> >> stacked devices and devices with dynamic limits, and it adds a lot of >> >> complexity. If the block layer could split bios as needed, we could >> >> eliminate a lot of complexity elsewhere - particularly in stacked >> >> drivers. Code that creates bios can then create whatever size bios are >> >> convenient, and more importantly stacked drivers don't have to deal with >> >> both their own bio size limitations and the limitations of the >> >> (potentially multiple) devices underneath them. In the future this will >> >> let us delete merge_bvec_fn and a bunch of other code. >> > >> > This series doesn't take any steps to train upper layers >> > (e.g. filesystems) to size their bios larger (which is defined as >> > "whatever size bios are convenient" above). >> > >> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers >> > (and direct IO) to build up optimally sized bios. Without a replacement >> > (that I can see anyway) how is this patchset making forward progress >> > (getting Acks, etc)!? >> > >> > I like the idea of reduced complexity associated with these late bio >> > splitting changes I'm just not seeing how this is ready given there are >> > no upper layer changes that speak to building larger bios.. >> > >> > What am I missing? >> >> See: [PATCH v4 02/11] block: simplify bio_add_page() >> https://lkml.org/lkml/2015/5/22/754 >> >> Now bio_add_page() can build lager bios. >> And blk_queue_split() can split the bios in ->make_request() if needed. > > That'll result in quite large bios and always needing splitting. > > As Alasdair asked: please provide some performance data that justifies > these changes. E.g use a setup like: XFS on a DM striped target. We > can iterate on more complex setups once we have established some basic > tests. Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. Does it make sense? 
                          4.1-rc4    4.1-rc4-patched
                          -------    ---------------
                          (KB/s)     (KB/s)
sequential-read-buf:      150822     151371
sequential-read-direct:   408938     421940
random-read-buf:          3404.9     3389.1
random-read-direct:       4859.8     4843.5
sequential-write-buf:     333455     335776
sequential-write-direct:  44739      43194
random-write-buf:         7272.1     7209.6
random-write-direct:      4333.9     4330.7

root@minggr:~/tmp/test# cat t.job
[global]
size=1G
directory=/mnt/
numjobs=8
group_reporting
runtime=300
time_based
bs=8k
ioengine=libaio
iodepth=64

[sequential-read-buf]
rw=read

[sequential-read-direct]
rw=read
direct=1

[random-read-buf]
rw=randread

[random-read-direct]
rw=randread
direct=1

[sequential-write-buf]
rw=write

[sequential-write-direct]
rw=write
direct=1

[random-write-buf]
rw=randwrite

[random-write-direct]
rw=randwrite
direct=1

root@minggr:~/tmp/test# cat run.sh
#!/bin/bash
jobs="sequential-read-buf sequential-read-direct random-read-buf random-read-direct"
jobs="$jobs sequential-write-buf sequential-write-direct random-write-buf random-write-direct"

#each partition is 100G
pvcreate /dev/sdb3 /dev/nvme0n1p1 /dev/sdc6
vgcreate striped_vol_group /dev/sdb3 /dev/nvme0n1p1 /dev/sdc6
lvcreate -i3 -I4 -L250G -nstriped_logical_volume striped_vol_group

for job in $jobs ; do
	umount /mnt > /dev/null 2>&1
	mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
	mount /dev/striped_vol_group/striped_logical_volume /mnt
	fio --output=${job}.log --section=${job} t.job
done

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. > Does it make sense? To stripe across devices with different characteristics? Some suggestions. Prepare 3 kernels. O - Old kernel. M - Old kernel with merge_bvec_fn disabled. N - New kernel. You're trying to search for counter-examples to the hypothesis that "Kernel N always outperforms Kernel O". Then if you find any, trying to show either that the performance impediment is small enough that it doesn't matter or that the cases are sufficiently rare or obscure that they may be ignored because of the greater benefits of N in much more common cases. (1) You're looking to set up configurations where kernel O performs noticeably better than M. Then you're comparing the performance of O and N in those situations. (2) You're looking at other sensible configurations where O and M have similar performance, and comparing that with the performance of N. In each case you find, you expect to be able to vary some parameter (such as stripe size) to show a progression of the effect. When running tests you've to take care the system is reset into the same initial state before each test, so you'll normally also try to include some baseline test between tests that should give the same results each time and also take the average of a number of runs (while also reporting some measure of the variation within each set to make sure that remains low, typically a low single digit percentage). Since we're mostly concerned about splitting, you'll want to monitor iostat to see if that gives you enough information to home in on suitable configurations for (1). Failing that, you might need to instrument the kernels to tell you the sizes of the bios being created and the amount of splitting actually happening. Striping was mentioned because it forces splitting. So show the progression from tiny stripes to huge stripes. (Ensure all the devices providing the stripes have identical characteristics, but you can test with slow and fast underlying devices.) You may also want to test systems with a restricted amount of available memory to show how the splitting via worker thread performs. (Again, instrument to prove the extent to which the new code is being exercised.) Alasdair -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
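One way to do the instrumentation Alasdair mentions (seeing the sizes of the bios actually being built and how much splitting actually happens) is a handful of counters bumped at submission time. The following is only a sketch; the hook point, names, and the mechanism for reading the counters back (debugfs, sysfs, or a printk at unmount) are all left open and are not part of any posted patch:

```c
#include <linux/atomic.h>
#include <linux/bio.h>

/* Hypothetical instrumentation: histogram of submitted bio sizes. */
static atomic64_t sketch_bio_size_hist[8];	/* <4K, 4-8K, 8-16K, ..., >=256K */

/* Call this from generic_make_request() (or wherever is convenient). */
static void sketch_account_bio(struct bio *bio)
{
	unsigned int kb = bio->bi_iter.bi_size >> 10;
	int bucket = 0;

	while (bucket < 7 && kb >= (4U << bucket))
		bucket++;
	atomic64_inc(&sketch_bio_size_hist[bucket]);
}
```

Comparing the histogram between kernels O, M and N would show directly whether larger bios are being built and whether the late-split path is actually being exercised by a given workload.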
On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <agk@redhat.com> wrote: > On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: >> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. >> Does it make sense? > > To stripe across devices with different characteristics? I intended to test it on a 2 sockets server with 10 NVMe drives. But that server has been busy running other tests. So I have to run test on a PC which happen to have 2 SSDs + 1 HDD. > > Some suggestions. Thanks for the great detail. I'm reading to understand. > > Prepare 3 kernels. > O - Old kernel. > M - Old kernel with merge_bvec_fn disabled. > N - New kernel. > > You're trying to search for counter-examples to the hypothesis that > "Kernel N always outperforms Kernel O". Then if you find any, trying > to show either that the performance impediment is small enough that > it doesn't matter or that the cases are sufficiently rare or obscure > that they may be ignored because of the greater benefits of N in much more > common cases. > > (1) You're looking to set up configurations where kernel O performs noticeably > better than M. Then you're comparing the performance of O and N in those > situations. > > (2) You're looking at other sensible configurations where O and M have > similar performance, and comparing that with the performance of N. > > In each case you find, you expect to be able to vary some parameter (such as > stripe size) to show a progression of the effect. > > When running tests you've to take care the system is reset into the same > initial state before each test, so you'll normally also try to include some > baseline test between tests that should give the same results each time > and also take the average of a number of runs (while also reporting some > measure of the variation within each set to make sure that remains low, > typically a low single digit percentage). > > Since we're mostly concerned about splitting, you'll want to monitor > iostat to see if that gives you enough information to home in on > suitable configurations for (1). Failing that, you might need to > instrument the kernels to tell you the sizes of the bios being > created and the amount of splitting actually happening. > > Striping was mentioned because it forces splitting. So show the progression > from tiny stripes to huge stripes. (Ensure all the devices providing the > stripes have identical characteristics, but you can test with slow and > fast underlying devices.) > > You may also want to test systems with a restricted amount of available > memory to show how the splitting via worker thread performs. (Again, > instrument to prove the extent to which the new code is being exercised.) > > Alasdair > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <agk@redhat.com> wrote: > On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: >> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. >> Does it make sense? > > To stripe across devices with different characteristics? > > Some suggestions. > > Prepare 3 kernels. > O - Old kernel. > M - Old kernel with merge_bvec_fn disabled. How to disable it? Maybe just hack it as below? void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn) { //q->merge_bvec_fn = mbfn; } > N - New kernel. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, May 29 2015 at 3:05P -0400, Ming Lin <mlin@kernel.org> wrote: > On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <agk@redhat.com> wrote: > > On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: > >> Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. > >> Does it make sense? > > > > To stripe across devices with different characteristics? > > > > Some suggestions. > > > > Prepare 3 kernels. > > O - Old kernel. > > M - Old kernel with merge_bvec_fn disabled. > > How to disable it? > Maybe just hack it as below? > > void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn) > { > //q->merge_bvec_fn = mbfn; > } Right, there isn't an existing way to disable it, you'd need a hack like that. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
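If the "kernel M" variant is going to be benchmarked repeatedly, it may be more convenient to make the disable switchable at boot time rather than editing the source, so kernels O and M are the same build. A possible sketch; the parameter name is invented for illustration and is not an existing kernel option:

```c
#include <linux/moduleparam.h>
#include <linux/blkdev.h>

/*
 * Hypothetical: boot with "disable_merge_bvec=1" to turn kernel "O"
 * into kernel "M" without a rebuild.  Not an existing kernel option.
 */
static bool disable_merge_bvec;
core_param(disable_merge_bvec, disable_merge_bvec, bool, 0444);

void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
{
	q->merge_bvec_fn = disable_merge_bvec ? NULL : mbfn;
}
```

Either way (hard-coded hack or parameter), the effect is the same: bio_add_page() never consults the stacked driver, so the old kernel builds bios as large as the top-level queue limits allow.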
On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote: > On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: > > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. > > Does it make sense? > > To stripe across devices with different characteristics? > > Some suggestions. > > Prepare 3 kernels. > O - Old kernel. > M - Old kernel with merge_bvec_fn disabled. > N - New kernel. > > You're trying to search for counter-examples to the hypothesis that > "Kernel N always outperforms Kernel O". Then if you find any, trying > to show either that the performance impediment is small enough that > it doesn't matter or that the cases are sufficiently rare or obscure > that they may be ignored because of the greater benefits of N in much more > common cases. > > (1) You're looking to set up configurations where kernel O performs noticeably > better than M. Then you're comparing the performance of O and N in those > situations. > > (2) You're looking at other sensible configurations where O and M have > similar performance, and comparing that with the performance of N. I didn't find case (1). But the important thing for this series is to simplify the block layer based on immutable biovecs. I don't expect a performance improvement. Here are the change statistics: "68 files changed, 336 insertions(+), 1331 deletions(-)". I ran the 3 test cases below to make sure it didn't bring any regressions. Test environment: 2 NVMe drives on a 2-socket server. Each case ran for 30 minutes.

1) btrfs raid0

mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
mount /dev/nvme0n1 /mnt

Then run an 8K read:

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=4
rw=read

[job1]
bs=8K
directory=/mnt
size=1G

2) ext4 on MD raid5

mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt

fio script same as btrfs test

3) xfs on DM striped target

pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
mount /dev/striped_vol_group/striped_logical_volume /mnt

fio script same as btrfs test

------

Results:
        4.1-rc4       4.1-rc4-patched
btrfs   1818.6MB/s    1874.1MB/s
ext4    717307KB/s    714030KB/s
xfs     1396.6MB/s    1398.6MB/s

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote: > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote: >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: >> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. >> > Does it make sense? >> >> To stripe across devices with different characteristics? >> >> Some suggestions. >> >> Prepare 3 kernels. >> O - Old kernel. >> M - Old kernel with merge_bvec_fn disabled. >> N - New kernel. >> >> You're trying to search for counter-examples to the hypothesis that >> "Kernel N always outperforms Kernel O". Then if you find any, trying >> to show either that the performance impediment is small enough that >> it doesn't matter or that the cases are sufficiently rare or obscure >> that they may be ignored because of the greater benefits of N in much more >> common cases. >> >> (1) You're looking to set up configurations where kernel O performs noticeably >> better than M. Then you're comparing the performance of O and N in those >> situations. >> >> (2) You're looking at other sensible configurations where O and M have >> similar performance, and comparing that with the performance of N. > > I didn't find case (1). > > But the important thing for this series is to simplify block layer > based on immutable biovecs. I don't expect performance improvement. > > Here is the changes statistics. > > "68 files changed, 336 insertions(+), 1331 deletions(-)" > > I run below 3 test cases to make sure it didn't bring any regressions. > Test environment: 2 NVMe drives on 2 sockets server. > Each case run for 30 minutes. > > 2) btrfs radi0 > > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1 > mount /dev/nvme0n1 /mnt > > Then run 8K read. > > [global] > ioengine=libaio > iodepth=64 > direct=1 > runtime=1800 > time_based > group_reporting > numjobs=4 > rw=read > > [job1] > bs=8K > directory=/mnt > size=1G > > 2) ext4 on MD raid5 > > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1 > mkfs.ext4 /dev/md0 > mount /dev/md0 /mnt > > fio script same as btrfs test > > 3) xfs on DM stripped target > > pvcreate /dev/nvme0n1 /dev/nvme1n1 > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1 > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume > mount /dev/striped_vol_group/striped_logical_volume /mnt > > fio script same as btrfs test > > ------ > > Results: > > 4.1-rc4 4.1-rc4-patched > btrfs 1818.6MB/s 1874.1MB/s > ext4 717307KB/s 714030KB/s > xfs 1396.6MB/s 1398.6MB/s Hi Alasdair & Mike, Would you like these numbers? I'd like to address your concerns to move forward. Thanks. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, Jun 02 2015 at 4:59pm -0400, Ming Lin <mlin@kernel.org> wrote: > On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote: > > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote: > >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: > >> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. > >> > Does it make sense? > >> > >> To stripe across devices with different characteristics? > >> > >> Some suggestions. > >> > >> Prepare 3 kernels. > >> O - Old kernel. > >> M - Old kernel with merge_bvec_fn disabled. > >> N - New kernel. > >> > >> You're trying to search for counter-examples to the hypothesis that > >> "Kernel N always outperforms Kernel O". Then if you find any, trying > >> to show either that the performance impediment is small enough that > >> it doesn't matter or that the cases are sufficiently rare or obscure > >> that they may be ignored because of the greater benefits of N in much more > >> common cases. > >> > >> (1) You're looking to set up configurations where kernel O performs noticeably > >> better than M. Then you're comparing the performance of O and N in those > >> situations. > >> > >> (2) You're looking at other sensible configurations where O and M have > >> similar performance, and comparing that with the performance of N. > > > > I didn't find case (1). > > > > But the important thing for this series is to simplify block layer > > based on immutable biovecs. I don't expect performance improvement. No simplifying isn't the important thing. Any change to remove the merge_bvec callbacks needs to not introduce performance regressions on enterprise systems with large RAID arrays, etc. It is fine if there isn't a performance improvement but I really don't think the limited testing you've done on a relatively small storage configuration has come even close to showing these changes don't introduce performance regressions. > > Here is the changes statistics. > > > > "68 files changed, 336 insertions(+), 1331 deletions(-)" > > > > I run below 3 test cases to make sure it didn't bring any regressions. > > Test environment: 2 NVMe drives on 2 sockets server. > > Each case run for 30 minutes. > > > > 2) btrfs radi0 > > > > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1 > > mount /dev/nvme0n1 /mnt > > > > Then run 8K read. > > > > [global] > > ioengine=libaio > > iodepth=64 > > direct=1 > > runtime=1800 > > time_based > > group_reporting > > numjobs=4 > > rw=read > > > > [job1] > > bs=8K > > directory=/mnt > > size=1G > > > > 2) ext4 on MD raid5 > > > > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1 > > mkfs.ext4 /dev/md0 > > mount /dev/md0 /mnt > > > > fio script same as btrfs test > > > > 3) xfs on DM stripped target > > > > pvcreate /dev/nvme0n1 /dev/nvme1n1 > > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1 > > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group > > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume > > mount /dev/striped_vol_group/striped_logical_volume /mnt > > > > fio script same as btrfs test > > > > ------ > > > > Results: > > > > 4.1-rc4 4.1-rc4-patched > > btrfs 1818.6MB/s 1874.1MB/s > > ext4 717307KB/s 714030KB/s > > xfs 1396.6MB/s 1398.6MB/s > > Hi Alasdair & Mike, > > Would you like these numbers? > I'd like to address your concerns to move forward. I really don't see that these NVMe results prove much. We need to test on large HW raid setups like a Netapp filer (or even local SAS drives connected via some SAS controller). 
Like an 8+2-drive RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 devices is also useful. It is larger RAID setups that will be more sensitive to IO sizes being properly aligned on RAID stripe and/or chunk size boundaries. There are tradeoffs between creating a really large bio and creating a properly sized bio from the start. And yes, to one of neilb's original points, limits do change and we suck at restacking limits, so what was once properly sized may no longer be, but that is a relatively rare occurrence. Late splitting does do away with the limits stacking disconnect. And in general I like the idea of removing all the merge_bvec code. I just don't think I can confidently Ack such a wholesale switch at this point with such limited performance analysis. If we (the DM/lvm team at Red Hat) are being painted into a corner of having to provide our own testing that meets our definition of "thorough" then we'll need time to carry out those tests. But I'd hate to hold up everyone because DM is not in agreement on this change... So taking a step back, why can't we introduce late bio splitting in a phased approach?

1: introduce late bio splitting to block core BUT still keep the established merge_bvec infrastructure
2: establish a way for upper layers to skip merge_bvec if they'd like to do so (e.g. block-core exposes a 'use_late_bio_splitting' or something for userspace or upper layers to set; we could also have a Kconfig that enables this feature by default)
3: we gain confidence in late bio splitting and then carry on with the removal of merge_bvec et al (could be done incrementally on a per-driver basis, e.g. DM, MD, btrfs, etc.)

Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
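Mike's step 2 amounts to a per-queue opt-out: a driver (or the block core, via Kconfig) declares that it copes with oversized bios, and bio_add_page() then stops consulting merge_bvec_fn for that queue. A hypothetical sketch of such a gate; the flag name and helper are invented for illustration and are not part of any posted patch:

```c
#include <linux/blkdev.h>

/*
 * Hypothetical queue flag (not a real kernel flag): set by drivers that
 * are happy to receive oversized bios and rely on late splitting.
 */
#define QUEUE_FLAG_LATE_SPLIT	29

static inline bool sketch_use_late_split(struct request_queue *q)
{
	return test_bit(QUEUE_FLAG_LATE_SPLIT, &q->queue_flags);
}

/*
 * The merge_bvec consultation in the bio_add_page() path (see the earlier
 * gatekeeping sketch) would then become conditional:
 *
 *	if (q->merge_bvec_fn && !sketch_use_late_split(q)) {
 *		... existing merge_bvec_fn check ...
 *	}
 */
```

This would let the two code paths coexist and be compared on the same kernel while confidence in late splitting is built up, which is the point of the phased approach.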
On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, Jun 02 2015 at 4:59pm -0400, > Ming Lin <mlin@kernel.org> wrote: > >> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote: >> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote: >> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote: >> >> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD. >> >> > Does it make sense? >> >> >> >> To stripe across devices with different characteristics? >> >> >> >> Some suggestions. >> >> >> >> Prepare 3 kernels. >> >> O - Old kernel. >> >> M - Old kernel with merge_bvec_fn disabled. >> >> N - New kernel. >> >> >> >> You're trying to search for counter-examples to the hypothesis that >> >> "Kernel N always outperforms Kernel O". Then if you find any, trying >> >> to show either that the performance impediment is small enough that >> >> it doesn't matter or that the cases are sufficiently rare or obscure >> >> that they may be ignored because of the greater benefits of N in much more >> >> common cases. >> >> >> >> (1) You're looking to set up configurations where kernel O performs noticeably >> >> better than M. Then you're comparing the performance of O and N in those >> >> situations. >> >> >> >> (2) You're looking at other sensible configurations where O and M have >> >> similar performance, and comparing that with the performance of N. >> > >> > I didn't find case (1). >> > >> > But the important thing for this series is to simplify block layer >> > based on immutable biovecs. I don't expect performance improvement. > > No simplifying isn't the important thing. Any change to remove the > merge_bvec callbacks needs to not introduce performance regressions on > enterprise systems with large RAID arrays, etc. > > It is fine if there isn't a performance improvement but I really don't > think the limited testing you've done on a relatively small storage > configuration has come even close to showing these changes don't > introduce performance regressions. > >> > Here is the changes statistics. >> > >> > "68 files changed, 336 insertions(+), 1331 deletions(-)" >> > >> > I run below 3 test cases to make sure it didn't bring any regressions. >> > Test environment: 2 NVMe drives on 2 sockets server. >> > Each case run for 30 minutes. >> > >> > 2) btrfs radi0 >> > >> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1 >> > mount /dev/nvme0n1 /mnt >> > >> > Then run 8K read. >> > >> > [global] >> > ioengine=libaio >> > iodepth=64 >> > direct=1 >> > runtime=1800 >> > time_based >> > group_reporting >> > numjobs=4 >> > rw=read >> > >> > [job1] >> > bs=8K >> > directory=/mnt >> > size=1G >> > >> > 2) ext4 on MD raid5 >> > >> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1 >> > mkfs.ext4 /dev/md0 >> > mount /dev/md0 /mnt >> > >> > fio script same as btrfs test >> > >> > 3) xfs on DM stripped target >> > >> > pvcreate /dev/nvme0n1 /dev/nvme1n1 >> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1 >> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group >> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume >> > mount /dev/striped_vol_group/striped_logical_volume /mnt >> > >> > fio script same as btrfs test >> > >> > ------ >> > >> > Results: >> > >> > 4.1-rc4 4.1-rc4-patched >> > btrfs 1818.6MB/s 1874.1MB/s >> > ext4 717307KB/s 714030KB/s >> > xfs 1396.6MB/s 1398.6MB/s >> >> Hi Alasdair & Mike, >> >> Would you like these numbers? 
>> I'd like to address your concerns to move forward. > > I really don't see that these NVMe results prove much. > > We need to test on large HW raid setups like a Netapp filer (or even > local SAS drives connected via some SAS controller). Like a 8+2 drive > RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 > devices is also useful. It is larger RAID setups that will be more > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk > size boundaries. I'll test it on a large HW RAID setup. Here is a HW RAID5 setup with 19 278G HDDs on a Dell R730xd (2 sockets/48 logical cpus/264G mem): http://minggr.net/pub/20150604/hw_raid5.jpg The stripe size is 64K. I'm going to test ext4/btrfs/xfs on it, with "bs" set to 1216K (64K * 19 = 1216K) and 48 jobs.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
rw=read

[job1]
bs=1216K
directory=/mnt
size=1G

Or do you have other suggestions of what tests I should run? Thanks. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, Jun 04 2015 at 6:21pm -0400, Ming Lin <mlin@kernel.org> wrote: > On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <snitzer@redhat.com> wrote: > > > > We need to test on large HW raid setups like a Netapp filer (or even > > local SAS drives connected via some SAS controller). Like a 8+2 drive > > RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 > > devices is also useful. It is larger RAID setups that will be more > > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk > > size boundaries. > > I'll test it on large HW raid setup. > > Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48 > logical cpus/264G mem). > http://minggr.net/pub/20150604/hw_raid5.jpg > > The stripe size is 64K. > > I'm going to test ext4/btrfs/xfs on it. > "bs" set to 1216k(64K * 19 = 1216k) > and run 48 jobs. Definitely an odd blocksize (though 1280K full stripe is pretty common for 10+2 HW RAID6 w/ 128K chunk size). > [global] > ioengine=libaio > iodepth=64 > direct=1 > runtime=1800 > time_based > group_reporting > numjobs=48 > rw=read > > [job1] > bs=1216K > directory=/mnt > size=1G How does time_based relate to size=1G? It'll rewrite the same 1 gig file repeatedly? > Or do you have other suggestions of what tests I should run? You're welcome to run this job but I'll also check with others here to see what fio jobs we used in the recent past when assessing performance of the dm-crypt parallelization changes. Also, a lot of care needs to be taken to eliminate jitter in the system while the test is running. We got a lot of good insight from Bart Van Assche on that and put it to practice. I'll see if we can (re)summarize that too. Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, Jun 4, 2015 at 5:06 PM, Mike Snitzer <snitzer@redhat.com> wrote: > On Thu, Jun 04 2015 at 6:21pm -0400, > Ming Lin <mlin@kernel.org> wrote: > >> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <snitzer@redhat.com> wrote: >> > >> > We need to test on large HW raid setups like a Netapp filer (or even >> > local SAS drives connected via some SAS controller). Like a 8+2 drive >> > RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 >> > devices is also useful. It is larger RAID setups that will be more >> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk >> > size boundaries. >> >> I'll test it on large HW raid setup. >> >> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48 >> logical cpus/264G mem). >> http://minggr.net/pub/20150604/hw_raid5.jpg >> >> The stripe size is 64K. >> >> I'm going to test ext4/btrfs/xfs on it. >> "bs" set to 1216k(64K * 19 = 1216k) >> and run 48 jobs. > > Definitely an odd blocksize (though 1280K full stripe is pretty common > for 10+2 HW RAID6 w/ 128K chunk size). I can change it to 10 HDDs HW RAID6 w/ 128K chunk size, then use bs=1280K > >> [global] >> ioengine=libaio >> iodepth=64 >> direct=1 >> runtime=1800 >> time_based >> group_reporting >> numjobs=48 >> rw=read >> >> [job1] >> bs=1216K >> directory=/mnt >> size=1G > > How does time_based relate to size=1G? It'll rewrite the same 1 gig > file repeatedly? Above job file is for read. For write, I think so. Do is make sense for performance test? > >> Or do you have other suggestions of what tests I should run? > > You're welcome to run this job but I'll also check with others here to > see what fio jobs we used in the recent past when assessing performance > of the dm-crypt parallelization changes. That's very helpful. > > Also, a lot of care needs to be taken to eliminate jitter in the system > while the test is running. We got a lot of good insight from Bart Van > Assche on that and put it to practice. I'll see if we can (re)summarize > that too. Very helpful too. Thanks. > > Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote: > We need to test on large HW raid setups like a Netapp filer (or even > local SAS drives connected via some SAS controller). Like a 8+2 drive > RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 > devices is also useful. It is larger RAID setups that will be more > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk > size boundaries. Here are test results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe targets. Each case ran for 0.5 hours, so it took 36 hours to finish all the tests on the 4.1-rc4 and 4.1-rc4-patched kernels. No performance regressions were introduced.

Test server: Dell R730xd (2 sockets/48 logical cpus/264G memory)
HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G.
Stripe sizes 64k and 128k were tested.

devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
spare_devs="/dev/sdl /dev/sdm"
stripe_size=64 (or 128)

MD RAID6 was created by:
mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size

DM stripe target was created by:
pvcreate $devs
vgcreate striped_vol_group $devs
lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

Here is an example fio script for stripe size 128k:
[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
gtod_reduce=0
norandommap
write_iops_log=fs

[job1]
bs=1280K
directory=/mnt
size=5G
rw=read

All results here: http://minggr.net/pub/20150608/fio_results/

Results summary:

1. HW RAID6: stripe size 64k
              4.1-rc4    4.1-rc4-patched
              -------    ---------------
              (MB/s)     (MB/s)
xfs read:     821.23     812.20    -1.09%
xfs write:    753.16     754.42    +0.16%
ext4 read:    827.80     834.82    +0.84%
ext4 write:   783.08     777.58    -0.70%
btrfs read:   859.26     871.68    +1.44%
btrfs write:  815.63     844.40    +3.52%

2. HW RAID6: stripe size 128k
              4.1-rc4    4.1-rc4-patched
              -------    ---------------
              (MB/s)     (MB/s)
xfs read:     948.27     979.11    +3.25%
xfs write:    820.78     819.94    -0.10%
ext4 read:    978.35     997.92    +2.00%
ext4 write:   853.51     847.97    -0.64%
btrfs read:   1013.1     1015.6    +0.24%
btrfs write:  854.43     850.42    -0.46%

3. MD RAID6: stripe size 64k
              4.1-rc4    4.1-rc4-patched
              -------    ---------------
              (MB/s)     (MB/s)
xfs read:     847.34     869.43    +2.60%
xfs write:    198.67     199.03    +0.18%
ext4 read:    763.89     767.79    +0.51%
ext4 write:   281.44     282.83    +0.49%
btrfs read:   756.02     743.69    -1.63%
btrfs write:  268.37     265.93    -0.90%

4. MD RAID6: stripe size 128k
              4.1-rc4    4.1-rc4-patched
              -------    ---------------
              (MB/s)     (MB/s)
xfs read:     993.04     1014.1    +2.12%
xfs write:    293.06     298.95    +2.00%
ext4 read:    1019.6     1020.9    +0.12%
ext4 write:   371.51     371.47    -0.01%
btrfs read:   1000.4     1020.8    +2.03%
btrfs write:  241.08     246.77    +2.36%

5. DM: stripe size 64k
              4.1-rc4    4.1-rc4-patched
              -------    ---------------
              (MB/s)     (MB/s)
xfs read:     1084.4     1080.1    -0.39%
xfs write:    1071.1     1063.4    -0.71%
ext4 read:    991.54     1003.7    +1.22%
ext4 write:   1069.7     1052.2    -1.63%
btrfs read:   1076.1     1082.1    +0.55%
btrfs write:  968.98     965.07    -0.40%

6. DM: stripe size 128k
              4.1-rc4    4.1-rc4-patched
              -------    ---------------
              (MB/s)     (MB/s)
xfs read:     1020.4     1066.1    +4.47%
xfs write:    1058.2     1066.6    +0.79%
ext4 read:    990.72     988.19    -0.25%
ext4 write:   1050.4     1070.2    +1.88%
btrfs read:   1080.9     1074.7    -0.57%
btrfs write:  975.10     972.76    -0.23%

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote: > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote: >> We need to test on large HW raid setups like a Netapp filer (or even >> local SAS drives connected via some SAS controller). Like a 8+2 drive >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 >> devices is also useful. It is larger RAID setups that will be more >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk >> size boundaries. > > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target. > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels. > > No performance regressions were introduced. > > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory) > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G > Stripe size 64k and 128k were tested. > > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk" > spare_devs="/dev/sdl /dev/sdm" > stripe_size=64 (or 128) > > MD RAID6 was created by: > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size > > DM stripe target was created by: > pvcreate $devs > vgcreate striped_vol_group $devs > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group > > Here is an example of fio script for stripe size 128k: > [global] > ioengine=libaio > iodepth=64 > direct=1 > runtime=1800 > time_based > group_reporting > numjobs=48 > gtod_reduce=0 > norandommap > write_iops_log=fs > > [job1] > bs=1280K > directory=/mnt > size=5G > rw=read > > All results here: http://minggr.net/pub/20150608/fio_results/ > > Results summary: > > 1. HW RAID6: stripe size 64k > 4.1-rc4 4.1-rc4-patched > ------- --------------- > (MB/s) (MB/s) > xfs read: 821.23 812.20 -1.09% > xfs write: 753.16 754.42 +0.16% > ext4 read: 827.80 834.82 +0.84% > ext4 write: 783.08 777.58 -0.70% > btrfs read: 859.26 871.68 +1.44% > btrfs write: 815.63 844.40 +3.52% > > 2. HW RAID6: stripe size 128k > 4.1-rc4 4.1-rc4-patched > ------- --------------- > (MB/s) (MB/s) > xfs read: 948.27 979.11 +3.25% > xfs write: 820.78 819.94 -0.10% > ext4 read: 978.35 997.92 +2.00% > ext4 write: 853.51 847.97 -0.64% > btrfs read: 1013.1 1015.6 +0.24% > btrfs write: 854.43 850.42 -0.46% > > 3. MD RAID6: stripe size 64k > 4.1-rc4 4.1-rc4-patched > ------- --------------- > (MB/s) (MB/s) > xfs read: 847.34 869.43 +2.60% > xfs write: 198.67 199.03 +0.18% > ext4 read: 763.89 767.79 +0.51% > ext4 write: 281.44 282.83 +0.49% > btrfs read: 756.02 743.69 -1.63% > btrfs write: 268.37 265.93 -0.90% > > 4. MD RAID6: stripe size 128k > 4.1-rc4 4.1-rc4-patched > ------- --------------- > (MB/s) (MB/s) > xfs read: 993.04 1014.1 +2.12% > xfs write: 293.06 298.95 +2.00% > ext4 read: 1019.6 1020.9 +0.12% > ext4 write: 371.51 371.47 -0.01% > btrfs read: 1000.4 1020.8 +2.03% > btrfs write: 241.08 246.77 +2.36% > > 5. DM: stripe size 64k > 4.1-rc4 4.1-rc4-patched > ------- --------------- > (MB/s) (MB/s) > xfs read: 1084.4 1080.1 -0.39% > xfs write: 1071.1 1063.4 -0.71% > ext4 read: 991.54 1003.7 +1.22% > ext4 write: 1069.7 1052.2 -1.63% > btrfs read: 1076.1 1082.1 +0.55% > btrfs write: 968.98 965.07 -0.40% > > 6. 
DM: stripe size 128k > 4.1-rc4 4.1-rc4-patched > ------- --------------- > (MB/s) (MB/s) > xfs read: 1020.4 1066.1 +4.47% > xfs write: 1058.2 1066.6 +0.79% > ext4 read: 990.72 988.19 -0.25% > ext4 write: 1050.4 1070.2 +1.88% > btrfs read: 1080.9 1074.7 -0.57% > btrfs write: 975.10 972.76 -0.23% Hi Mike, How about these numbers? I'm also happy to run other fio jobs your team used. Thanks. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, Jun 10 2015 at 5:20pm -0400, Ming Lin <mlin@kernel.org> wrote: > On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote: > > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote: > >> We need to test on large HW raid setups like a Netapp filer (or even > >> local SAS drives connected via some SAS controller). Like a 8+2 drive > >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 > >> devices is also useful. It is larger RAID setups that will be more > >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk > >> size boundaries. > > > > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target. > > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels. > > > > No performance regressions were introduced. > > > > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory) > > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G > > Stripe size 64k and 128k were tested. > > > > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk" > > spare_devs="/dev/sdl /dev/sdm" > > stripe_size=64 (or 128) > > > > MD RAID6 was created by: > > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size > > > > DM stripe target was created by: > > pvcreate $devs > > vgcreate striped_vol_group $devs > > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group DM had a regression relative to merge_bvec that wasn't fixed until recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix casting bug in dm_merge_bvec()"). It was introduced in 4.1. So your 4.1-rc4 DM stripe testing may have effectively been with merge_bvec disabled. > > Here is an example of fio script for stripe size 128k: > > [global] > > ioengine=libaio > > iodepth=64 > > direct=1 > > runtime=1800 > > time_based > > group_reporting > > numjobs=48 > > gtod_reduce=0 > > norandommap > > write_iops_log=fs > > > > [job1] > > bs=1280K > > directory=/mnt > > size=5G > > rw=read > > > > All results here: http://minggr.net/pub/20150608/fio_results/ > > > > Results summary: > > > > 1. HW RAID6: stripe size 64k > > 4.1-rc4 4.1-rc4-patched > > ------- --------------- > > (MB/s) (MB/s) > > xfs read: 821.23 812.20 -1.09% > > xfs write: 753.16 754.42 +0.16% > > ext4 read: 827.80 834.82 +0.84% > > ext4 write: 783.08 777.58 -0.70% > > btrfs read: 859.26 871.68 +1.44% > > btrfs write: 815.63 844.40 +3.52% > > > > 2. HW RAID6: stripe size 128k > > 4.1-rc4 4.1-rc4-patched > > ------- --------------- > > (MB/s) (MB/s) > > xfs read: 948.27 979.11 +3.25% > > xfs write: 820.78 819.94 -0.10% > > ext4 read: 978.35 997.92 +2.00% > > ext4 write: 853.51 847.97 -0.64% > > btrfs read: 1013.1 1015.6 +0.24% > > btrfs write: 854.43 850.42 -0.46% > > > > 3. MD RAID6: stripe size 64k > > 4.1-rc4 4.1-rc4-patched > > ------- --------------- > > (MB/s) (MB/s) > > xfs read: 847.34 869.43 +2.60% > > xfs write: 198.67 199.03 +0.18% > > ext4 read: 763.89 767.79 +0.51% > > ext4 write: 281.44 282.83 +0.49% > > btrfs read: 756.02 743.69 -1.63% > > btrfs write: 268.37 265.93 -0.90% > > > > 4. 
> > MD RAID6: stripe size 128k > > 4.1-rc4 4.1-rc4-patched > > ------- --------------- > > (MB/s) (MB/s) > > xfs read: 993.04 1014.1 +2.12% > > xfs write: 293.06 298.95 +2.00% > > ext4 read: 1019.6 1020.9 +0.12% > > ext4 write: 371.51 371.47 -0.01% > > btrfs read: 1000.4 1020.8 +2.03% > > btrfs write: 241.08 246.77 +2.36% > > > > 5. DM: stripe size 64k > > 4.1-rc4 4.1-rc4-patched > > ------- --------------- > > (MB/s) (MB/s) > > xfs read: 1084.4 1080.1 -0.39% > > xfs write: 1071.1 1063.4 -0.71% > > ext4 read: 991.54 1003.7 +1.22% > > ext4 write: 1069.7 1052.2 -1.63% > > btrfs read: 1076.1 1082.1 +0.55% > > btrfs write: 968.98 965.07 -0.40% > > > > 6. DM: stripe size 128k > > 4.1-rc4 4.1-rc4-patched > > ------- --------------- > > (MB/s) (MB/s) > > xfs read: 1020.4 1066.1 +4.47% > > xfs write: 1058.2 1066.6 +0.79% > > ext4 read: 990.72 988.19 -0.25% > > ext4 write: 1050.4 1070.2 +1.88% > > btrfs read: 1080.9 1074.7 -0.57% > > btrfs write: 975.10 972.76 -0.23% > > Hi Mike, > > How about these numbers? Looks fairly good. I'm just not sure the workload is going to test the code paths in question like we'd hope. I'll have to set aside some time to think through scenarios to test. My concern still remains that at some point in the future we'll regret not having merge_bvec but it'll be too late. That is just my own FUD at this point... > I'm also happy to run other fio jobs your team used. I've been busy getting DM changes for the 4.2 merge window finalized. As such I haven't connected with others on the team to discuss this issue. I'll see if we can make time in the next 2 days. But I also have RHEL-specific kernel deadlines I'm coming up against. Seems late to be staging this extensive a change for 4.2... are you pushing for this code to land in the 4.2 merge window? Or do we have time to work this further and target the 4.3 merge? Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <snitzer@redhat.com> wrote: > On Wed, Jun 10 2015 at 5:20pm -0400, > Ming Lin <mlin@kernel.org> wrote: > >> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote: >> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote: >> >> We need to test on large HW raid setups like a Netapp filer (or even >> >> local SAS drives connected via some SAS controller). Like a 8+2 drive >> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 >> >> devices is also useful. It is larger RAID setups that will be more >> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk >> >> size boundaries. >> > >> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target. >> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels. >> > >> > No performance regressions were introduced. >> > >> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory) >> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G >> > Stripe size 64k and 128k were tested. >> > >> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk" >> > spare_devs="/dev/sdl /dev/sdm" >> > stripe_size=64 (or 128) >> > >> > MD RAID6 was created by: >> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size >> > >> > DM stripe target was created by: >> > pvcreate $devs >> > vgcreate striped_vol_group $devs >> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group > > DM had a regression relative to merge_bvec that wasn't fixed until > recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix > casting bug in dm_merge_bvec()"). It was introduced in 4.1. > > So your 4.1-rc4 DM stripe testing may have effectively been with > merge_bvec disabled. I'l rebase it to latest Linus tree and re-run DM stripe testing. > >> > Here is an example of fio script for stripe size 128k: >> > [global] >> > ioengine=libaio >> > iodepth=64 >> > direct=1 >> > runtime=1800 >> > time_based >> > group_reporting >> > numjobs=48 >> > gtod_reduce=0 >> > norandommap >> > write_iops_log=fs >> > >> > [job1] >> > bs=1280K >> > directory=/mnt >> > size=5G >> > rw=read >> > >> > All results here: http://minggr.net/pub/20150608/fio_results/ >> > >> > Results summary: >> > >> > 1. HW RAID6: stripe size 64k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 821.23 812.20 -1.09% >> > xfs write: 753.16 754.42 +0.16% >> > ext4 read: 827.80 834.82 +0.84% >> > ext4 write: 783.08 777.58 -0.70% >> > btrfs read: 859.26 871.68 +1.44% >> > btrfs write: 815.63 844.40 +3.52% >> > >> > 2. HW RAID6: stripe size 128k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 948.27 979.11 +3.25% >> > xfs write: 820.78 819.94 -0.10% >> > ext4 read: 978.35 997.92 +2.00% >> > ext4 write: 853.51 847.97 -0.64% >> > btrfs read: 1013.1 1015.6 +0.24% >> > btrfs write: 854.43 850.42 -0.46% >> > >> > 3. MD RAID6: stripe size 64k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 847.34 869.43 +2.60% >> > xfs write: 198.67 199.03 +0.18% >> > ext4 read: 763.89 767.79 +0.51% >> > ext4 write: 281.44 282.83 +0.49% >> > btrfs read: 756.02 743.69 -1.63% >> > btrfs write: 268.37 265.93 -0.90% >> > >> > 4. 
MD RAID6: stripe size 128k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 993.04 1014.1 +2.12% >> > xfs write: 293.06 298.95 +2.00% >> > ext4 read: 1019.6 1020.9 +0.12% >> > ext4 write: 371.51 371.47 -0.01% >> > btrfs read: 1000.4 1020.8 +2.03% >> > btrfs write: 241.08 246.77 +2.36% >> > >> > 5. DM: stripe size 64k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 1084.4 1080.1 -0.39% >> > xfs write: 1071.1 1063.4 -0.71% >> > ext4 read: 991.54 1003.7 +1.22% >> > ext4 write: 1069.7 1052.2 -1.63% >> > btrfs read: 1076.1 1082.1 +0.55% >> > btrfs write: 968.98 965.07 -0.40% >> > >> > 6. DM: stripe size 128k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 1020.4 1066.1 +4.47% >> > xfs write: 1058.2 1066.6 +0.79% >> > ext4 read: 990.72 988.19 -0.25% >> > ext4 write: 1050.4 1070.2 +1.88% >> > btrfs read: 1080.9 1074.7 -0.57% >> > btrfs write: 975.10 972.76 -0.23% >> >> Hi Mike, >> >> How about these numbers? > > Looks fairly good. I just am not sure the workload is going to test the > code paths in question like we'd hope. I'll have to set aside some time How about adding some counters to record, for example, how many times ->merge_bvec is called in the old kernel and how many times bios are split in the patched kernel? > to think through scenarios to test. Great. > > My concern still remains that at some point it the future we'll regret > not having merge_bvec but it'll be too late. That is just my own FUD at > this point... > >> I'm also happy to run other fio jobs your team used. > > I've been busy getting DM changes for the 4.2 merge window finalized. > As such I haven't connected with others on the team to discuss this > issue. > > I'll see if we can make time in the next 2 days. But I also have > RHEL-specific kernel deadlines I'm coming up against. > > Seems late to be staging this extensive a change for 4.2... are you > pushing for this code to land in the 4.2 merge window? Or do we have > time to work this further and target the 4.3 merge? I'm OK to target the 4.3 merge, but I hope we can get it into the linux-next tree ASAP for wider testing. > > Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
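The counter idea above is only described in words in the thread. As a rough illustration (this is not code from the patchset; the queue limit and bio sizes are made-up numbers), a small user-space C model of the statistic such a counter would report might look like this:

#include <stdio.h>

/*
 * User-space sketch, not kernel code: MAX_SECTORS and the bio sizes
 * below are assumptions chosen only to show what a "how many bios
 * needed splitting" counter would report.
 */
#define MAX_SECTORS 1280u   /* pretend queue_max_sectors(): 640 KiB */

int main(void)
{
    /* Hypothetical bio sizes (in 512-byte sectors) as an upper layer
     * might submit them once bio_add_page() no longer caps bio size. */
    const unsigned bio_sectors[] = { 256, 2560, 1280, 4096, 8, 1288 };
    const unsigned n = sizeof(bio_sectors) / sizeof(bio_sectors[0]);
    unsigned i, would_split = 0;

    for (i = 0; i < n; i++) {
        /* blk_queue_split() only has work to do when a bio exceeds
         * the queue's limits; count those cases. */
        if (bio_sectors[i] > MAX_SECTORS)
            would_split++;
    }

    printf("%u of %u bios exceed max_sectors=%u and would be split\n",
           would_split, n, MAX_SECTORS);
    return 0;
}

In the kernel the decision involves more than a single max_sectors check; blk_bio_segment_split() in the patch below also looks at segment counts, segment sizes and SG gaps. So a per-queue counter bumped inside blk_queue_split() whenever a split actually happens would probably be the more faithful measurement.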
On Wed, 2015-06-10 at 15:06 -0700, Ming Lin wrote: > On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <snitzer@redhat.com> wrote: > > On Wed, Jun 10 2015 at 5:20pm -0400, > > Ming Lin <mlin@kernel.org> wrote: > > > >> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote: > >> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote: > >> >> We need to test on large HW raid setups like a Netapp filer (or even > >> >> local SAS drives connected via some SAS controller). Like a 8+2 drive > >> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 > >> >> devices is also useful. It is larger RAID setups that will be more > >> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk > >> >> size boundaries. > >> > > >> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target. > >> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels. > >> > > >> > No performance regressions were introduced. > >> > > >> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory) > >> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G > >> > Stripe size 64k and 128k were tested. > >> > > >> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk" > >> > spare_devs="/dev/sdl /dev/sdm" > >> > stripe_size=64 (or 128) > >> > > >> > MD RAID6 was created by: > >> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size > >> > > >> > DM stripe target was created by: > >> > pvcreate $devs > >> > vgcreate striped_vol_group $devs > >> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group > > > > DM had a regression relative to merge_bvec that wasn't fixed until > > recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix > > casting bug in dm_merge_bvec()"). It was introduced in 4.1. > > > > So your 4.1-rc4 DM stripe testing may have effectively been with > > merge_bvec disabled. > > I'l rebase it to latest Linus tree and re-run DM stripe testing. Here are the results for 4.1-rc7. They also look good. 5. DM: stripe size 64k 4.1-rc7 4.1-rc7-patched ------- --------------- (MB/s) (MB/s) xfs read: 784.0 783.5 -0.06% xfs write: 751.8 768.8 +2.26% ext4 read: 837.0 832.3 -0.56% ext4 write: 806.8 814.3 +0.92% btrfs read: 787.5 786.1 -0.17% btrfs write: 722.8 718.7 -0.56% 6. DM: stripe size 128k 4.1-rc7 4.1-rc7-patched ------- --------------- (MB/s) (MB/s) xfs read: 1045.5 1068.8 +2.22% xfs write: 1058.9 1052.7 -0.58% ext4 read: 1001.8 1020.7 +1.88% ext4 write: 1049.9 1053.7 +0.36% btrfs read: 1082.8 1084.8 +0.18% btrfs write: 948.15 948.74 +0.06% -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <snitzer@redhat.com> wrote: > On Wed, Jun 10 2015 at 5:20pm -0400, > Ming Lin <mlin@kernel.org> wrote: > >> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote: >> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote: >> >> We need to test on large HW raid setups like a Netapp filer (or even >> >> local SAS drives connected via some SAS controller). Like a 8+2 drive >> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8 >> >> devices is also useful. It is larger RAID setups that will be more >> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk >> >> size boundaries. >> > >> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target. >> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels. >> > >> > No performance regressions were introduced. >> > >> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory) >> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G >> > Stripe size 64k and 128k were tested. >> > >> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk" >> > spare_devs="/dev/sdl /dev/sdm" >> > stripe_size=64 (or 128) >> > >> > MD RAID6 was created by: >> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size >> > >> > DM stripe target was created by: >> > pvcreate $devs >> > vgcreate striped_vol_group $devs >> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group > > DM had a regression relative to merge_bvec that wasn't fixed until > recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix > casting bug in dm_merge_bvec()"). It was introduced in 4.1. > > So your 4.1-rc4 DM stripe testing may have effectively been with > merge_bvec disabled. > >> > Here is an example of fio script for stripe size 128k: >> > [global] >> > ioengine=libaio >> > iodepth=64 >> > direct=1 >> > runtime=1800 >> > time_based >> > group_reporting >> > numjobs=48 >> > gtod_reduce=0 >> > norandommap >> > write_iops_log=fs >> > >> > [job1] >> > bs=1280K >> > directory=/mnt >> > size=5G >> > rw=read >> > >> > All results here: http://minggr.net/pub/20150608/fio_results/ >> > >> > Results summary: >> > >> > 1. HW RAID6: stripe size 64k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 821.23 812.20 -1.09% >> > xfs write: 753.16 754.42 +0.16% >> > ext4 read: 827.80 834.82 +0.84% >> > ext4 write: 783.08 777.58 -0.70% >> > btrfs read: 859.26 871.68 +1.44% >> > btrfs write: 815.63 844.40 +3.52% >> > >> > 2. HW RAID6: stripe size 128k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 948.27 979.11 +3.25% >> > xfs write: 820.78 819.94 -0.10% >> > ext4 read: 978.35 997.92 +2.00% >> > ext4 write: 853.51 847.97 -0.64% >> > btrfs read: 1013.1 1015.6 +0.24% >> > btrfs write: 854.43 850.42 -0.46% >> > >> > 3. MD RAID6: stripe size 64k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 847.34 869.43 +2.60% >> > xfs write: 198.67 199.03 +0.18% >> > ext4 read: 763.89 767.79 +0.51% >> > ext4 write: 281.44 282.83 +0.49% >> > btrfs read: 756.02 743.69 -1.63% >> > btrfs write: 268.37 265.93 -0.90% >> > >> > 4. 
MD RAID6: stripe size 128k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 993.04 1014.1 +2.12% >> > xfs write: 293.06 298.95 +2.00% >> > ext4 read: 1019.6 1020.9 +0.12% >> > ext4 write: 371.51 371.47 -0.01% >> > btrfs read: 1000.4 1020.8 +2.03% >> > btrfs write: 241.08 246.77 +2.36% >> > >> > 5. DM: stripe size 64k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 1084.4 1080.1 -0.39% >> > xfs write: 1071.1 1063.4 -0.71% >> > ext4 read: 991.54 1003.7 +1.22% >> > ext4 write: 1069.7 1052.2 -1.63% >> > btrfs read: 1076.1 1082.1 +0.55% >> > btrfs write: 968.98 965.07 -0.40% >> > >> > 6. DM: stripe size 128k >> > 4.1-rc4 4.1-rc4-patched >> > ------- --------------- >> > (MB/s) (MB/s) >> > xfs read: 1020.4 1066.1 +4.47% >> > xfs write: 1058.2 1066.6 +0.79% >> > ext4 read: 990.72 988.19 -0.25% >> > ext4 write: 1050.4 1070.2 +1.88% >> > btrfs read: 1080.9 1074.7 -0.57% >> > btrfs write: 975.10 972.76 -0.23% >> >> Hi Mike, >> >> How about these numbers? > > Looks fairly good. I just am not sure the workload is going to test the > code paths in question like we'd hope. I'll have to set aside some time > to think through scenarios to test. Hi Mike, Will you get a chance to think about it? Thanks. > > My concern still remains that at some point it the future we'll regret > not having merge_bvec but it'll be too late. That is just my own FUD at > this point... > >> I'm also happy to run other fio jobs your team used. > > I've been busy getting DM changes for the 4.2 merge window finalized. > As such I haven't connected with others on the team to discuss this > issue. > > I'll see if we can make time in the next 2 days. But I also have > RHEL-specific kernel deadlines I'm coming up against. > > Seems late to be staging this extensive a change for 4.2... are you > pushing for this code to land in the 4.2 merge window? Or do we have > time to work this further and target the 4.3 merge? > > Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
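The discard path of the patch below is the easiest place to see the splitting arithmetic in isolation. The following is a user-space model of what blk_bio_discard_split() computes, not the kernel code itself; the helper name, the plain modulo in place of sector_div(), and the example device limits are assumptions made for illustration. The idea: cap the first piece at max_discard_sectors rounded down to the discard granularity, then pull the split point back so the next piece starts on an aligned boundary.

#include <stdio.h>

/*
 * User-space model of the arithmetic in blk_bio_discard_split().
 * All quantities are in 512-byte sectors; sector_div() is replaced
 * by a plain modulo. Returns how many sectors the first piece of an
 * oversized discard should carry.
 */
static unsigned split_discard_sectors(unsigned long long start_sector,
                                      unsigned bio_sectors,
                                      unsigned max_discard_sectors,
                                      unsigned granularity,
                                      unsigned discard_alignment)
{
    unsigned split_sectors, alignment;
    unsigned long long tmp;

    /* Zero-sector (unknown) and one-sector granularities are the same. */
    if (granularity == 0)
        granularity = 1;

    /* Never hand out a partial granule at the limit. */
    max_discard_sectors -= max_discard_sectors % granularity;

    if (bio_sectors <= max_discard_sectors)
        return bio_sectors;             /* fits: no split needed */

    split_sectors = max_discard_sectors;

    /*
     * If the next starting sector would be misaligned, stop the first
     * piece at the previous aligned sector instead.
     */
    alignment = discard_alignment % granularity;
    tmp = (start_sector + split_sectors - alignment) % granularity;
    if (split_sectors > tmp)
        split_sectors -= (unsigned)tmp;

    return split_sectors;
}

int main(void)
{
    /*
     * Assumed device: 4 MiB discard limit (8192 sectors), 1 MiB
     * granularity (2048 sectors), no extra alignment offset.
     * A 32 MiB discard starts 512 sectors into a granule.
     */
    unsigned first = split_discard_sectors(512, 65536, 8192, 2048, 0);

    printf("first piece: %u sectors, next piece starts at sector %llu\n",
           first, 512 + (unsigned long long)first);
    return 0;
}

The WRITE SAME and regular read/write cases follow the same overall pattern in blk_queue_split(): work out how much of the front of the bio fits the queue's limits, split that much off, chain the remainder back through generic_make_request(), and let the caller continue with the front piece.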
diff --git a/block/blk-core.c b/block/blk-core.c index 7871603..fbbb337 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -619,6 +619,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) if (q->id < 0) goto fail_q; + q->bio_split = bioset_create(BIO_POOL_SIZE, 0); + if (!q->bio_split) + goto fail_id; + q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; q->backing_dev_info.state = 0; @@ -628,7 +632,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) err = bdi_init(&q->backing_dev_info); if (err) - goto fail_id; + goto fail_split; setup_timer(&q->backing_dev_info.laptop_mode_wb_timer, laptop_mode_timer_fn, (unsigned long) q); @@ -670,6 +674,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) fail_bdi: bdi_destroy(&q->backing_dev_info); +fail_split: + bioset_free(q->bio_split); fail_id: ida_simple_remove(&blk_queue_ida, q->id); fail_q: @@ -1586,6 +1592,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio) struct request *req; unsigned int request_count = 0; + blk_queue_split(q, &bio, q->bio_split); + /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1809,15 +1817,6 @@ generic_make_request_checks(struct bio *bio) goto end_io; } - if (likely(bio_is_rw(bio) && - nr_sectors > queue_max_hw_sectors(q))) { - printk(KERN_ERR "bio too big device %s (%u > %u)\n", - bdevname(bio->bi_bdev, b), - bio_sectors(bio), - queue_max_hw_sectors(q)); - goto end_io; - } - part = bio->bi_bdev->bd_part; if (should_fail_request(part, bio->bi_iter.bi_size) || should_fail_request(&part_to_disk(part)->part0, diff --git a/block/blk-merge.c b/block/blk-merge.c index fd3fee8..dc14255 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -9,12 +9,158 @@ #include "blk.h" +static struct bio *blk_bio_discard_split(struct request_queue *q, + struct bio *bio, + struct bio_set *bs) +{ + unsigned int max_discard_sectors, granularity; + int alignment; + sector_t tmp; + unsigned split_sectors; + + /* Zero-sector (unknown) and one-sector granularities are the same. */ + granularity = max(q->limits.discard_granularity >> 9, 1U); + + max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9); + max_discard_sectors -= max_discard_sectors % granularity; + + if (unlikely(!max_discard_sectors)) { + /* XXX: warn */ + return NULL; + } + + if (bio_sectors(bio) <= max_discard_sectors) + return NULL; + + split_sectors = max_discard_sectors; + + /* + * If the next starting sector would be misaligned, stop the discard at + * the previous aligned sector. 
+ */ + alignment = (q->limits.discard_alignment >> 9) % granularity; + + tmp = bio->bi_iter.bi_sector + split_sectors - alignment; + tmp = sector_div(tmp, granularity); + + if (split_sectors > tmp) + split_sectors -= tmp; + + return bio_split(bio, split_sectors, GFP_NOIO, bs); +} + +static struct bio *blk_bio_write_same_split(struct request_queue *q, + struct bio *bio, + struct bio_set *bs) +{ + if (!q->limits.max_write_same_sectors) + return NULL; + + if (bio_sectors(bio) <= q->limits.max_write_same_sectors) + return NULL; + + return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs); +} + +static struct bio *blk_bio_segment_split(struct request_queue *q, + struct bio *bio, + struct bio_set *bs) +{ + struct bio *split; + struct bio_vec bv, bvprv; + struct bvec_iter iter; + unsigned seg_size = 0, nsegs = 0; + int prev = 0; + + struct bvec_merge_data bvm = { + .bi_bdev = bio->bi_bdev, + .bi_sector = bio->bi_iter.bi_sector, + .bi_size = 0, + .bi_rw = bio->bi_rw, + }; + + bio_for_each_segment(bv, bio, iter) { + if (q->merge_bvec_fn && + q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len) + goto split; + + bvm.bi_size += bv.bv_len; + + if (bvm.bi_size >> 9 > queue_max_sectors(q)) + goto split; + + /* + * If the queue doesn't support SG gaps and adding this + * offset would create a gap, disallow it. + */ + if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) && + prev && bvec_gap_to_prev(&bvprv, bv.bv_offset)) + goto split; + + if (prev && blk_queue_cluster(q)) { + if (seg_size + bv.bv_len > queue_max_segment_size(q)) + goto new_segment; + if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv)) + goto new_segment; + if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv)) + goto new_segment; + + seg_size += bv.bv_len; + bvprv = bv; + prev = 1; + continue; + } +new_segment: + if (nsegs == queue_max_segments(q)) + goto split; + + nsegs++; + bvprv = bv; + prev = 1; + seg_size = bv.bv_len; + } + + return NULL; +split: + split = bio_clone_bioset(bio, GFP_NOIO, bs); + + split->bi_iter.bi_size -= iter.bi_size; + bio->bi_iter = iter; + + if (bio_integrity(bio)) { + bio_integrity_advance(bio, split->bi_iter.bi_size); + bio_integrity_trim(split, 0, bio_sectors(split)); + } + + return split; +} + +void blk_queue_split(struct request_queue *q, struct bio **bio, + struct bio_set *bs) +{ + struct bio *split; + + if ((*bio)->bi_rw & REQ_DISCARD) + split = blk_bio_discard_split(q, *bio, bs); + else if ((*bio)->bi_rw & REQ_WRITE_SAME) + split = blk_bio_write_same_split(q, *bio, bs); + else + split = blk_bio_segment_split(q, *bio, q->bio_split); + + if (split) { + bio_chain(split, *bio); + generic_make_request(*bio); + *bio = split; + } +} +EXPORT_SYMBOL(blk_queue_split); + static unsigned int __blk_recalc_rq_segments(struct request_queue *q, struct bio *bio, bool no_sg_merge) { struct bio_vec bv, bvprv = { NULL }; - int cluster, high, highprv = 1; + int cluster, prev = 0; unsigned int seg_size, nr_phys_segs; struct bio *fbio, *bbio; struct bvec_iter iter; @@ -36,7 +182,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q, cluster = blk_queue_cluster(q); seg_size = 0; nr_phys_segs = 0; - high = 0; for_each_bio(bio) { bio_for_each_segment(bv, bio, iter) { /* @@ -46,13 +191,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q, if (no_sg_merge) goto new_segment; - /* - * the trick here is making sure that a high page is - * never considered part of another segment, since - * that might change with the bounce page. 
- */ - high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q); - if (!high && !highprv && cluster) { + if (prev && cluster) { if (seg_size + bv.bv_len > queue_max_segment_size(q)) goto new_segment; @@ -72,8 +211,8 @@ new_segment: nr_phys_segs++; bvprv = bv; + prev = 1; seg_size = bv.bv_len; - highprv = high; } bbio = bio; } diff --git a/block/blk-mq.c b/block/blk-mq.c index e68b71b..e7fae76 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1256,6 +1256,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio) return; } + blk_queue_split(q, &bio, q->bio_split); + rq = blk_mq_map_request(q, bio, &data); if (unlikely(!rq)) return; @@ -1339,6 +1341,8 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio) return; } + blk_queue_split(q, &bio, q->bio_split); + if (use_plug && !blk_queue_nomerges(q) && blk_attempt_plug_merge(q, bio, &request_count)) return; diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c index 3907202..a6265bc 100644 --- a/drivers/block/drbd/drbd_req.c +++ b/drivers/block/drbd/drbd_req.c @@ -1497,6 +1497,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio) struct drbd_device *device = (struct drbd_device *) q->queuedata; unsigned long start_jif; + blk_queue_split(q, &bio, q->bio_split); + start_jif = jiffies; /* diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index 09e628da..ea10bd9 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -2446,6 +2446,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio) char b[BDEVNAME_SIZE]; struct bio *split; + blk_queue_bounce(q, &bio); + + blk_queue_split(q, &bio, q->bio_split); + pd = q->queuedata; if (!pd) { pr_err("%s incorrect request queue\n", @@ -2476,8 +2480,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio) goto end_io; } - blk_queue_bounce(q, &bio); - do { sector_t zone = get_zone(bio->bi_iter.bi_sector, pd); sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd); diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c index ef45cfb..e32e799 100644 --- a/drivers/block/ps3vram.c +++ b/drivers/block/ps3vram.c @@ -605,6 +605,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio) dev_dbg(&dev->core, "%s\n", __func__); + blk_queue_split(q, &bio, q->bio_split); + spin_lock_irq(&priv->lock); busy = !bio_list_empty(&priv->list); bio_list_add(&priv->list, bio); diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c index ac8c62c..50ef199 100644 --- a/drivers/block/rsxx/dev.c +++ b/drivers/block/rsxx/dev.c @@ -148,6 +148,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio) struct rsxx_bio_meta *bio_meta; int st = -EINVAL; + blk_queue_split(q, &bio, q->bio_split); + might_sleep(); if (!card) diff --git a/drivers/block/umem.c b/drivers/block/umem.c index 4cf81b5..13d577c 100644 --- a/drivers/block/umem.c +++ b/drivers/block/umem.c @@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio) (unsigned long long)bio->bi_iter.bi_sector, bio->bi_iter.bi_size); + blk_queue_split(q, &bio, q->bio_split); + spin_lock_irq(&card->lock); *card->biotail = bio; bio->bi_next = NULL; diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 8dcbced..36a004e 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -981,6 +981,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio) if (unlikely(!zram_meta_get(zram))) goto error; + 
blk_queue_split(queue, &bio, queue->bio_split); + if (!valid_io_request(zram, bio->bi_iter.bi_sector, bio->bi_iter.bi_size)) { atomic64_inc(&zram->stats.invalid_io); diff --git a/drivers/md/dm.c b/drivers/md/dm.c index a930b72..34f6063 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1784,6 +1784,8 @@ static void dm_make_request(struct request_queue *q, struct bio *bio) map = dm_get_live_table(md, &srcu_idx); + blk_queue_split(q, &bio, q->bio_split); + generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0); /* if we're suspended, we have to queue this io for later */ diff --git a/drivers/md/md.c b/drivers/md/md.c index 593a024..046b3c9 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -257,6 +257,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio) unsigned int sectors; int cpu; + blk_queue_split(q, &bio, q->bio_split); + if (mddev == NULL || mddev->pers == NULL || !mddev->ready) { bio_io_error(bio); diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index da21281..267ca3a 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -826,6 +826,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio) unsigned long source_addr; unsigned long bytes_done; + blk_queue_split(q, &bio, q->bio_split); + bytes_done = 0; dev_info = bio->bi_bdev->bd_disk->private_data; if (dev_info == NULL) diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c index 7d4e939..1305ed3 100644 --- a/drivers/s390/block/xpram.c +++ b/drivers/s390/block/xpram.c @@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio) unsigned long page_addr; unsigned long bytes; + blk_queue_split(q, &bio, q->bio_split); + if ((bio->bi_iter.bi_sector & 7) != 0 || (bio->bi_iter.bi_size & 4095) != 0) /* Request is not page-aligned. */ diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c index 413a840..a8645a9 100644 --- a/drivers/staging/lustre/lustre/llite/lloop.c +++ b/drivers/staging/lustre/lustre/llite/lloop.c @@ -340,6 +340,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio) int rw = bio_rw(old_bio); int inactive; + blk_queue_split(q, &old_bio, q->bio_split); + if (!lo) goto err; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 7f9a516..93b81a2 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -488,6 +488,7 @@ struct request_queue { struct blk_mq_tag_set *tag_set; struct list_head tag_set_list; + struct bio_set *bio_split; }; #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */ @@ -812,6 +813,8 @@ extern void blk_rq_unprep_clone(struct request *rq); extern int blk_insert_cloned_request(struct request_queue *q, struct request *rq); extern void blk_delay_queue(struct request_queue *, unsigned long); +extern void blk_queue_split(struct request_queue *, struct bio **, + struct bio_set *); extern void blk_recount_segments(struct request_queue *, struct bio *); extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int); extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,