[v4,01/11] block: make generic_make_request handle arbitrarily sized bios

Message ID 1432318723-18829-2-git-send-email-mlin@kernel.org (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer

Commit Message

Ming Lin May 22, 2015, 6:18 p.m. UTC
From: Kent Overstreet <kent.overstreet@gmail.com>

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.
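
For illustration only (this is not a hunk from the patch, just the rough
shape of the per-driver change, assuming the three-argument
blk_queue_split() form this series introduces):

static void example_make_request(struct request_queue *q, struct bio *bio)
{
        /* bounce first (only in drivers that bounce), so the splitting and
         * segment-counting code below sees the final pages */
        blk_queue_bounce(q, &bio);

        /* let the block layer split the bio to fit this queue's limits */
        blk_queue_split(q, &bio, q->bio_split);

        /* ... existing driver-specific handling of the (possibly split) bio ... */
}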

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <mlin@kernel.org>
---
 block/blk-core.c                            |  19 ++--
 block/blk-merge.c                           | 159 ++++++++++++++++++++++++++--
 block/blk-mq.c                              |   4 +
 drivers/block/drbd/drbd_req.c               |   2 +
 drivers/block/pktcdvd.c                     |   6 +-
 drivers/block/ps3vram.c                     |   2 +
 drivers/block/rsxx/dev.c                    |   2 +
 drivers/block/umem.c                        |   2 +
 drivers/block/zram/zram_drv.c               |   2 +
 drivers/md/dm.c                             |   2 +
 drivers/md/md.c                             |   2 +
 drivers/s390/block/dcssblk.c                |   2 +
 drivers/s390/block/xpram.c                  |   2 +
 drivers/staging/lustre/lustre/llite/lloop.c |   2 +
 include/linux/blkdev.h                      |   3 +
 15 files changed, 189 insertions(+), 22 deletions(-)

Comments

Mike Snitzer May 26, 2015, 2:36 p.m. UTC | #1
On Fri, May 22 2015 at  2:18pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> The way the block layer is currently written, it goes to great lengths
> to avoid having to split bios; upper layer code (such as bio_add_page())
> checks what the underlying device can handle and tries to always create
> bios that don't need to be split.
> 
> But this approach becomes unwieldy and eventually breaks down with
> stacked devices and devices with dynamic limits, and it adds a lot of
> complexity. If the block layer could split bios as needed, we could
> eliminate a lot of complexity elsewhere - particularly in stacked
> drivers. Code that creates bios can then create whatever size bios are
> convenient, and more importantly stacked drivers don't have to deal with
> both their own bio size limitations and the limitations of the
> (potentially multiple) devices underneath them.  In the future this will
> let us delete merge_bvec_fn and a bunch of other code.

This series doesn't take any steps to train upper layers
(e.g. filesystems) to size their bios larger (which is defined as
"whatever size bios are convenient" above).

bio_add_page(), and merge_bvec_fn, served as the means for upper layers
(and direct IO) to build up optimally sized bios.  Without a replacement
(that I can see anyway) how is this patchset making forward progress
(getting Acks, etc)!?

I like the idea of reduced complexity associated with these late bio
splitting changes, but I'm just not seeing how this is ready given there are
no upper layer changes that speak to building larger bios...

What am I missing?

Please advise, thanks!
Mike

Ming Lin May 26, 2015, 3:02 p.m. UTC | #2
On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Fri, May 22 2015 at  2:18pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> From: Kent Overstreet <kent.overstreet@gmail.com>
>>
>> The way the block layer is currently written, it goes to great lengths
>> to avoid having to split bios; upper layer code (such as bio_add_page())
>> checks what the underlying device can handle and tries to always create
>> bios that don't need to be split.
>>
>> But this approach becomes unwieldy and eventually breaks down with
>> stacked devices and devices with dynamic limits, and it adds a lot of
>> complexity. If the block layer could split bios as needed, we could
>> eliminate a lot of complexity elsewhere - particularly in stacked
>> drivers. Code that creates bios can then create whatever size bios are
>> convenient, and more importantly stacked drivers don't have to deal with
>> both their own bio size limitations and the limitations of the
>> (potentially multiple) devices underneath them.  In the future this will
>> let us delete merge_bvec_fn and a bunch of other code.
>
> This series doesn't take any steps to train upper layers
> (e.g. filesystems) to size their bios larger (which is defined as
> "whatever size bios are convenient" above).
>
> bio_add_page(), and merge_bvec_fn, served as the means for upper layers
> (and direct IO) to build up optimally sized bios.  Without a replacement
> (that I can see anyway) how is this patchset making forward progress
> (getting Acks, etc)!?
>
> I like the idea of reduced complexity associated with these late bio
> splitting changes I'm just not seeing how this is ready given there are
> no upper layer changes that speak to building larger bios..
>
> What am I missing?

See: [PATCH v4 02/11] block: simplify bio_add_page()
https://lkml.org/lkml/2015/5/22/754

Now bio_add_page() can build larger bios.
And blk_queue_split() can split the bios in ->make_request() if needed.
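
In case a sketch helps: the idea of that patch (not its literal code) is
that, with late splitting available, bio_add_page() no longer has to ask
the queue (merge_bvec_fn, max_sectors, ...) whether a page will fit the
device; it only needs room in the bio's own vector table, roughly:

static int sketch_bio_add_page(struct bio *bio, struct page *page,
                               unsigned int len, unsigned int off)
{
        struct bio_vec *bv;

        /* try to coalesce with the last segment */
        if (bio->bi_vcnt > 0) {
                bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
                if (page == bv->bv_page &&
                    off == bv->bv_offset + bv->bv_len) {
                        bv->bv_len += len;
                        goto done;
                }
        }

        if (bio->bi_vcnt >= bio->bi_max_vecs)
                return 0;       /* caller submits this bio and starts a new one */

        bv = &bio->bi_io_vec[bio->bi_vcnt];
        bv->bv_page = page;
        bv->bv_len = len;
        bv->bv_offset = off;
        bio->bi_vcnt++;
done:
        bio->bi_iter.bi_size += len;
        return len;
}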

Thanks.

>
> Please advise, thanks!
> Mike

Alasdair G Kergon May 26, 2015, 3:34 p.m. UTC | #3
On Tue, May 26, 2015 at 08:02:08AM -0700, Ming Lin wrote:
> Now bio_add_page() can build larger bios.
> And blk_queue_split() can split the bios in ->make_request() if needed.

But why not try to make the bio the right size in the first place so you
don't have to incur the performance impact of splitting?

What performance testing have you yet done to demonstrate the *actual* impact
of this patchset in situations where merge_bvec_fn is currently a net benefit?

Alasdair

Mike Snitzer May 26, 2015, 4:04 p.m. UTC | #4
On Tue, May 26 2015 at 11:02am -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Fri, May 22 2015 at  2:18pm -0400,
> > Ming Lin <mlin@kernel.org> wrote:
> >
> >> From: Kent Overstreet <kent.overstreet@gmail.com>
> >>
> >> The way the block layer is currently written, it goes to great lengths
> >> to avoid having to split bios; upper layer code (such as bio_add_page())
> >> checks what the underlying device can handle and tries to always create
> >> bios that don't need to be split.
> >>
> >> But this approach becomes unwieldy and eventually breaks down with
> >> stacked devices and devices with dynamic limits, and it adds a lot of
> >> complexity. If the block layer could split bios as needed, we could
> >> eliminate a lot of complexity elsewhere - particularly in stacked
> >> drivers. Code that creates bios can then create whatever size bios are
> >> convenient, and more importantly stacked drivers don't have to deal with
> >> both their own bio size limitations and the limitations of the
> >> (potentially multiple) devices underneath them.  In the future this will
> >> let us delete merge_bvec_fn and a bunch of other code.
> >
> > This series doesn't take any steps to train upper layers
> > (e.g. filesystems) to size their bios larger (which is defined as
> > "whatever size bios are convenient" above).
> >
> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers
> > (and direct IO) to build up optimally sized bios.  Without a replacement
> > (that I can see anyway) how is this patchset making forward progress
> > (getting Acks, etc)!?
> >
> > I like the idea of reduced complexity associated with these late bio
> > splitting changes I'm just not seeing how this is ready given there are
> > no upper layer changes that speak to building larger bios..
> >
> > What am I missing?
> 
> See: [PATCH v4 02/11] block: simplify bio_add_page()
> https://lkml.org/lkml/2015/5/22/754
> 
> Now bio_add_page() can build larger bios.
> And blk_queue_split() can split the bios in ->make_request() if needed.

That'll result in quite large bios and always needing splitting.

As Alasdair asked: please provide some performance data that justifies
these changes.  E.g. use a setup like: XFS on a DM striped target.  We
can iterate on more complex setups once we have established some basic
tests.

If you're just punting to reviewers to do the testing for you, that isn't
going to instill _any_ confidence in me that this patchset is a suitable
replacement relative to performance.

Ming Lin May 26, 2015, 5:17 p.m. UTC | #5
On Tue, May 26, 2015 at 9:04 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, May 26 2015 at 11:02am -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Fri, May 22 2015 at  2:18pm -0400,
>> > Ming Lin <mlin@kernel.org> wrote:
>> >
>> >> From: Kent Overstreet <kent.overstreet@gmail.com>
>> >>
>> >> The way the block layer is currently written, it goes to great lengths
>> >> to avoid having to split bios; upper layer code (such as bio_add_page())
>> >> checks what the underlying device can handle and tries to always create
>> >> bios that don't need to be split.
>> >>
>> >> But this approach becomes unwieldy and eventually breaks down with
>> >> stacked devices and devices with dynamic limits, and it adds a lot of
>> >> complexity. If the block layer could split bios as needed, we could
>> >> eliminate a lot of complexity elsewhere - particularly in stacked
>> >> drivers. Code that creates bios can then create whatever size bios are
>> >> convenient, and more importantly stacked drivers don't have to deal with
>> >> both their own bio size limitations and the limitations of the
>> >> (potentially multiple) devices underneath them.  In the future this will
>> >> let us delete merge_bvec_fn and a bunch of other code.
>> >
>> > This series doesn't take any steps to train upper layers
>> > (e.g. filesystems) to size their bios larger (which is defined as
>> > "whatever size bios are convenient" above).
>> >
>> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers
>> > (and direct IO) to build up optimally sized bios.  Without a replacement
>> > (that I can see anyway) how is this patchset making forward progress
>> > (getting Acks, etc)!?
>> >
>> > I like the idea of reduced complexity associated with these late bio
>> > splitting changes I'm just not seeing how this is ready given there are
>> > no upper layer changes that speak to building larger bios..
>> >
>> > What am I missing?
>>
>> See: [PATCH v4 02/11] block: simplify bio_add_page()
>> https://lkml.org/lkml/2015/5/22/754
>>
>> Now bio_add_page() can build larger bios.
>> And blk_queue_split() can split the bios in ->make_request() if needed.
>
> That'll result in quite large bios and always needing splitting.
>
> As Alasdair asked: please provide some performance data that justifies
> these changes.  E.g use a setup like: XFS on a DM striped target.  We
> can iterate on more complex setups once we have established some basic
> tests.

I'll test XFS on DM and also what Christoph suggested:
https://lkml.org/lkml/2015/5/25/226

>
> If you're just punting to reviewers to do the testing for you that isn't
> going to instill _any_ confidence in me for this patchset as a suitable
> replacement relative to performance.

Kent's Direct IO rewrite patch depends on this series.
https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-dio-rewrite

I did test the dio patch on a 2-socket (48 logical CPUs) server and
saw a 40% improvement with 48 null_blk devices.
Here is the fio data for 4K reads.

4.1-rc2
----------
Test 1: bw=50509MB/s, iops=12930K
Test 2: bw=49745MB/s, iops=12735K
Test 3: bw=50297MB/s, iops=12876K
Average: bw=50183MB/s, iops=12847K

4.1-rc2-dio-rewrite
------------------------
Test 1: bw=70269MB/s, iops=17989K
Test 2: bw=70097MB/s, iops=17945K
Test 3: bw=70907MB/s, iops=18152K
Average: bw=70424MB/s, iops=18028K

NeilBrown May 26, 2015, 11:06 p.m. UTC | #6
On Tue, 26 May 2015 16:34:14 +0100 Alasdair G Kergon <agk@redhat.com> wrote:

> On Tue, May 26, 2015 at 08:02:08AM -0700, Ming Lin wrote:
> > Now bio_add_page() can build larger bios.
> > And blk_queue_split() can split the bios in ->make_request() if needed.
> 
> But why not try to make the bio the right size in the first place so you
> don't have to incur the performance impact of splitting?

Because we don't know what the "right" size is.  And the "right" size can
change when array reconfiguration happens.

Splitting has to happen somewhere, if only in bio_add_page() where it decides to
create a new bio rather than add another page to the current one.  So moving
the split to a different level of the stack shouldn't necessarily change the
performance profile.

Obviously testing is important to confirm that.

NeilBrown

> 
> What performance testing have you yet done to demonstrate the *actual* impact
> of this patchset in situations where merge_bvec_fn is currently a net benefit?
> 
> Alasdair
> 
Alasdair G Kergon May 27, 2015, 12:40 a.m. UTC | #7
On Wed, May 27, 2015 at 09:06:40AM +1000, Neil Brown wrote:
> Because we don't know what the "right" size is.  And the "right" size can
> change when array reconfiguration happens.
 
In certain configurations today, device-mapper does report back a sensible
maximum bio size smaller than would otherwise be used and thereby avoids
retrospective splitting.  (In tests, the overhead of the duplicate calculation
was found to be negligible so we never restructured the code to optimise it away.)

> Splitting has to happen somewhere, if only in bio_add_page() where it decides to
> create a new bio rather than add another page to the current one.  So moving
> the split to a different level of the stack shouldn't necessarily change the
> performance profile.
 
It does sometimes make a significant difference to device-mapper stacks.
DM only uses it for performance reasons - it can already split bios when
it needs to.  I tried to remove merge_bvec_fn from DM several years ago but
couldn't because of the adverse performance impact of lots of splitting activity.

The overall cost of splitting ought to be less in many (but not necessarily
all) cases now as a result of all these patches, so exactly where the best
balance lies now needs to be reassessed empirically.  It is hard to reach
conclusions theoretically because of the complex interplay between the various
factors at different levels.

Alasdair

Christoph Hellwig May 27, 2015, 8:20 a.m. UTC | #8
On Wed, May 27, 2015 at 01:40:22AM +0100, Alasdair G Kergon wrote:
> It does sometimes make a significant difference to device-mapper stacks.
> DM only uses it for performance reasons - it can already split bios when
> it needs to.  I tried to remove merge_bvec_fn from DM several years ago but
> couldn't because of the adverse performance impact of lots of splitting activity.

Does it still?  Since the move to immutable biovecs the bio splits are
pretty cheap now, but I'd really like to see this verified by benchmarks.
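
For reference, the reason they are cheap: with immutable biovecs a split is
basically a clone of the bio header plus an iterator adjustment, with no
bio_vec or page copying.  A simplified sketch of what bio_split() boils down
to (error and integrity handling omitted; 'sectors' is the size of the front
piece, 'bs' the bio_set to clone from):

static struct bio *sketch_split(struct bio *bio, int sectors, struct bio_set *bs)
{
        struct bio *split;

        split = bio_clone_fast(bio, GFP_NOIO, bs);  /* shares bi_io_vec with the parent */
        if (!split)
                return NULL;
        split->bi_iter.bi_size = sectors << 9;      /* child covers the first 'sectors' sectors */
        bio_advance(bio, split->bi_iter.bi_size);   /* parent now starts where the child ends */
        return split;
}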

Ming Lin May 27, 2015, 11:42 p.m. UTC | #9
On Tue, May 26, 2015 at 9:04 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, May 26 2015 at 11:02am -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Fri, May 22 2015 at  2:18pm -0400,
>> > Ming Lin <mlin@kernel.org> wrote:
>> >
>> >> From: Kent Overstreet <kent.overstreet@gmail.com>
>> >>
>> >> The way the block layer is currently written, it goes to great lengths
>> >> to avoid having to split bios; upper layer code (such as bio_add_page())
>> >> checks what the underlying device can handle and tries to always create
>> >> bios that don't need to be split.
>> >>
>> >> But this approach becomes unwieldy and eventually breaks down with
>> >> stacked devices and devices with dynamic limits, and it adds a lot of
>> >> complexity. If the block layer could split bios as needed, we could
>> >> eliminate a lot of complexity elsewhere - particularly in stacked
>> >> drivers. Code that creates bios can then create whatever size bios are
>> >> convenient, and more importantly stacked drivers don't have to deal with
>> >> both their own bio size limitations and the limitations of the
>> >> (potentially multiple) devices underneath them.  In the future this will
>> >> let us delete merge_bvec_fn and a bunch of other code.
>> >
>> > This series doesn't take any steps to train upper layers
>> > (e.g. filesystems) to size their bios larger (which is defined as
>> > "whatever size bios are convenient" above).
>> >
>> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers
>> > (and direct IO) to build up optimally sized bios.  Without a replacement
>> > (that I can see anyway) how is this patchset making forward progress
>> > (getting Acks, etc)!?
>> >
>> > I like the idea of reduced complexity associated with these late bio
>> > splitting changes I'm just not seeing how this is ready given there are
>> > no upper layer changes that speak to building larger bios..
>> >
>> > What am I missing?
>>
>> See: [PATCH v4 02/11] block: simplify bio_add_page()
>> https://lkml.org/lkml/2015/5/22/754
>>
>> Now bio_add_page() can build larger bios.
>> And blk_queue_split() can split the bios in ->make_request() if needed.
>
> That'll result in quite large bios and always needing splitting.
>
> As Alasdair asked: please provide some performance data that justifies
> these changes.  E.g use a setup like: XFS on a DM striped target.  We
> can iterate on more complex setups once we have established some basic
> tests.

Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
Does it make sense?

                          4.1-rc4      4.1-rc4-patched
                          -------      ---------------
                          (KB/s)       (KB/s)
sequential-read-buf:      150822       151371
sequential-read-direct:   408938       421940
random-read-buf:          3404.9       3389.1
random-read-direct:       4859.8       4843.5
sequential-write-buf:     333455       335776
sequential-write-direct:  44739        43194
random-write-buf:         7272.1       7209.6
random-write-direct:      4333.9       4330.7



root@minggr:~/tmp/test# cat t.job
[global]
size=1G
directory=/mnt/
numjobs=8
group_reporting
runtime=300
time_based
bs=8k
ioengine=libaio
iodepth=64

[sequential-read-buf]
rw=read

[sequential-read-direct]
rw=read
direct=1

[random-read-buf]
rw=randread

[random-read-direct]
rw=randread
direct=1

[sequential-write-buf]
rw=write

[sequential-write-direct]
rw=write
direct=1

[random-write-buf]
rw=randwrite

[random-write-direct]
rw=randwrite
direct=1


root@minggr:~/tmp/test# cat run.sh
#!/bin/bash

jobs="sequential-read-buf sequential-read-direct random-read-buf
random-read-direct"
jobs="$jobs sequential-write-buf sequential-write-direct
random-write-buf random-write-direct"

#each partition is 100G
pvcreate /dev/sdb3 /dev/nvme0n1p1 /dev/sdc6
vgcreate striped_vol_group /dev/sdb3 /dev/nvme0n1p1 /dev/sdc6
lvcreate -i3 -I4 -L250G -nstriped_logical_volume striped_vol_group

for job in $jobs ; do
        umount /mnt > /dev/null 2>&1
        mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
        mount /dev/striped_vol_group/striped_logical_volume /mnt

        fio --output=${job}.log --section=${job} t.job
done

Alasdair G Kergon May 28, 2015, 12:36 a.m. UTC | #10
On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> Does it make sense?

To stripe across devices with different characteristics?

Some suggestions.

Prepare 3 kernels.
  O - Old kernel.
  M - Old kernel with merge_bvec_fn disabled.
  N - New kernel.

You're trying to search for counter-examples to the hypothesis that 
"Kernel N always outperforms Kernel O".  Then if you find any, trying 
to show either that the performance impediment is small enough that
it doesn't matter or that the cases are sufficiently rare or obscure
that they may be ignored because of the greater benefits of N in much more
common cases.

(1) You're looking to set up configurations where kernel O performs noticeably
better than M.  Then you're comparing the performance of O and N in those
situations.

(2) You're looking at other sensible configurations where O and M have
similar performance, and comparing that with the performance of N.

In each case you find, you expect to be able to vary some parameter (such as
stripe size) to show a progression of the effect.

When running tests you've to take care the system is reset into the same
initial state before each test, so you'll normally also try to include some
baseline test between tests that should give the same results each time
and also take the average of a number of runs (while also reporting some
measure of the variation within each set to make sure that remains low,
typically a low single digit percentage).
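
Something along these lines captures that discipline (illustrative only;
the job file names are placeholders for whatever you end up running):

#!/bin/bash
# Repeat each job, reset state between runs, and keep a baseline job
# interleaved so any drift in the test system shows up.
runs=5
for i in $(seq $runs); do
        sync; echo 3 > /proc/sys/vm/drop_caches      # same initial state each run
        fio --output=baseline-$i.json --output-format=json baseline.job
        fio --output=test-$i.json     --output-format=json test.job
done
# Pull the read bandwidth (KB/s) out of each run, then average the runs and
# check that the spread stays in the low single digits.
for f in test-*.json; do
        jq '.jobs[0].read.bw' "$f"
done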

Since we're mostly concerned about splitting, you'll want to monitor
iostat to see if that gives you enough information to home in on 
suitable configurations for (1).  Failing that, you might need to
instrument the kernels to tell you the sizes of the bios being
created and the amount of splitting actually happening.

Striping was mentioned because it forces splitting.  So show the progression
from tiny stripes to huge stripes.  (Ensure all the devices providing the
stripes have identical characteristics, but you can test with slow and
fast underlying devices.)

You may also want to test systems with a restricted amount of available
memory to show how the splitting via worker thread performs.  (Again,
instrument to prove the extent to which the new code is being exercised.)

Alasdair

Ming Lin May 28, 2015, 5:54 a.m. UTC | #11
On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <agk@redhat.com> wrote:
> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
>> Does it make sense?
>
> To stripe across devices with different characteristics?

I intended to test it on a 2-socket server with 10 NVMe drives.
But that server has been busy running other tests.

So I had to run the test on a PC which happens to have 2 SSDs + 1 HDD.

>
> Some suggestions.

Thanks for the great detail.
I'm reading through it to understand.

>
> Prepare 3 kernels.
>   O - Old kernel.
>   M - Old kernel with merge_bvec_fn disabled.
>   N - New kernel.
>
> You're trying to search for counter-examples to the hypothesis that
> "Kernel N always outperforms Kernel O".  Then if you find any, trying
> to show either that the performance impediment is small enough that
> it doesn't matter or that the cases are sufficiently rare or obscure
> that they may be ignored because of the greater benefits of N in much more
> common cases.
>
> (1) You're looking to set up configurations where kernel O performs noticeably
> better than M.  Then you're comparing the performance of O and N in those
> situations.
>
> (2) You're looking at other sensible configurations where O and M have
> similar performance, and comparing that with the performance of N.
>
> In each case you find, you expect to be able to vary some parameter (such as
> stripe size) to show a progression of the effect.
>
> When running tests you've to take care the system is reset into the same
> initial state before each test, so you'll normally also try to include some
> baseline test between tests that should give the same results each time
> and also take the average of a number of runs (while also reporting some
> measure of the variation within each set to make sure that remains low,
> typically a low single digit percentage).
>
> Since we're mostly concerned about splitting, you'll want to monitor
> iostat to see if that gives you enough information to home in on
> suitable configurations for (1).  Failing that, you might need to
> instrument the kernels to tell you the sizes of the bios being
> created and the amount of splitting actually happening.
>
> Striping was mentioned because it forces splitting.  So show the progression
> from tiny stripes to huge stripes.  (Ensure all the devices providing the
> stripes have identical characteristics, but you can test with slow and
> fast underlying devices.)
>
> You may also want to test systems with a restricted amount of available
> memory to show how the splitting via worker thread performs.  (Again,
> instrument to prove the extent to which the new code is being exercised.)
>
> Alasdair
>

Ming Lin May 29, 2015, 7:05 a.m. UTC | #12
On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <agk@redhat.com> wrote:
> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
>> Does it make sense?
>
> To stripe across devices with different characteristics?
>
> Some suggestions.
>
> Prepare 3 kernels.
>   O - Old kernel.
>   M - Old kernel with merge_bvec_fn disabled.

How do I disable it?
Maybe just hack it as below?

void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
{
        //q->merge_bvec_fn = mbfn;
}

>   N - New kernel.

Mike Snitzer May 29, 2015, 3:15 p.m. UTC | #13
On Fri, May 29 2015 at  3:05P -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Wed, May 27, 2015 at 5:36 PM, Alasdair G Kergon <agk@redhat.com> wrote:
> > On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> >> Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> >> Does it make sense?
> >
> > To stripe across devices with different characteristics?
> >
> > Some suggestions.
> >
> > Prepare 3 kernels.
> >   O - Old kernel.
> >   M - Old kernel with merge_bvec_fn disabled.
> 
> How to disable it?
> Maybe just hack it as below?
> 
> void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
> {
>         //q->merge_bvec_fn = mbfn;
> }

Right, there isn't an existing way to disable it, you'd need a hack like
that.

Ming Lin June 1, 2015, 6:02 a.m. UTC | #14
On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> > Does it make sense?
> 
> To stripe across devices with different characteristics?
> 
> Some suggestions.
> 
> Prepare 3 kernels.
>   O - Old kernel.
>   M - Old kernel with merge_bvec_fn disabled.
>   N - New kernel.
> 
> You're trying to search for counter-examples to the hypothesis that 
> "Kernel N always outperforms Kernel O".  Then if you find any, trying 
> to show either that the performance impediment is small enough that
> it doesn't matter or that the cases are sufficiently rare or obscure
> that they may be ignored because of the greater benefits of N in much more
> common cases.
> 
> (1) You're looking to set up configurations where kernel O performs noticeably
> better than M.  Then you're comparing the performance of O and N in those
> situations.
> 
> (2) You're looking at other sensible configurations where O and M have
> similar performance, and comparing that with the performance of N.

I didn't find case (1).

But the important thing for this series is to simplify the block layer
based on immutable biovecs. I don't expect a performance improvement.

Here is the changes statistics.

"68 files changed, 336 insertions(+), 1331 deletions(-)"

I ran the 3 test cases below to make sure it didn't bring any regressions.
Test environment: 2 NVMe drives on a 2-socket server.
Each case ran for 30 minutes.

1) btrfs raid0

mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
mount /dev/nvme0n1 /mnt

Then run 8K read.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=4
rw=read

[job1]
bs=8K
directory=/mnt
size=1G

2) ext4 on MD raid5

mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt

fio script same as btrfs test

3) xfs on DM striped target

pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
mount /dev/striped_vol_group/striped_logical_volume /mnt

fio script same as btrfs test

------

Results:

	4.1-rc4		4.1-rc4-patched
btrfs	1818.6MB/s	1874.1MB/s
ext4	717307KB/s	714030KB/s
xfs	1396.6MB/s	1398.6MB/s


Ming Lin June 2, 2015, 8:59 p.m. UTC | #15
On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote:
> On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
>> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
>> > Does it make sense?
>>
>> To stripe across devices with different characteristics?
>>
>> Some suggestions.
>>
>> Prepare 3 kernels.
>>   O - Old kernel.
>>   M - Old kernel with merge_bvec_fn disabled.
>>   N - New kernel.
>>
>> You're trying to search for counter-examples to the hypothesis that
>> "Kernel N always outperforms Kernel O".  Then if you find any, trying
>> to show either that the performance impediment is small enough that
>> it doesn't matter or that the cases are sufficiently rare or obscure
>> that they may be ignored because of the greater benefits of N in much more
>> common cases.
>>
>> (1) You're looking to set up configurations where kernel O performs noticeably
>> better than M.  Then you're comparing the performance of O and N in those
>> situations.
>>
>> (2) You're looking at other sensible configurations where O and M have
>> similar performance, and comparing that with the performance of N.
>
> I didn't find case (1).
>
> But the important thing for this series is to simplify block layer
> based on immutable biovecs. I don't expect performance improvement.
>
> Here is the changes statistics.
>
> "68 files changed, 336 insertions(+), 1331 deletions(-)"
>
> I run below 3 test cases to make sure it didn't bring any regressions.
> Test environment: 2 NVMe drives on 2 sockets server.
> Each case run for 30 minutes.
>
> 1) btrfs raid0
>
> mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> mount /dev/nvme0n1 /mnt
>
> Then run 8K read.
>
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=4
> rw=read
>
> [job1]
> bs=8K
> directory=/mnt
> size=1G
>
> 2) ext4 on MD raid5
>
> mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> mkfs.ext4 /dev/md0
> mount /dev/md0 /mnt
>
> fio script same as btrfs test
>
> 3) xfs on DM striped target
>
> pvcreate /dev/nvme0n1 /dev/nvme1n1
> vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> mount /dev/striped_vol_group/striped_logical_volume /mnt
>
> fio script same as btrfs test
>
> ------
>
> Results:
>
>         4.1-rc4         4.1-rc4-patched
> btrfs   1818.6MB/s      1874.1MB/s
> ext4    717307KB/s      714030KB/s
> xfs     1396.6MB/s      1398.6MB/s

Hi Alasdair & Mike,

Would you like these numbers?
I'd like to address your concerns to move forward.

Thanks.

Mike Snitzer June 4, 2015, 9:06 p.m. UTC | #16
On Tue, Jun 02 2015 at  4:59pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote:
> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> >> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> >> > Does it make sense?
> >>
> >> To stripe across devices with different characteristics?
> >>
> >> Some suggestions.
> >>
> >> Prepare 3 kernels.
> >>   O - Old kernel.
> >>   M - Old kernel with merge_bvec_fn disabled.
> >>   N - New kernel.
> >>
> >> You're trying to search for counter-examples to the hypothesis that
> >> "Kernel N always outperforms Kernel O".  Then if you find any, trying
> >> to show either that the performance impediment is small enough that
> >> it doesn't matter or that the cases are sufficiently rare or obscure
> >> that they may be ignored because of the greater benefits of N in much more
> >> common cases.
> >>
> >> (1) You're looking to set up configurations where kernel O performs noticeably
> >> better than M.  Then you're comparing the performance of O and N in those
> >> situations.
> >>
> >> (2) You're looking at other sensible configurations where O and M have
> >> similar performance, and comparing that with the performance of N.
> >
> > I didn't find case (1).
> >
> > But the important thing for this series is to simplify block layer
> > based on immutable biovecs. I don't expect performance improvement.

No, simplifying isn't the important thing.  Any change to remove the
merge_bvec callbacks needs to not introduce performance regressions on
enterprise systems with large RAID arrays, etc.

It is fine if there isn't a performance improvement but I really don't
think the limited testing you've done on a relatively small storage
configuration has come even close to showing these changes don't
introduce performance regressions.

> > Here is the changes statistics.
> >
> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
> >
> > I run below 3 test cases to make sure it didn't bring any regressions.
> > Test environment: 2 NVMe drives on 2 sockets server.
> > Each case run for 30 minutes.
> >
> > 1) btrfs raid0
> >
> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> > mount /dev/nvme0n1 /mnt
> >
> > Then run 8K read.
> >
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=4
> > rw=read
> >
> > [job1]
> > bs=8K
> > directory=/mnt
> > size=1G
> >
> > 2) ext4 on MD raid5
> >
> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> > mkfs.ext4 /dev/md0
> > mount /dev/md0 /mnt
> >
> > fio script same as btrfs test
> >
> > 3) xfs on DM striped target
> >
> > pvcreate /dev/nvme0n1 /dev/nvme1n1
> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> > mount /dev/striped_vol_group/striped_logical_volume /mnt
> >
> > fio script same as btrfs test
> >
> > ------
> >
> > Results:
> >
> >         4.1-rc4         4.1-rc4-patched
> > btrfs   1818.6MB/s      1874.1MB/s
> > ext4    717307KB/s      714030KB/s
> > xfs     1396.6MB/s      1398.6MB/s
> 
> Hi Alasdair & Mike,
> 
> Would you like these numbers?
> I'd like to address your concerns to move forward.

I really don't see that these NVMe results prove much.

We need to test on large HW raid setups like a Netapp filer (or even
local SAS drives connected via some SAS controller).  Like a 8+2 drive
RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
devices is also useful.  It is larger RAID setups that will be more
sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
size boundaries.

There are tradeoffs between creating a really large bio and creating a
properly sized bio from the start.  And yes, to one of neilb's original
points, limits do change and we suck at restacking limits... so what was
once properly sized may no longer be, but that is a relatively rare
occurrence.  Late splitting does do away with the limits stacking
disconnect.  And in general I like the idea of removing all the
merge_bvec code.  I just don't think I can confidently Ack such a
wholesale switch at this point with such limited performance analysis.
If we (the DM/lvm team at Red Hat) are being painted into a corner of
having to provide our own testing that meets our definition of
"thorough" then we'll need time to carry out those tests.  But I'd hate
to hold up everyone because DM is not in agreement on this change...

So taking a step back, why can't we introduce late bio splitting in a
phased approach?

1: introduce late bio splitting to block core BUT still keep established
   merge_bvec infrastructure
2: establish a way for upper layers to skip merge_bvec if they'd like to
   do so (e.g. block-core exposes a 'use_late_bio_splitting' flag or
   something for userspace or upper layers to set, as sketched below; could
   also have a Kconfig that enables this feature by default)
3: we gain confidence in late bio-splitting and then carry on with the
   removal of merge_bvec et al (could be incrementally done on a
   per-driver basis, e.g. DM, MD, btrfs, etc, etc).
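
To make step 2 concrete, it could be as small as something like the
following; every name here is hypothetical, nothing like it exists today:

/* Hypothetical sketch only: an opt-in request_queue flag. */
#define QUEUE_FLAG_LATE_SPLIT   23      /* illustrative bit number only */

static inline bool blk_queue_late_split(struct request_queue *q)
{
        return test_bit(QUEUE_FLAG_LATE_SPLIT, &q->queue_flags);
}

/*
 * The merge paths would then only consult merge_bvec_fn when the queue has
 * not opted in to late splitting, e.g.:
 *
 *      if (!blk_queue_late_split(q) && q->merge_bvec_fn)
 *              ... current merge_bvec behaviour ...
 */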

Mike

Ming Lin June 4, 2015, 10:21 p.m. UTC | #17
On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Tue, Jun 02 2015 at  4:59pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote:
>> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
>> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> >> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
>> >> > Does it make sense?
>> >>
>> >> To stripe across devices with different characteristics?
>> >>
>> >> Some suggestions.
>> >>
>> >> Prepare 3 kernels.
>> >>   O - Old kernel.
>> >>   M - Old kernel with merge_bvec_fn disabled.
>> >>   N - New kernel.
>> >>
>> >> You're trying to search for counter-examples to the hypothesis that
>> >> "Kernel N always outperforms Kernel O".  Then if you find any, trying
>> >> to show either that the performance impediment is small enough that
>> >> it doesn't matter or that the cases are sufficiently rare or obscure
>> >> that they may be ignored because of the greater benefits of N in much more
>> >> common cases.
>> >>
>> >> (1) You're looking to set up configurations where kernel O performs noticeably
>> >> better than M.  Then you're comparing the performance of O and N in those
>> >> situations.
>> >>
>> >> (2) You're looking at other sensible configurations where O and M have
>> >> similar performance, and comparing that with the performance of N.
>> >
>> > I didn't find case (1).
>> >
>> > But the important thing for this series is to simplify block layer
>> > based on immutable biovecs. I don't expect performance improvement.
>
> No simplifying isn't the important thing.  Any change to remove the
> merge_bvec callbacks needs to not introduce performance regressions on
> enterprise systems with large RAID arrays, etc.
>
> It is fine if there isn't a performance improvement but I really don't
> think the limited testing you've done on a relatively small storage
> configuration has come even close to showing these changes don't
> introduce performance regressions.
>
>> > Here is the changes statistics.
>> >
>> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
>> >
>> > I run below 3 test cases to make sure it didn't bring any regressions.
>> > Test environment: 2 NVMe drives on 2 sockets server.
>> > Each case run for 30 minutes.
>> >
>> > 1) btrfs raid0
>> >
>> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
>> > mount /dev/nvme0n1 /mnt
>> >
>> > Then run 8K read.
>> >
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=4
>> > rw=read
>> >
>> > [job1]
>> > bs=8K
>> > directory=/mnt
>> > size=1G
>> >
>> > 2) ext4 on MD raid5
>> >
>> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
>> > mkfs.ext4 /dev/md0
>> > mount /dev/md0 /mnt
>> >
>> > fio script same as btrfs test
>> >
>> > 3) xfs on DM striped target
>> >
>> > pvcreate /dev/nvme0n1 /dev/nvme1n1
>> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
>> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
>> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
>> > mount /dev/striped_vol_group/striped_logical_volume /mnt
>> >
>> > fio script same as btrfs test
>> >
>> > ------
>> >
>> > Results:
>> >
>> >         4.1-rc4         4.1-rc4-patched
>> > btrfs   1818.6MB/s      1874.1MB/s
>> > ext4    717307KB/s      714030KB/s
>> > xfs     1396.6MB/s      1398.6MB/s
>>
>> Hi Alasdair & Mike,
>>
>> Would you like these numbers?
>> I'd like to address your concerns to move forward.
>
> I really don't see that these NVMe results prove much.
>
> We need to test on large HW raid setups like a Netapp filer (or even
> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> devices is also useful.  It is larger RAID setups that will be more
> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> size boundaries.

I'll test it on a large HW RAID setup.

Here is a HW RAID5 setup with 19 278G HDDs on a Dell R730xd (2 sockets/48
logical CPUs/264G memory).
http://minggr.net/pub/20150604/hw_raid5.jpg

The stripe size is 64K.

I'm going to test ext4/btrfs/xfs on it.
"bs" set to 1216k(64K * 19 = 1216k)
and run 48 jobs.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
rw=read

[job1]
bs=1216K
directory=/mnt
size=1G

Or do you have other suggestions of what tests I should run?

Thanks.

Mike Snitzer June 5, 2015, 12:06 a.m. UTC | #18
On Thu, Jun 04 2015 at  6:21pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > We need to test on large HW raid setups like a Netapp filer (or even
> > local SAS drives connected via some SAS controller).  Like a 8+2 drive
> > RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> > devices is also useful.  It is larger RAID setups that will be more
> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> > size boundaries.
> 
> I'll test it on large HW raid setup.
> 
> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
> logical cpus/264G mem).
> http://minggr.net/pub/20150604/hw_raid5.jpg
> 
> The stripe size is 64K.
> 
> I'm going to test ext4/btrfs/xfs on it.
> "bs" set to 1216k(64K * 19 = 1216k)
> and run 48 jobs.

Definitely an odd blocksize (though 1280K full stripe is pretty common
for 10+2 HW RAID6 w/ 128K chunk size).

> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> rw=read
> 
> [job1]
> bs=1216K
> directory=/mnt
> size=1G

How does time_based relate to size=1G?  It'll rewrite the same 1 gig
file repeatedly?

> Or do you have other suggestions of what tests I should run?

You're welcome to run this job but I'll also check with others here to
see what fio jobs we used in the recent past when assessing performance
of the dm-crypt parallelization changes.

Also, a lot of care needs to be taken to eliminate jitter in the system
while the test is running.  We got a lot of good insight from Bart Van
Assche on that and put it to practice.  I'll see if we can (re)summarize
that too.

Mike

Ming Lin June 5, 2015, 5:21 a.m. UTC | #19
On Thu, Jun 4, 2015 at 5:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Thu, Jun 04 2015 at  6:21pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >
>> > We need to test on large HW raid setups like a Netapp filer (or even
>> > local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> > RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> > devices is also useful.  It is larger RAID setups that will be more
>> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> > size boundaries.
>>
>> I'll test it on large HW raid setup.
>>
>> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
>> logical cpus/264G mem).
>> http://minggr.net/pub/20150604/hw_raid5.jpg
>>
>> The stripe size is 64K.
>>
>> I'm going to test ext4/btrfs/xfs on it.
>> "bs" set to 1216k(64K * 19 = 1216k)
>> and run 48 jobs.
>
> Definitely an odd blocksize (though 1280K full stripe is pretty common
> for 10+2 HW RAID6 w/ 128K chunk size).

I can change it to 10 HDDs HW RAID6 w/ 128K chunk size, then use bs=1280K

>
>> [global]
>> ioengine=libaio
>> iodepth=64
>> direct=1
>> runtime=1800
>> time_based
>> group_reporting
>> numjobs=48
>> rw=read
>>
>> [job1]
>> bs=1216K
>> directory=/mnt
>> size=1G
>
> How does time_based relate to size=1G?  It'll rewrite the same 1 gig
> file repeatedly?

The job file above is for reads.
For writes, I think so.
Does it make sense for a performance test?

>
>> Or do you have other suggestions of what tests I should run?
>
> You're welcome to run this job but I'll also check with others here to
> see what fio jobs we used in the recent past when assessing performance
> of the dm-crypt parallelization changes.

That's very helpful.

>
> Also, a lot of care needs to be taken to eliminate jitter in the system
> while the test is running.  We got a lot of good insight from Bart Van
> Assche on that and put it to practice.  I'll see if we can (re)summarize
> that too.

Very helpful too.

Thanks.

>
> Mike

Ming Lin June 9, 2015, 6:09 a.m. UTC | #20
On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> We need to test on large HW raid setups like a Netapp filer (or even
> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> devices is also useful.  It is larger RAID setups that will be more
> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> size boundaries.

Here are test results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe targets.
Each case ran for 0.5 hours, so it took 36 hours to finish all the tests on the 4.1-rc4 and 4.1-rc4-patched kernels.

No performance regressions were introduced.

Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
Stripe size 64k and 128k were tested.

devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
spare_devs="/dev/sdl /dev/sdm"
stripe_size=64 (or 128)

MD RAID6 was created by:
mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size

DM stripe target was created by:
pvcreate $devs
vgcreate striped_vol_group $devs
lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

Here is an example of fio script for stripe size 128k:
[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
gtod_reduce=0
norandommap
write_iops_log=fs

[job1]
bs=1280K
directory=/mnt
size=5G
rw=read

All results here: http://minggr.net/pub/20150608/fio_results/

Results summary:

1. HW RAID6: stripe size 64k
		4.1-rc4		4.1-rc4-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	821.23		812.20  -1.09%
xfs write:	753.16		754.42  +0.16%
ext4 read:	827.80		834.82  +0.84%
ext4 write:	783.08		777.58  -0.70%
btrfs read:	859.26		871.68  +1.44%
btrfs write:	815.63		844.40  +3.52%

2. HW RAID6: stripe size 128k
		4.1-rc4		4.1-rc4-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	948.27		979.11  +3.25%
xfs write:	820.78		819.94  -0.10%
ext4 read:	978.35		997.92  +2.00%
ext4 write:	853.51		847.97  -0.64%
btrfs read:	1013.1		1015.6  +0.24%
btrfs write:	854.43		850.42  -0.46%

3. MD RAID6: stripe size 64k
		4.1-rc4		4.1-rc4-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	847.34		869.43  +2.60%
xfs write:	198.67		199.03  +0.18%
ext4 read:	763.89		767.79  +0.51%
ext4 write:	281.44		282.83  +0.49%
btrfs read:	756.02		743.69  -1.63%
btrfs write:	268.37		265.93  -0.90%

4. MD RAID6: stripe size 128k
		4.1-rc4		4.1-rc4-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	993.04		1014.1  +2.12%
xfs write:	293.06		298.95  +2.00%
ext4 read:	1019.6		1020.9  +0.12%
ext4 write:	371.51		371.47  -0.01%
btrfs read:	1000.4		1020.8  +2.03%
btrfs write:	241.08		246.77  +2.36%

5. DM: stripe size 64k
		4.1-rc4		4.1-rc4-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	1084.4		1080.1  -0.39%
xfs write:	1071.1		1063.4  -0.71%
ext4 read:	991.54		1003.7  +1.22%
ext4 write:	1069.7		1052.2  -1.63%
btrfs read:	1076.1		1082.1  +0.55%
btrfs write:	968.98		965.07  -0.40%

6. DM: stripe size 128k
		4.1-rc4		4.1-rc4-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	1020.4		1066.1  +4.47%
xfs write:	1058.2		1066.6  +0.79%
ext4 read:	990.72		988.19  -0.25%
ext4 write:	1050.4		1070.2  +1.88%
btrfs read:	1080.9		1074.7  -0.57%
btrfs write:	975.10		972.76  -0.23%





Ming Lin June 10, 2015, 9:20 p.m. UTC | #21
On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote:
> On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> We need to test on large HW raid setups like a Netapp filer (or even
>> local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> devices is also useful.  It is larger RAID setups that will be more
>> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> size boundaries.
>
> Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>
> No performance regressions were introduced.
>
> Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> Stripe size 64k and 128k were tested.
>
> devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> spare_devs="/dev/sdl /dev/sdm"
> stripe_size=64 (or 128)
>
> MD RAID6 was created by:
> mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>
> DM stripe target was created by:
> pvcreate $devs
> vgcreate striped_vol_group $devs
> lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> Here is an example of fio script for stripe size 128k:
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> gtod_reduce=0
> norandommap
> write_iops_log=fs
>
> [job1]
> bs=1280K
> directory=/mnt
> size=5G
> rw=read
>
> All results here: http://minggr.net/pub/20150608/fio_results/
>
> Results summary:
>
> 1. HW RAID6: stripe size 64k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       821.23          812.20  -1.09%
> xfs write:      753.16          754.42  +0.16%
> ext4 read:      827.80          834.82  +0.84%
> ext4 write:     783.08          777.58  -0.70%
> btrfs read:     859.26          871.68  +1.44%
> btrfs write:    815.63          844.40  +3.52%
>
> 2. HW RAID6: stripe size 128k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       948.27          979.11  +3.25%
> xfs write:      820.78          819.94  -0.10%
> ext4 read:      978.35          997.92  +2.00%
> ext4 write:     853.51          847.97  -0.64%
> btrfs read:     1013.1          1015.6  +0.24%
> btrfs write:    854.43          850.42  -0.46%
>
> 3. MD RAID6: stripe size 64k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       847.34          869.43  +2.60%
> xfs write:      198.67          199.03  +0.18%
> ext4 read:      763.89          767.79  +0.51%
> ext4 write:     281.44          282.83  +0.49%
> btrfs read:     756.02          743.69  -1.63%
> btrfs write:    268.37          265.93  -0.90%
>
> 4. MD RAID6: stripe size 128k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       993.04          1014.1  +2.12%
> xfs write:      293.06          298.95  +2.00%
> ext4 read:      1019.6          1020.9  +0.12%
> ext4 write:     371.51          371.47  -0.01%
> btrfs read:     1000.4          1020.8  +2.03%
> btrfs write:    241.08          246.77  +2.36%
>
> 5. DM: stripe size 64k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       1084.4          1080.1  -0.39%
> xfs write:      1071.1          1063.4  -0.71%
> ext4 read:      991.54          1003.7  +1.22%
> ext4 write:     1069.7          1052.2  -1.63%
> btrfs read:     1076.1          1082.1  +0.55%
> btrfs write:    968.98          965.07  -0.40%
>
> 6. DM: stripe size 128k
>                 4.1-rc4         4.1-rc4-patched
>                 -------         ---------------
>                 (MB/s)          (MB/s)
> xfs read:       1020.4          1066.1  +4.47%
> xfs write:      1058.2          1066.6  +0.79%
> ext4 read:      990.72          988.19  -0.25%
> ext4 write:     1050.4          1070.2  +1.88%
> btrfs read:     1080.9          1074.7  -0.57%
> btrfs write:    975.10          972.76  -0.23%

Hi Mike,

How about these numbers?

I'm also happy to run other fio jobs your team used.

Thanks.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer June 10, 2015, 9:46 p.m. UTC | #22
On Wed, Jun 10 2015 at  5:20pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote:
> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> >> We need to test on large HW raid setups like a Netapp filer (or even
> >> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> >> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> >> devices is also useful.  It is larger RAID setups that will be more
> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> >> size boundaries.
> >
> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
> >
> > No performance regressions were introduced.
> >
> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> > Stripe size 64k and 128k were tested.
> >
> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> > spare_devs="/dev/sdl /dev/sdm"
> > stripe_size=64 (or 128)
> >
> > MD RAID6 was created by:
> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
> >
> > DM stripe target was created by:
> > pvcreate $devs
> > vgcreate striped_vol_group $devs
> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

DM had a regression relative to merge_bvec that wasn't fixed until
recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
casting bug in dm_merge_bvec()").  It was introduced in 4.1.

So your 4.1-rc4 DM stripe testing may have effectively been with
merge_bvec disabled.

> > Here is an example of fio script for stripe size 128k:
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=48
> > gtod_reduce=0
> > norandommap
> > write_iops_log=fs
> >
> > [job1]
> > bs=1280K
> > directory=/mnt
> > size=5G
> > rw=read
> >
> > All results here: http://minggr.net/pub/20150608/fio_results/
> >
> > Results summary:
> >
> > 1. HW RAID6: stripe size 64k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       821.23          812.20  -1.09%
> > xfs write:      753.16          754.42  +0.16%
> > ext4 read:      827.80          834.82  +0.84%
> > ext4 write:     783.08          777.58  -0.70%
> > btrfs read:     859.26          871.68  +1.44%
> > btrfs write:    815.63          844.40  +3.52%
> >
> > 2. HW RAID6: stripe size 128k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       948.27          979.11  +3.25%
> > xfs write:      820.78          819.94  -0.10%
> > ext4 read:      978.35          997.92  +2.00%
> > ext4 write:     853.51          847.97  -0.64%
> > btrfs read:     1013.1          1015.6  +0.24%
> > btrfs write:    854.43          850.42  -0.46%
> >
> > 3. MD RAID6: stripe size 64k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       847.34          869.43  +2.60%
> > xfs write:      198.67          199.03  +0.18%
> > ext4 read:      763.89          767.79  +0.51%
> > ext4 write:     281.44          282.83  +0.49%
> > btrfs read:     756.02          743.69  -1.63%
> > btrfs write:    268.37          265.93  -0.90%
> >
> > 4. MD RAID6: stripe size 128k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       993.04          1014.1  +2.12%
> > xfs write:      293.06          298.95  +2.00%
> > ext4 read:      1019.6          1020.9  +0.12%
> > ext4 write:     371.51          371.47  -0.01%
> > btrfs read:     1000.4          1020.8  +2.03%
> > btrfs write:    241.08          246.77  +2.36%
> >
> > 5. DM: stripe size 64k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       1084.4          1080.1  -0.39%
> > xfs write:      1071.1          1063.4  -0.71%
> > ext4 read:      991.54          1003.7  +1.22%
> > ext4 write:     1069.7          1052.2  -1.63%
> > btrfs read:     1076.1          1082.1  +0.55%
> > btrfs write:    968.98          965.07  -0.40%
> >
> > 6. DM: stripe size 128k
> >                 4.1-rc4         4.1-rc4-patched
> >                 -------         ---------------
> >                 (MB/s)          (MB/s)
> > xfs read:       1020.4          1066.1  +4.47%
> > xfs write:      1058.2          1066.6  +0.79%
> > ext4 read:      990.72          988.19  -0.25%
> > ext4 write:     1050.4          1070.2  +1.88%
> > btrfs read:     1080.9          1074.7  -0.57%
> > btrfs write:    975.10          972.76  -0.23%
> 
> Hi Mike,
> 
> How about these numbers?

Looks fairly good.  I just am not sure the workload is going to test the
code paths in question like we'd hope.  I'll have to set aside some time
to think through scenarios to test.

My concern still remains that at some point in the future we'll regret
not having merge_bvec but it'll be too late.  That is just my own FUD at
this point...

> I'm also happy to run other fio jobs your team used.

I've been busy getting DM changes for the 4.2 merge window finalized.
As such I haven't connected with others on the team to discuss this
issue.

I'll see if we can make time in the next 2 days.  But I also have
RHEL-specific kernel deadlines I'm coming up against.

Seems late to be staging this extensive a change for 4.2... are you
pushing for this code to land in the 4.2 merge window?  Or do we have
time to work this further and target the 4.3 merge?

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lin June 10, 2015, 10:06 p.m. UTC | #23
On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Wed, Jun 10 2015 at  5:20pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote:
>> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> >> We need to test on large HW raid setups like a Netapp filer (or even
>> >> local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> >> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> >> devices is also useful.  It is larger RAID setups that will be more
>> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> >> size boundaries.
>> >
>> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
>> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>> >
>> > No performance regressions were introduced.
>> >
>> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
>> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
>> > Stripe size 64k and 128k were tested.
>> >
>> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
>> > spare_devs="/dev/sdl /dev/sdm"
>> > stripe_size=64 (or 128)
>> >
>> > MD RAID6 was created by:
>> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>> >
>> > DM stripe target was created by:
>> > pvcreate $devs
>> > vgcreate striped_vol_group $devs
>> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> DM had a regression relative to merge_bvec that wasn't fixed until
> recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> casting bug in dm_merge_bvec()").  It was introduced in 4.1.
>
> So your 4.1-rc4 DM stripe testing may have effectively been with
> merge_bvec disabled.

I'll rebase it to the latest Linus tree and re-run the DM stripe testing.

>
>> > Here is an example of fio script for stripe size 128k:
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=48
>> > gtod_reduce=0
>> > norandommap
>> > write_iops_log=fs
>> >
>> > [job1]
>> > bs=1280K
>> > directory=/mnt
>> > size=5G
>> > rw=read
>> >
>> > All results here: http://minggr.net/pub/20150608/fio_results/
>> >
>> > Results summary:
>> >
>> > 1. HW RAID6: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       821.23          812.20  -1.09%
>> > xfs write:      753.16          754.42  +0.16%
>> > ext4 read:      827.80          834.82  +0.84%
>> > ext4 write:     783.08          777.58  -0.70%
>> > btrfs read:     859.26          871.68  +1.44%
>> > btrfs write:    815.63          844.40  +3.52%
>> >
>> > 2. HW RAID6: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       948.27          979.11  +3.25%
>> > xfs write:      820.78          819.94  -0.10%
>> > ext4 read:      978.35          997.92  +2.00%
>> > ext4 write:     853.51          847.97  -0.64%
>> > btrfs read:     1013.1          1015.6  +0.24%
>> > btrfs write:    854.43          850.42  -0.46%
>> >
>> > 3. MD RAID6: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       847.34          869.43  +2.60%
>> > xfs write:      198.67          199.03  +0.18%
>> > ext4 read:      763.89          767.79  +0.51%
>> > ext4 write:     281.44          282.83  +0.49%
>> > btrfs read:     756.02          743.69  -1.63%
>> > btrfs write:    268.37          265.93  -0.90%
>> >
>> > 4. MD RAID6: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       993.04          1014.1  +2.12%
>> > xfs write:      293.06          298.95  +2.00%
>> > ext4 read:      1019.6          1020.9  +0.12%
>> > ext4 write:     371.51          371.47  -0.01%
>> > btrfs read:     1000.4          1020.8  +2.03%
>> > btrfs write:    241.08          246.77  +2.36%
>> >
>> > 5. DM: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       1084.4          1080.1  -0.39%
>> > xfs write:      1071.1          1063.4  -0.71%
>> > ext4 read:      991.54          1003.7  +1.22%
>> > ext4 write:     1069.7          1052.2  -1.63%
>> > btrfs read:     1076.1          1082.1  +0.55%
>> > btrfs write:    968.98          965.07  -0.40%
>> >
>> > 6. DM: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       1020.4          1066.1  +4.47%
>> > xfs write:      1058.2          1066.6  +0.79%
>> > ext4 read:      990.72          988.19  -0.25%
>> > ext4 write:     1050.4          1070.2  +1.88%
>> > btrfs read:     1080.9          1074.7  -0.57%
>> > btrfs write:    975.10          972.76  -0.23%
>>
>> Hi Mike,
>>
>> How about these numbers?
>
> Looks fairly good.  I just am not sure the workload is going to test the
> code paths in question like we'd hope.  I'll have to set aside some time

How about adding some counters to record, for example, how many times
->merge_bvec is called in the old kernel and how many times bio splitting is
performed in the patched kernel?
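
A minimal sketch of the idea (illustrative only, not part of this series;
the counter name and the debugfs entry below are made up for the example):

#include <linux/atomic.h>
#include <linux/debugfs.h>
#include <linux/init.h>

static atomic_t nr_bio_splits = ATOMIC_INIT(0);

/*
 * In blk_queue_split(), inside the "if (split)" branch:
 *
 *	atomic_inc(&nr_bio_splits);
 *
 * and a matching counter bumped at the top of each ->merge_bvec_fn on
 * the unpatched kernel, so the two rates can be compared per workload.
 */

static int __init bio_split_stats_init(void)
{
	/* readable as /sys/kernel/debug/nr_bio_splits */
	debugfs_create_atomic_t("nr_bio_splits", 0444, NULL, &nr_bio_splits);
	return 0;
}
late_initcall(bio_split_stats_init);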

> to think through scenarios to test.

Great.

>
> My concern still remains that at some point in the future we'll regret
> not having merge_bvec but it'll be too late.  That is just my own FUD at
> this point...
>
>> I'm also happy to run other fio jobs your team used.
>
> I've been busy getting DM changes for the 4.2 merge window finalized.
> As such I haven't connected with others on the team to discuss this
> issue.
>
> I'll see if we can make time in the next 2 days.  But I also have
> RHEL-specific kernel deadlines I'm coming up against.
>
> Seems late to be staging this extensive a change for 4.2... are you
> pushing for this code to land in the 4.2 merge window?  Or do we have
> time to work this further and target the 4.3 merge?

I'm OK with targeting the 4.3 merge window.
But I hope we can get it into the linux-next tree ASAP for wider testing.

>
> Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lin June 12, 2015, 5:49 a.m. UTC | #24
On Wed, 2015-06-10 at 15:06 -0700, Ming Lin wrote:
> On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Wed, Jun 10 2015 at  5:20pm -0400,
> > Ming Lin <mlin@kernel.org> wrote:
> >
> >> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote:
> >> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> >> >> We need to test on large HW raid setups like a Netapp filer (or even
> >> >> local SAS drives connected via some SAS controller).  Like a 8+2 drive
> >> >> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
> >> >> devices is also useful.  It is larger RAID setups that will be more
> >> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> >> >> size boundaries.
> >> >
> >> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> >> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
> >> >
> >> > No performance regressions were introduced.
> >> >
> >> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> >> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> >> > Stripe size 64k and 128k were tested.
> >> >
> >> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> >> > spare_devs="/dev/sdl /dev/sdm"
> >> > stripe_size=64 (or 128)
> >> >
> >> > MD RAID6 was created by:
> >> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
> >> >
> >> > DM stripe target was created by:
> >> > pvcreate $devs
> >> > vgcreate striped_vol_group $devs
> >> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
> >
> > DM had a regression relative to merge_bvec that wasn't fixed until
> > recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> > casting bug in dm_merge_bvec()").  It was introduced in 4.1.
> >
> > So your 4.1-rc4 DM stripe testing may have effectively been with
> > merge_bvec disabled.
> 
> I'll rebase it to the latest Linus tree and re-run the DM stripe testing.

Here are the results for 4.1-rc7. They also look good.

5. DM: stripe size 64k
		4.1-rc7		4.1-rc7-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	784.0		783.5  -0.06%
xfs write:	751.8		768.8  +2.26%
ext4 read:	837.0		832.3  -0.56%
ext4 write:	806.8		814.3  +0.92%
btrfs read:	787.5		786.1  -0.17%
btrfs write:	722.8		718.7  -0.56%


6. DM: stripe size 128k
		4.1-rc7		4.1-rc7-patched
		-------		---------------
		(MB/s)		(MB/s)
xfs read:	1045.5		1068.8  +2.22%
xfs write:	1058.9		1052.7  -0.58%
ext4 read:	1001.8		1020.7  +1.88%
ext4 write:	1049.9		1053.7  +0.36%
btrfs read:	1082.8		1084.8  +0.18%
btrfs write:	948.15		948.74  +0.06%


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Ming Lin June 18, 2015, 5:27 a.m. UTC | #25
On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Wed, Jun 10 2015 at  5:20pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
>
>> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <mlin@kernel.org> wrote:
>> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> >> We need to test on large HW raid setups like a Netapp filer (or even
>> >> local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> >> RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> >> devices is also useful.  It is larger RAID setups that will be more
>> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> >> size boundaries.
>> >
>> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
>> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>> >
>> > No performance regressions were introduced.
>> >
>> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
>> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
>> > Stripe size 64k and 128k were tested.
>> >
>> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
>> > spare_devs="/dev/sdl /dev/sdm"
>> > stripe_size=64 (or 128)
>> >
>> > MD RAID6 was created by:
>> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>> >
>> > DM stripe target was created by:
>> > pvcreate $devs
>> > vgcreate striped_vol_group $devs
>> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> DM had a regression relative to merge_bvec that wasn't fixed until
> recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> casting bug in dm_merge_bvec()").  It was introduced in 4.1.
>
> So your 4.1-rc4 DM stripe testing may have effectively been with
> merge_bvec disabled.
>
>> > Here is an example of fio script for stripe size 128k:
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=48
>> > gtod_reduce=0
>> > norandommap
>> > write_iops_log=fs
>> >
>> > [job1]
>> > bs=1280K
>> > directory=/mnt
>> > size=5G
>> > rw=read
>> >
>> > All results here: http://minggr.net/pub/20150608/fio_results/
>> >
>> > Results summary:
>> >
>> > 1. HW RAID6: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       821.23          812.20  -1.09%
>> > xfs write:      753.16          754.42  +0.16%
>> > ext4 read:      827.80          834.82  +0.84%
>> > ext4 write:     783.08          777.58  -0.70%
>> > btrfs read:     859.26          871.68  +1.44%
>> > btrfs write:    815.63          844.40  +3.52%
>> >
>> > 2. HW RAID6: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       948.27          979.11  +3.25%
>> > xfs write:      820.78          819.94  -0.10%
>> > ext4 read:      978.35          997.92  +2.00%
>> > ext4 write:     853.51          847.97  -0.64%
>> > btrfs read:     1013.1          1015.6  +0.24%
>> > btrfs write:    854.43          850.42  -0.46%
>> >
>> > 3. MD RAID6: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       847.34          869.43  +2.60%
>> > xfs write:      198.67          199.03  +0.18%
>> > ext4 read:      763.89          767.79  +0.51%
>> > ext4 write:     281.44          282.83  +0.49%
>> > btrfs read:     756.02          743.69  -1.63%
>> > btrfs write:    268.37          265.93  -0.90%
>> >
>> > 4. MD RAID6: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       993.04          1014.1  +2.12%
>> > xfs write:      293.06          298.95  +2.00%
>> > ext4 read:      1019.6          1020.9  +0.12%
>> > ext4 write:     371.51          371.47  -0.01%
>> > btrfs read:     1000.4          1020.8  +2.03%
>> > btrfs write:    241.08          246.77  +2.36%
>> >
>> > 5. DM: stripe size 64k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       1084.4          1080.1  -0.39%
>> > xfs write:      1071.1          1063.4  -0.71%
>> > ext4 read:      991.54          1003.7  +1.22%
>> > ext4 write:     1069.7          1052.2  -1.63%
>> > btrfs read:     1076.1          1082.1  +0.55%
>> > btrfs write:    968.98          965.07  -0.40%
>> >
>> > 6. DM: stripe size 128k
>> >                 4.1-rc4         4.1-rc4-patched
>> >                 -------         ---------------
>> >                 (MB/s)          (MB/s)
>> > xfs read:       1020.4          1066.1  +4.47%
>> > xfs write:      1058.2          1066.6  +0.79%
>> > ext4 read:      990.72          988.19  -0.25%
>> > ext4 write:     1050.4          1070.2  +1.88%
>> > btrfs read:     1080.9          1074.7  -0.57%
>> > btrfs write:    975.10          972.76  -0.23%
>>
>> Hi Mike,
>>
>> How about these numbers?
>
> Looks fairly good.  I just am not sure the workload is going to test the
> code paths in question like we'd hope.  I'll have to set aside some time
> to think through scenarios to test.

Hi Mike,

Will you get a chance to think about it?

Thanks.

>
> My concern still remains that at some point in the future we'll regret
> not having merge_bvec but it'll be too late.  That is just my own FUD at
> this point...
>
>> I'm also happy to run other fio jobs your team used.
>
> I've been busy getting DM changes for the 4.2 merge window finalized.
> As such I haven't connected with others on the team to discuss this
> issue.
>
> I'll see if we can make time in the next 2 days.  But I also have
> RHEL-specific kernel deadlines I'm coming up against.
>
> Seems late to be staging this extensive a change for 4.2... are you
> pushing for this code to land in the 4.2 merge window?  Or do we have
> time to work this further and target the 4.3 merge?
>
> Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 7871603..fbbb337 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -619,6 +619,10 @@  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	if (!q->bio_split)
+		goto fail_id;
+
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
@@ -628,7 +632,7 @@  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	err = bdi_init(&q->backing_dev_info);
 	if (err)
-		goto fail_id;
+		goto fail_split;
 
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
@@ -670,6 +674,8 @@  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
+fail_split:
+	bioset_free(q->bio_split);
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
@@ -1586,6 +1592,8 @@  void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	struct request *req;
 	unsigned int request_count = 0;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1809,15 +1817,6 @@  generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
-	if (likely(bio_is_rw(bio) &&
-		   nr_sectors > queue_max_hw_sectors(q))) {
-		printk(KERN_ERR "bio too big device %s (%u > %u)\n",
-		       bdevname(bio->bi_bdev, b),
-		       bio_sectors(bio),
-		       queue_max_hw_sectors(q));
-		goto end_io;
-	}
-
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fd3fee8..dc14255 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,12 +9,158 @@ 
 
 #include "blk.h"
 
+static struct bio *blk_bio_discard_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	unsigned int max_discard_sectors, granularity;
+	int alignment;
+	sector_t tmp;
+	unsigned split_sectors;
+
+	/* Zero-sector (unknown) and one-sector granularities are the same.  */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	max_discard_sectors -= max_discard_sectors % granularity;
+
+	if (unlikely(!max_discard_sectors)) {
+		/* XXX: warn */
+		return NULL;
+	}
+
+	if (bio_sectors(bio) <= max_discard_sectors)
+		return NULL;
+
+	split_sectors = max_discard_sectors;
+
+	/*
+	 * If the next starting sector would be misaligned, stop the discard at
+	 * the previous aligned sector.
+	 */
+	alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+	tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+	tmp = sector_div(tmp, granularity);
+
+	if (split_sectors > tmp)
+		split_sectors -= tmp;
+
+	return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+					    struct bio *bio,
+					    struct bio_set *bs)
+{
+	if (!q->limits.max_write_same_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	struct bio *split;
+	struct bio_vec bv, bvprv;
+	struct bvec_iter iter;
+	unsigned seg_size = 0, nsegs = 0;
+	int prev = 0;
+
+	struct bvec_merge_data bvm = {
+		.bi_bdev	= bio->bi_bdev,
+		.bi_sector	= bio->bi_iter.bi_sector,
+		.bi_size	= 0,
+		.bi_rw		= bio->bi_rw,
+	};
+
+	bio_for_each_segment(bv, bio, iter) {
+		if (q->merge_bvec_fn &&
+		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+			goto split;
+
+		bvm.bi_size += bv.bv_len;
+
+		if (bvm.bi_size >> 9 > queue_max_sectors(q))
+			goto split;
+
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) &&
+		    prev && bvec_gap_to_prev(&bvprv, bv.bv_offset))
+			goto split;
+
+		if (prev && blk_queue_cluster(q)) {
+			if (seg_size + bv.bv_len > queue_max_segment_size(q))
+				goto new_segment;
+			if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+				goto new_segment;
+
+			seg_size += bv.bv_len;
+			bvprv = bv;
+			prev = 1;
+			continue;
+		}
+new_segment:
+		if (nsegs == queue_max_segments(q))
+			goto split;
+
+		nsegs++;
+		bvprv = bv;
+		prev = 1;
+		seg_size = bv.bv_len;
+	}
+
+	return NULL;
+split:
+	split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+	split->bi_iter.bi_size -= iter.bi_size;
+	bio->bi_iter = iter;
+
+	if (bio_integrity(bio)) {
+		bio_integrity_advance(bio, split->bi_iter.bi_size);
+		bio_integrity_trim(split, 0, bio_sectors(split));
+	}
+
+	return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+		     struct bio_set *bs)
+{
+	struct bio *split;
+
+	if ((*bio)->bi_rw & REQ_DISCARD)
+		split = blk_bio_discard_split(q, *bio, bs);
+	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+		split = blk_bio_write_same_split(q, *bio, bs);
+	else
+		split = blk_bio_segment_split(q, *bio, q->bio_split);
+
+	if (split) {
+		bio_chain(split, *bio);
+		generic_make_request(*bio);
+		*bio = split;
+	}
+}
+EXPORT_SYMBOL(blk_queue_split);
+
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
 	struct bio_vec bv, bvprv = { NULL };
-	int cluster, high, highprv = 1;
+	int cluster, prev = 0;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
@@ -36,7 +182,6 @@  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;
 	nr_phys_segs = 0;
-	high = 0;
 	for_each_bio(bio) {
 		bio_for_each_segment(bv, bio, iter) {
 			/*
@@ -46,13 +191,7 @@  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 			if (no_sg_merge)
 				goto new_segment;
 
-			/*
-			 * the trick here is making sure that a high page is
-			 * never considered part of another segment, since
-			 * that might change with the bounce page.
-			 */
-			high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
-			if (!high && !highprv && cluster) {
+			if (prev && cluster) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
 					goto new_segment;
@@ -72,8 +211,8 @@  new_segment:
 
 			nr_phys_segs++;
 			bvprv = bv;
+			prev = 1;
 			seg_size = bv.bv_len;
-			highprv = high;
 		}
 		bbio = bio;
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e68b71b..e7fae76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1256,6 +1256,8 @@  static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	rq = blk_mq_map_request(q, bio, &data);
 	if (unlikely(!rq))
 		return;
@@ -1339,6 +1341,8 @@  static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (use_plug && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count))
 		return;
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3907202..a6265bc 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1497,6 +1497,8 @@  void drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	start_jif = jiffies;
 
 	/*
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628da..ea10bd9 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2446,6 +2446,10 @@  static void pkt_make_request(struct request_queue *q, struct bio *bio)
 	char b[BDEVNAME_SIZE];
 	struct bio *split;
 
+	blk_queue_bounce(q, &bio);
+
+	blk_queue_split(q, &bio, q->bio_split);
+
 	pd = q->queuedata;
 	if (!pd) {
 		pr_err("%s incorrect request queue\n",
@@ -2476,8 +2480,6 @@  static void pkt_make_request(struct request_queue *q, struct bio *bio)
 		goto end_io;
 	}
 
-	blk_queue_bounce(q, &bio);
-
 	do {
 		sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
 		sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index ef45cfb..e32e799 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -605,6 +605,8 @@  static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
 	bio_list_add(&priv->list, bio);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index ac8c62c..50ef199 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -148,6 +148,8 @@  static void rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	might_sleep();
 
 	if (!card)
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5..13d577c 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -531,6 +531,8 @@  static void mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
 	bio->bi_next = NULL;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8dcbced..36a004e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -981,6 +981,8 @@  static void zram_make_request(struct request_queue *queue, struct bio *bio)
 	if (unlikely(!zram_meta_get(zram)))
 		goto error;
 
+	blk_queue_split(queue, &bio, queue->bio_split);
+
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
 		atomic64_inc(&zram->stats.invalid_io);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a930b72..34f6063 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1784,6 +1784,8 @@  static void dm_make_request(struct request_queue *q, struct bio *bio)
 
 	map = dm_get_live_table(md, &srcu_idx);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
 
 	/* if we're suspended, we have to queue this io for later */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 593a024..046b3c9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -257,6 +257,8 @@  static void md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da21281..267ca3a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -826,6 +826,8 @@  dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
 	if (dev_info == NULL)
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e939..1305ed3 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,6 +190,8 @@  static void xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
 		/* Request is not page-aligned. */
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index 413a840..a8645a9 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -340,6 +340,8 @@  static void loop_make_request(struct request_queue *q, struct bio *old_bio)
 	int rw = bio_rw(old_bio);
 	int inactive;
 
+	blk_queue_split(q, &old_bio, q->bio_split);
+
 	if (!lo)
 		goto err;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..93b81a2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -488,6 +488,7 @@  struct request_queue {
 
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
+	struct bio_set		*bio_split;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
@@ -812,6 +813,8 @@  extern void blk_rq_unprep_clone(struct request *rq);
 extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+			    struct bio_set *);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
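
For reference, the conversion pattern used by every bio-based make_request_fn
hunk above boils down to the following shape ("foo" is a placeholder driver,
not one of the converted ones):

#include <linux/blkdev.h>

static void foo_make_request(struct request_queue *q, struct bio *bio)
{
	/* bounce first (if the driver bounces at all), then split, so
	 * blk_queue_split() never sees pages the bounce code may replace */
	blk_queue_bounce(q, &bio);
	blk_queue_split(q, &bio, q->bio_split);

	/* existing per-bio handling continues here; bios now always fit
	 * the queue limits without relying on a merge_bvec_fn */
}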