[11/19] block: implement bio helper to add iter bvec pages to bio

Message ID 20190211190049.7888-13-axboe@kernel.dk (mailing list archive)
State New, archived
Series [01/19] fs: add an iopoll method to struct file_operations

Commit Message

Jens Axboe Feb. 11, 2019, 7 p.m. UTC
For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. This requires that the caller doesn't release
the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that.

The current two callers of bio_iov_iter_get_pages() are updated to
check if they need to release pages on completion. This makes them
work with bvecs that contain kernel mapped pages already.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
 fs/block_dev.c            |  5 ++--
 fs/iomap.c                |  5 ++--
 include/linux/blk_types.h |  1 +
 4 files changed, 56 insertions(+), 14 deletions(-)
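
The ownership rule in the commit message can be illustrated with a small
userspace model. This is only a sketch, not the kernel code: the struct
layouts, the model_ names and the flag's bit position are assumptions made
for illustration; only the name BIO_NO_PAGE_REF comes from the patch. A bio
either pins its pages itself, or is flagged so that completion leaves the
pages alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct page and struct bio (not kernel types). */
struct model_page {
	int refcount;
};

#define MODEL_BIO_NO_PAGE_REF	(1u << 0)	/* bit position is an assumption */
#define MODEL_MAX_PAGES		4

struct model_bio {
	unsigned int flags;
	struct model_page *pages[MODEL_MAX_PAGES];
	int nr_pages;
};

/*
 * Attach a page to the bio. With take_ref the bio pins the page itself;
 * without it the bio is flagged so that completion leaves the page alone.
 * In this model a bio is expected to use one mode for all of its pages.
 */
static void model_bio_add(struct model_bio *bio, struct model_page *page,
			  bool take_ref)
{
	assert(bio->nr_pages < MODEL_MAX_PAGES);
	if (take_ref)
		page->refcount++;
	else
		bio->flags |= MODEL_BIO_NO_PAGE_REF;
	bio->pages[bio->nr_pages++] = page;
}

/* Completion: drop page references only if this bio took them. */
static void model_bio_complete(struct model_bio *bio)
{
	if (!(bio->flags & MODEL_BIO_NO_PAGE_REF)) {
		for (int i = 0; i < bio->nr_pages; i++)
			bio->pages[i]->refcount--;
	}
	bio->nr_pages = 0;
}

/* Returns the page's refcount after one submit/complete cycle. */
static int model_roundtrip(bool take_ref)
{
	struct model_page page = { .refcount = 1 };	/* owner's reference */
	struct model_bio bio = { 0 };

	model_bio_add(&bio, &page, take_ref);
	model_bio_complete(&bio);
	return page.refcount;
}
```

In both modes the page ends the cycle with only its owner's reference. The
imbalance the flag is meant to prevent would appear if completion dropped a
reference that submission never took.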

Comments

Ming Lei Feb. 20, 2019, 10:58 p.m. UTC | #1
On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> diff --git a/block/bio.c b/block/bio.c
> index 4db1008309ed..330df572cfb8 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>  }
>  EXPORT_SYMBOL(bio_add_page);
>  
> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> +{
> +	const struct bio_vec *bv = iter->bvec;
> +	unsigned int len;
> +	size_t size;
> +
> +	len = min_t(size_t, bv->bv_len, iter->count);
> +	size = bio_add_page(bio, bv->bv_page, len,
> +				bv->bv_offset + iter->iov_offset);

iter->iov_offset needs to be subtracted from 'len'. It looks like the
following delta change [1] is required; otherwise memory corruption
can be observed when running xfstests over loop/dio.

Another interesting thing is that bio_add_page() is actually capable
of adding multiple contiguous pages, and loop in particular uses
ITER_BVEC to pass multi-page bvecs. Even though the pages in loop's
ITER_BVEC may belong to user space, it looks like it is still safe not
to grab the page ref, given that the fs has already done so.

[1]
diff --git a/block/bio.c b/block/bio.c
index 3b49963676fc..df99bb3816a1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -842,7 +842,10 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned int len;
 	size_t size;
 
-	len = min_t(size_t, bv->bv_len, iter->count);
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
 	size = bio_add_page(bio, bv->bv_page, len,
 				bv->bv_offset + iter->iov_offset);
 	if (size == len) {
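
The effect of the delta can be checked with a small userspace model of the
length computation. This is a sketch under stated assumptions: the structs
are trimmed-down, hypothetical stand-ins for struct bio_vec and struct
iov_iter, and the model_ helpers are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for struct bio_vec and struct iov_iter. */
struct model_bvec {
	unsigned int bv_len;	/* bytes in this segment */
	unsigned int bv_offset;	/* start of the segment within its page(s) */
};

struct model_iter {
	const struct model_bvec *bvec;
	size_t iov_offset;	/* bytes of the current segment already consumed */
	size_t count;		/* bytes left in the whole iterator */
};

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Length as computed by the original patch: ignores iov_offset. */
static size_t model_len_original(const struct model_iter *it)
{
	return min_size(it->bvec->bv_len, it->count);
}

/* Length with the delta applied: only what is left of the segment. */
static size_t model_len_fixed(const struct model_iter *it)
{
	assert(it->iov_offset <= it->bvec->bv_len);
	return min_size(it->bvec->bv_len - it->iov_offset, it->count);
}

/* One past the last byte covered, measured from the start of the backing page(s). */
static size_t model_end(const struct model_iter *it, size_t len)
{
	return it->bvec->bv_offset + it->iov_offset + len;
}
```

With a 4096-byte segment at bv_offset 0, iov_offset = 512 and count = 8192,
the original computation yields 4096 bytes ending at offset 4608, which is
512 bytes past the end of the segment. The corrected one yields 3584 bytes
ending exactly at 4096.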

Thanks,
Ming
Jens Axboe Feb. 21, 2019, 5:45 p.m. UTC | #2
On 2/20/19 3:58 PM, Ming Lei wrote:
> iter->iov_offset needs to be subtracted from 'len'. It looks like the
> following delta change [1] is required; otherwise memory corruption
> can be observed when running xfstests over loop/dio.

Thanks, I folded this in.
Eric Biggers Feb. 26, 2019, 3:46 a.m. UTC | #3
Hi Jens,

On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> Thanks, I folded this in.
> 
> -- 
> Jens Axboe
> 

syzkaller started hitting a crash on linux-next starting with this commit, and
it still occurs even with your latest version that has Ming's fix folded in.
Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
Sun Feb 24 08:20:53 2019 -0700.

Reproducer:

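#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <sys/sendfile.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        int memfd, loopfd;

        /* Anonymous memory-backed file to use as the loop backing store. */
        memfd = syscall(__NR_memfd_create, "foo", 0);

        /* One byte at offset 4096 makes the file 4097 bytes long. */
        pwrite(memfd, "\xa8", 1, 4096);

        loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);

        /* Attach the memfd as the backing file of /dev/loop0. */
        ioctl(loopfd, LOOP_SET_FD, memfd);

        /* Copy from the loop device to itself (splice path in the trace). */
        sendfile(loopfd, loopfd, NULL, 1000000);
}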


Crash:

page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:546!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 173 Comm: syz_mm Not tainted 5.0.0-rc6-00007-ga566653ab5ab8 #22
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
RIP: 0010:put_page_testzero include/linux/mm.h:546 [inline]
RIP: 0010:put_page include/linux/mm.h:992 [inline]
RIP: 0010:generic_pipe_buf_release+0x37/0x40 fs/pipe.c:225
Code: 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 74 0f f0 ff 4f 34 74 02 5d c3 e8 c7 1b fa ff 5d c3 48 c7 c6 60 aa b1 81 e8 59 25 fc ff <0f> 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 e8 a0
RSP: 0018:ffffc90000783cb0 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffff88807c358800 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88807fc95420
RBP: ffffc90000783cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000001000 R14: 0000000000000000 R15: ffff88807c0b6e00
FS:  00007fd858adb240(0000) GS:ffff88807fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dc13859000 CR3: 000000007a96b000 CR4: 00000000003406e0
Call Trace:
 pipe_buf_release include/linux/pipe_fs_i.h:136 [inline]
 iter_file_splice_write+0x2df/0x3f0 fs/splice.c:763
 do_splice_from fs/splice.c:851 [inline]
 direct_splice_actor+0x31/0x40 fs/splice.c:1023
 splice_direct_to_actor+0xff/0x240 fs/splice.c:978
 do_splice_direct+0x92/0xc0 fs/splice.c:1066
 do_sendfile+0x1be/0x390 fs/read_write.c:1436
 __do_sys_sendfile64 fs/read_write.c:1497 [inline]
 __se_sys_sendfile64+0xa6/0xc0 fs/read_write.c:1483
 __x64_sys_sendfile64+0x19/0x20 fs/read_write.c:1483
 do_syscall_64+0x4a/0x180 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fd858bd224e
Code: 89 ce 5b e9 b4 fd ff ff 0f 1f 40 00 31 c0 5b c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e2 cb 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fffc517d148 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fd858bd224e
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000004
RBP: 0000000000000003 R08: 00007fd858ca0be0 R09: 00007fffc517d240
R10: 00000000000f4240 R11: 0000000000000206 R12: 000055dc13858100
R13: 00007fffc517d240 R14: 0000000000000000 R15: 0000000000000000
---[ end trace 1d878656972e4a26 ]---
RIP: 0010:put_page_testzero include/linux/mm.h:546 [inline]
RIP: 0010:put_page include/linux/mm.h:992 [inline]
RIP: 0010:generic_pipe_buf_release+0x37/0x40 fs/pipe.c:225
Code: 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 74 0f f0 ff 4f 34 74 02 5d c3 e8 c7 1b fa ff 5d c3 48 c7 c6 60 aa b1 81 e8 59 25 fc ff <0f> 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 e8 a0
RSP: 0018:ffffc90000783cb0 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffff88807c358800 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88807fc95420
RBP: ffffc90000783cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000001000 R14: 0000000000000000 R15: ffff88807c0b6e00
FS:  00007fd858adb240(0000) GS:ffff88807fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dc13859000 CR3: 000000007a96b000 CR4: 00000000003406e0

- Eric
Jens Axboe Feb. 26, 2019, 4:34 a.m. UTC | #4
On 2/25/19 8:46 PM, Eric Biggers wrote:
> syzkaller started hitting a crash on linux-next starting with this commit, and
> it still occurs even with your latest version that has Ming's fix folded in.
> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> Sun Feb 24 08:20:53 2019 -0700.
> 
> Crash:
> 
> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> flags: 0x100000000000000()
> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> raw: 0000000000000000 0000000000000000 00000000ffffffff
> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

I see what this is, I'll cut a fix for this tomorrow.
Jens Axboe Feb. 26, 2019, 3:54 p.m. UTC | #5
On 2/25/19 9:34 PM, Jens Axboe wrote:
> I see what this is, I'll cut a fix for this tomorrow.

Folded in a fix for this, it's in my current io_uring branch and my for-next
branch.
Ming Lei Feb. 27, 2019, 1:21 a.m. UTC | #6
On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> Folded in a fix for this, it's in my current io_uring branch and my for-next
> branch.

Hi Jens,

I see that the following change was added:

+	if (size == len) {
+		/*
+		 * For the normal O_DIRECT case, we could skip grabbing this
+		 * reference and then not have to put them again when IO
+		 * completes. But this breaks some in-kernel users, like
+		 * splicing to/from a loop device, where we release the pipe
+		 * pages unconditionally. If we can fix that case, we can
+		 * get rid of the get here and the need to call
+		 * bio_release_pages() at IO completion time.
+		 */
+		get_page(bv->bv_page);

'bv' may now point to more than one page, so something like the
following may be needed:

int i;
struct bvec_iter_all iter_all;
struct bio_vec *tmp;

mp_bvec_for_each_segment(tmp, bv, i, iter_all)
      get_page(tmp->bv_page);
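
To make the concern concrete, here is a small userspace model of the
reference counts involved. It is a sketch only: the refcount array, the
model_ names and the 4096-byte page size are assumptions for illustration,
and it assumes the completion side drops one reference per page covered by
the bvec. It compares taking a single reference on the first page with
taking one reference per page.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE	4096u
#define MODEL_NR_PAGES	8

/* Hypothetical per-page reference counts, indexed by page number. */
static int model_refcount[MODEL_NR_PAGES];

/* A bvec that may span several contiguous pages. */
struct model_bvec {
	unsigned int first_page;	/* index of the first page */
	unsigned int bv_offset;		/* byte offset into the first page */
	unsigned int bv_len;		/* total length in bytes, > 0 */
};

/* Number of pages the bvec touches (assumes bv_offset < page size). */
static unsigned int model_bvec_pages(const struct model_bvec *bv)
{
	unsigned int n;

	assert(bv->bv_len > 0 && bv->bv_offset < MODEL_PAGE_SIZE);
	n = (bv->bv_offset + bv->bv_len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	assert(bv->first_page + n <= MODEL_NR_PAGES);
	return n;
}

/* Take a reference on the first page only, like get_page(bv->bv_page). */
static void model_get_first(const struct model_bvec *bv)
{
	model_refcount[bv->first_page]++;
}

/* Take one reference per page covered by the bvec. */
static void model_get_each(const struct model_bvec *bv)
{
	unsigned int n = model_bvec_pages(bv);

	for (unsigned int i = 0; i < n; i++)
		model_refcount[bv->first_page + i]++;
}

/* Completion side in this model: drop one reference per page. */
static void model_put_each(const struct model_bvec *bv)
{
	unsigned int n = model_bvec_pages(bv);

	for (unsigned int i = 0; i < n; i++)
		model_refcount[bv->first_page + i]--;
}

/* Give every page a single reference, as if held by its owner. */
static void model_reset(void)
{
	for (int i = 0; i < MODEL_NR_PAGES; i++)
		model_refcount[i] = 1;
}
```

For a bvec of 8192 bytes starting 512 bytes into page 2, three pages are
covered. Taking only the first-page reference and then putting each page
leaves pages 3 and 4 at a count of zero in this model, while the per-page
variant leaves all three pages with their owner's reference intact.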

Thanks,
Ming Lei
Jens Axboe Feb. 27, 2019, 1:47 a.m. UTC | #7
On 2/26/19 6:21 PM, Ming Lei wrote:
> Hi Jens,
> 
> I see that the following change was added:
> 
> +	if (size == len) {
> +		/*
> +		 * For the normal O_DIRECT case, we could skip grabbing this
> +		 * reference and then not have to put them again when IO
> +		 * completes. But this breaks some in-kernel users, like
> +		 * splicing to/from a loop device, where we release the pipe
> +		 * pages unconditionally. If we can fix that case, we can
> +		 * get rid of the get here and the need to call
> +		 * bio_release_pages() at IO completion time.
> +		 */
> +		get_page(bv->bv_page);
> 
> 'bv' may now point to more than one page, so something like the
> following may be needed:
> 
> int i;
> struct bvec_iter_all iter_all;
> struct bio_vec *tmp;
> 
> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>       get_page(tmp->bv_page);

I guess that would be the safest, even if we don't currently have more
than one page in there. I'll fix it up.
Ming Lei Feb. 27, 2019, 1:53 a.m. UTC | #8
On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> I guess that would be the safest, even if we don't currently have more
> than one page in there. I'll fix it up.

It is easy to see multipage bvec from loop, :-)

Thanks,
Ming
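
The point about multi-page bvecs can be illustrated with a small userspace
model. This is only a sketch: struct model_page, struct model_bvec and the
helper names are hypothetical stand-ins, not the kernel's struct page,
struct bio_vec or mp_bvec_for_each_segment(). It assumes the pages behind one
bvec sit contiguously in an array and that the offset is smaller than the
page size.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for the model */

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct model_page {
	int refcount;
};

struct model_bvec {
	struct model_page *first_page;	/* first page of a contiguous run */
	unsigned int len;		/* bytes covered by this bvec */
	unsigned int offset;		/* byte offset into the first page */
};

/* Number of pages touched by the byte range [offset, offset + len). */
static unsigned int model_bvec_nr_pages(const struct model_bvec *bv)
{
	if (bv->len == 0)
		return 0;
	return (bv->offset + bv->len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Take one reference on every page the bvec spans, not just the first. */
static unsigned int model_bvec_get_pages(struct model_bvec *bv)
{
	unsigned int i, nr = model_bvec_nr_pages(bv);

	for (i = 0; i < nr; i++)
		bv->first_page[i].refcount++;
	return nr;
}

/* Completion side: drop one reference per page, mirroring the get above. */
static void model_bvec_put_pages(struct model_bvec *bv)
{
	unsigned int i, nr = model_bvec_nr_pages(bv);

	for (i = 0; i < nr; i++) {
		/* A count of zero here corresponds to the reported crash. */
		assert(bv->first_page[i].refcount > 0);
		bv->first_page[i].refcount--;
	}
}
```

In this model, taking a reference only on first_page while the completion
side drops one per page would drive the later pages' counts to zero, which is
the kind of imbalance the per-page loop is meant to avoid.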
Jens Axboe Feb. 27, 2019, 1:57 a.m. UTC | #9
On 2/26/19 6:53 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>> Now the 'bv' may point to more than one page, so the following one may be
>>> needed:
>>>
>>> int i;
>>> struct bvec_iter_all iter_all;
>>> struct bio_vec *tmp;
>>>
>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>       get_page(tmp->bv_page);
>>
>> I guess that would be the safest, even if we don't currently have more
>> than one page in there. I'll fix it up.
> 
> It is easy to see multipage bvec from loop, :-)

Speaking of this, I took a quick look at why we've now regressed a lot
on IOPS perf with the multipage work. It looks like it's all related to
the (much) fatter setup around iteration, which is related to this very
topic too.

Basically setup of things like bio_for_each_bvec() and indexing through
nth_page() is MUCH slower than before.

We need to do something about this, it's like tossing out months of
optimizations.
Ming Lei Feb. 27, 2019, 2:21 a.m. UTC | #10
On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> On 2/26/19 6:53 PM, Ming Lei wrote:
> > It is easy to see multipage bvec from loop, :-)
> 
> Speaking of this, I took a quick look at why we've now regressed a lot
> on IOPS perf with the multipage work. It looks like it's all related to
> the (much) fatter setup around iteration, which is related to this very
> topic too.
> 
> Basically setup of things like bio_for_each_bvec() and indexing through
> nth_page() is MUCH slower than before.

But bio_for_each_bvec() doesn't need nth_page(); only bio_for_each_segment()
needs that. And bio_for_each_segment() isn't called from blk_queue_split()
or blk_rq_map_sg().

One issue is that bio_for_each_bvec() still advances by page size instead
of by bvec->bv_len. I guess that is the problem; I will cook up a patch for
you to test.

> 
> We need to do something about this, it's like tossing out months of
> optimizations.

Some follow-up optimizations can be done, such as removing
biovec_phys_mergeable() from blk_bio_segment_split().


Thanks,
Ming
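
The remark above about advancing by page size rather than by the bvec length
can be made concrete with a userspace model that only counts iteration steps.
This is a sketch under stated assumptions: struct model_bvec and both counting
helpers are hypothetical, they do not reproduce the kernel iterator macros,
and whether this step count explains the measured regression is not
established in the thread.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for the model */

/* Hypothetical stand-in for a multi-page bvec: only length and offset. */
struct model_bvec {
	unsigned int len;	/* bytes covered by this bvec */
	unsigned int offset;	/* byte offset into its first page */
};

/* Steps taken when the iterator advances at most one page per step. */
static unsigned int model_steps_by_page(const struct model_bvec *vecs, size_t n)
{
	unsigned int steps = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int off = vecs[i].offset % MODEL_PAGE_SIZE;
		unsigned int left = vecs[i].len;

		while (left) {
			/* Never cross a page boundary within one step. */
			unsigned int chunk = MODEL_PAGE_SIZE - off;

			if (chunk > left)
				chunk = left;
			left -= chunk;
			off = 0;
			steps++;
		}
	}
	return steps;
}

/* Steps taken when the iterator consumes a whole bvec per step. */
static unsigned int model_steps_by_bvec(const struct model_bvec *vecs, size_t n)
{
	unsigned int steps = 0;

	for (size_t i = 0; i < n; i++)
		if (vecs[i].len)
			steps++;
	return steps;
}
```

For three bvecs of 16384, 4096 and 8192 bytes (the last one starting 512
bytes into its page), the per-page walk takes 8 steps in this model while the
per-bvec walk takes 3.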
Jens Axboe Feb. 27, 2019, 2:28 a.m. UTC | #11
On 2/26/19 7:21 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>> Speaking of this, I took a quick look at why we've now regressed a lot
>> on IOPS perf with the multipage work. It looks like it's all related to
>> the (much) fatter setup around iteration, which is related to this very
>> topic too.
>>
>> Basically setup of things like bio_for_each_bvec() and indexing through
>> nth_page() is MUCH slower than before.
> 
> But bio_for_each_bvec() doesn't need nth_page(); only bio_for_each_segment()
> needs that. And bio_for_each_segment() isn't called from blk_queue_split()
> or blk_rq_map_sg().
> 
> One issue is that bio_for_each_bvec() still advances by page size instead
> of by bvec->bv_len. I guess that is the problem; I will cook up a patch for
> you to test.

Probably won't make a difference for my test case...

>> We need to do something about this, it's like tossing out months of
>> optimizations.
> 
> Some follow-up optimizations can be done, such as removing
> biovec_phys_mergeable() from blk_bio_segment_split().

I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
that it is possible. But iteration startup cost is a problem in a lot of
spots, and a split fast path will only help a bit for that specific
case.

A 5% regression is HUGE. I know I've mentioned this before, I just want
to really stress how big of a deal that is. It's enough to make me
consider just reverting it again, which sucks, but I don't feel great
shipping something that is known to be that much slower.

Suggestions?
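
One way to picture the fast path for <= PAGE_SIZE IOs mentioned above is a
cheap up-front check that skips the segment walk entirely. The following is a
userspace sketch only: struct model_bio, struct model_bvec and both helpers
are hypothetical names, the slow path is a simplified per-page count rather
than the real split logic, and it assumes non-empty bvecs with an offset
smaller than the page size.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for the model */

/* Hypothetical stand-ins; these are not the kernel structures. */
struct model_bvec {
	unsigned int len;	/* bytes covered by this bvec */
	unsigned int offset;	/* byte offset into its first page */
};

struct model_bio {
	const struct model_bvec *vecs;
	unsigned short vcnt;	/* number of bvecs in the bio */
};

/* True when the whole bio is one bvec that stays inside a single page. */
static bool model_bio_fits_one_page(const struct model_bio *bio)
{
	return bio->vcnt == 1 &&
	       bio->vecs[0].offset + bio->vecs[0].len <= MODEL_PAGE_SIZE;
}

/*
 * Count segments, one per page touched. The fast path answers without
 * setting up any iteration; the slow path walks every bvec.
 */
static unsigned int model_count_segments(const struct model_bio *bio,
					 bool *took_fast_path)
{
	unsigned int segs = 0;

	*took_fast_path = model_bio_fits_one_page(bio);
	if (*took_fast_path)
		return 1;

	for (unsigned short i = 0; i < bio->vcnt; i++) {
		unsigned int span = bio->vecs[i].offset + bio->vecs[i].len;

		segs += (span + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	}
	return segs;
}
```

As noted in the message, a check like this only helps the small-IO case; it
does nothing for iteration startup cost elsewhere.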
Ming Lei Feb. 27, 2019, 2:37 a.m. UTC | #12
On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> On 2/26/19 7:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >> Speaking of this, I took a quick look at why we've now regressed a lot
> >> on IOPS perf with the multipage work. It looks like it's all related to
> >> the (much) fatter setup around iteration, which is related to this very
> >> topic too.
> >>
> >> Basically setup of things like bio_for_each_bvec() and indexing through
> >> nth_page() is MUCH slower than before.
> > 
> > But bio_for_each_bvec() doesn't need nth_page(); only bio_for_each_segment()
> > needs that. And bio_for_each_segment() isn't called from blk_queue_split()
> > or blk_rq_map_sg().
> > 
> > One issue is that bio_for_each_bvec() still advances by page size instead
> > of by bvec->bv_len. I guess that is the problem; I will cook up a patch for
> > you to test.
> 
> Probably won't make a difference for my test case...
> 
> >> We need to do something about this, it's like tossing out months of
> >> optimizations.
> > 
> > Some follow-up optimizations can be done, such as removing
> > biovec_phys_mergeable() from blk_bio_segment_split().
> 
> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> that it is possible. But iteration startup cost is a problem in a lot of
> spots, and a split fast path will only help a bit for that specific
> case.
> 
> A 5% regression is HUGE. I know I've mentioned this before, I just want
> to really stress how big of a deal that is. It's enough to make me
> consider just reverting it again, which sucks, but I don't feel great
> shipping something that is known to be that much slower.
> 
> Suggestions?

You mentioned that nth_page() costs a lot in bio_for_each_bvec(), but that
path shouldn't call into nth_page() at all. I will look into it first.

Thanks,
Ming
Jens Axboe Feb. 27, 2019, 2:43 a.m. UTC | #13
On 2/26/19 7:37 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>> the (much) fatter setup around iteration, which is related to this very
>>>> topic too.
>>>>
>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>> nth_page() is MUCH slower than before.
>>>
>>> But bio_for_each_bvec() doesn't need nth_page(); only bio_for_each_segment()
>>> needs that. And bio_for_each_segment() isn't called from blk_queue_split()
>>> or blk_rq_map_sg().
>>>
>>> One issue is that bio_for_each_bvec() still advances by page size instead
>>> of by bvec->bv_len. I guess that is the problem; I will cook up a patch for
>>> you to test.
>>
>> Probably won't make a difference for my test case...
>>
>>>> We need to do something about this, it's like tossing out months of
>>>> optimizations.
>>>
>>> Some following optimization can be done, such as removing
>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>
>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>> that it is possible. But iteration startup cost is a problem in a lot of
>> spots, and a split fast path will only help a bit for that specific
>> case.
>>
>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>> to really stress how big of a deal that is. It's enough to make me
>> consider just reverting it again, which sucks, but I don't feel great
>> shipping something that is known that much slower.
>>
>> Suggestions?
> 
> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> shouldn't call into nth_page(). I will look into it first.

I'll check on the test box tomorrow, I lost connectivity before. I'll
double check in the morning.

I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
consumer.
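
The point raised in the quoted discussion, that a bvec may span more than
one page and so needs a reference taken on each page, can be modeled in
user space. This is only a sketch under stated assumptions: model_bvec,
model_bvec_nr_pages() and model_bvec_get_pages() are hypothetical names,
pages are plain array indices rather than struct page pointers, the page
size is assumed to be 4096, and bv_offset is assumed to fall inside the
first page. It is not the kernel's mp_bvec_for_each_segment().

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Hypothetical stand-in for struct bio_vec: a page index replaces bv_page. */
struct model_bvec {
	unsigned int first_page;	/* index of the first page */
	unsigned int bv_len;		/* length in bytes */
	unsigned int bv_offset;		/* byte offset into the first page */
};

/* Number of pages touched by the bvec (0 for an empty bvec). */
static unsigned int model_bvec_nr_pages(const struct model_bvec *bv)
{
	/* Model assumption: the offset lies within the first page. */
	assert(bv->bv_offset < MODEL_PAGE_SIZE);

	if (bv->bv_len == 0)
		return 0;
	return (bv->bv_offset + bv->bv_len - 1) / MODEL_PAGE_SIZE + 1;
}

/* Take one reference on every page the bvec spans, not just the first. */
static void model_bvec_get_pages(const struct model_bvec *bv,
				 unsigned int *refs)
{
	unsigned int i, n = model_bvec_nr_pages(bv);

	for (i = 0; i < n; i++)
		refs[bv->first_page + i]++;
}
```

With a bvec of 8192 bytes starting 1024 bytes into a page, the model
reports three pages, so taking a reference on the first page alone would
leave two pages unpinned.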
Ming Lei Feb. 27, 2019, 3:09 a.m. UTC | #14
On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> I'll check on the test box tomorrow, I lost connectivity before. I'll
> double check in the morning.
> 
> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> consumer.

Hi Jens,

Could you test the following patch which may improve on the 4k randio
test case?

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 066b66430523..c1ad8abbd9d6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -447,7 +447,7 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
-static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
 	if (!*sg)
@@ -483,7 +483,7 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 
 		offset = (total + bvec->bv_offset) % PAGE_SIZE;
 		idx = (total + bvec->bv_offset) / PAGE_SIZE;
-		pg = nth_page(bvec->bv_page, idx);
+		pg = bvec_nth_page(bvec->bv_page, idx);
 
 		sg_set_page(*sg, pg, seg_size, offset);
 
@@ -512,7 +512,12 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		(*sg)->length += nbytes;
 	} else {
 new_segment:
-		(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
+		if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) {
+			*sg = blk_next_sg(sg, sglist);
+			sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+			(*nsegs) += 1;
+		} else
+			(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
 	}
 	*bvprv = *bvec;
 }
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 30a57b68d017..4376f683c08a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -51,6 +51,11 @@ struct bvec_iter_all {
 	unsigned	done;
 };
 
+static inline struct page *bvec_nth_page(struct page *page, int idx)
+{
+	return idx == 0 ? page : nth_page(page, idx);
+}
+
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
@@ -87,8 +92,8 @@ struct bvec_iter_all {
 	      PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
 
 #define bvec_iter_page(bvec, iter)				\
-	nth_page(mp_bvec_iter_page((bvec), (iter)),		\
-		 mp_bvec_iter_page_idx((bvec), (iter)))
+	bvec_nth_page(mp_bvec_iter_page((bvec), (iter)),		\
+		      mp_bvec_iter_page_idx((bvec), (iter)))
 
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
@@ -171,7 +176,7 @@ static inline void mp_bvec_last_segment(const struct bio_vec *bvec,
 	unsigned total = bvec->bv_offset + bvec->bv_len;
 	unsigned last_page = (total - 1) / PAGE_SIZE;
 
-	seg->bv_page = nth_page(bvec->bv_page, last_page);
+	seg->bv_page = bvec_nth_page(bvec->bv_page, last_page);
 
 	/* the whole segment is inside the last page */
 	if (bvec->bv_offset >= last_page * PAGE_SIZE) {

thanks,
Ming
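
The two shortcuts in the patch above can be illustrated with a small
user-space model. This is a sketch under assumptions, not kernel code:
model_nth_page(), model_bvec_nth_page(), model_bvec_is_single_page() and
the call counter are hypothetical stand-ins, page frame numbers replace
struct page pointers, and the page size is assumed to be 4096.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Counts how often the (notionally expensive) generic lookup runs. */
static unsigned long model_slow_calls;

/* Stand-in for nth_page(): generic page lookup by index. */
static unsigned long model_nth_page(unsigned long pfn, unsigned int idx)
{
	model_slow_calls++;
	return pfn + idx;
}

/* Mirrors the idea of bvec_nth_page(): index 0 needs no lookup at all. */
static unsigned long model_bvec_nth_page(unsigned long pfn, unsigned int idx)
{
	return idx == 0 ? pfn : model_nth_page(pfn, idx);
}

/* Hypothetical reduced bvec: only the fields the check needs. */
struct model_bvec {
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* True when the whole bvec fits inside its first page, which is the
 * condition the patch uses to map it as one scatterlist entry. */
static bool model_bvec_is_single_page(const struct model_bvec *bv)
{
	return bv->bv_offset + bv->bv_len <= MODEL_PAGE_SIZE;
}

/* Small self-check showing the intended behaviour of the model. */
void model_selfcheck(void)
{
	struct model_bvec small = { .bv_len = 4096, .bv_offset = 0 };

	model_slow_calls = 0;
	assert(model_bvec_nth_page(100, 0) == 100);
	assert(model_slow_calls == 0);	/* fast path skipped the lookup */
	assert(model_bvec_is_single_page(&small));
}
```

A 4k I/O that starts at offset 0 of a page takes both shortcuts in this
model, which is the case the 4k random I/O test exercises.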
Jens Axboe Feb. 27, 2019, 3:37 a.m. UTC | #15
On 2/26/19 8:09 PM, Ming Lei wrote:
> Hi Jens,
> 
> Could you test the following patch which may improve on the 4k randio
> test case?

A bit, it's up 1% with this patch. I'm going to try without the
get_page/put_page that we had earlier, to see where we are in regards to
the old baseline.
Jens Axboe Feb. 27, 2019, 3:43 a.m. UTC | #16
On 2/26/19 8:37 PM, Jens Axboe wrote:
> A bit, it's up 1% with this patch. I'm going to try without the
> get_page/put_page that we had earlier, to see where we are in regards to
> the old baseline.

~1548K now, down from 1615-1620K, which matches the numbers. That's now
down roughly 4%, instead of the original 5%, with this recent patch being
the source of that reclaimed 1%.

So that's a good start, but still 4% to go.
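
The percentages above follow from the quoted IOPS figures. A minimal
sketch of the arithmetic, where pct_drop() is a hypothetical helper and the
inputs are the numbers from this message in thousands of IOPS:

```c
#include <assert.h>

/* Relative drop from baseline to current, expressed in percent. */
static double pct_drop(double baseline, double current)
{
	assert(baseline > 0.0);
	return (baseline - current) / baseline * 100.0;
}
```

Calling pct_drop(1615.0, 1548.0) and pct_drop(1620.0, 1548.0) gives values
a little above 4% for both ends of the baseline range, consistent with the
"roughly 4%" stated above.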
Ming Lei Feb. 27, 2019, 3:44 a.m. UTC | #17
On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> On 2/26/19 8:09 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> >> On 2/26/19 7:37 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>>>>>> + size_t size;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Reproducer:
> >>>>>>>>>>>>
> >>>>>>>>>>>> #define _GNU_SOURCE
> >>>>>>>>>>>> #include <fcntl.h>
> >>>>>>>>>>>> #include <linux/loop.h>
> >>>>>>>>>>>> #include <sys/ioctl.h>
> >>>>>>>>>>>> #include <sys/sendfile.h>
> >>>>>>>>>>>> #include <sys/syscall.h>
> >>>>>>>>>>>> #include <unistd.h>
> >>>>>>>>>>>>
> >>>>>>>>>>>> int main(void)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>         int memfd, loopfd;
> >>>>>>>>>>>>
> >>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Crash:
> >>>>>>>>>>>>
> >>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>>>>>> flags: 0x100000000000000()
> >>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>>>>>
> >>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>>>>>
> >>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>>>>>> branch.
> >>>>>>>>>
> >>>>>>>>> Hi Jens,
> >>>>>>>>>
> >>>>>>>>> I saw the following change is added:
> >>>>>>>>>
> >>>>>>>>> + if (size == len) {
> >>>>>>>>> + /*
> >>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>>>>>> + * reference and then not have to put them again when IO
> >>>>>>>>> + * completes. But this breaks some in-kernel users, like
> >>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>>>>>> + * get rid of the get here and the need to call
> >>>>>>>>> + * bio_release_pages() at IO completion time.
> >>>>>>>>> + */
> >>>>>>>>> + get_page(bv->bv_page);
> >>>>>>>>>
> >>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>>>>>> needed:
> >>>>>>>>>
> >>>>>>>>> int i;
> >>>>>>>>> struct bvec_iter_all iter_all;
> >>>>>>>>> struct bio_vec *tmp;
> >>>>>>>>>
> >>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>>>>>       get_page(tmp->bv_page);
> >>>>>>>>
> >>>>>>>> I guess that would be the safest, even if we don't currently have more
> >>>>>>>> than one page in there. I'll fix it up.
> >>>>>>>
> >>>>>>> It is easy to see multipage bvec from loop, :-)
> >>>>>>
> >>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> >>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> >>>>>> the (much) fatter setup around iteration, which is related to this very
> >>>>>> topic too.
> >>>>>>
> >>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> >>>>>> nth_page() is MUCH slower than before.
> >>>>>
> >>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> >>>>> needs that. However, bio_for_each_segment() isn't called from
> >>>>> blk_queue_split() and blk_rq_map_sg().
> >>>>>
> >>>>> One issue is that bio_for_each_bvec() still advances by page size
> >>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> >>>>> for your test.
> >>>>
> >>>> Probably won't make a difference for my test case...
> >>>>
> >>>>>> We need to do something about this, it's like tossing out months of
> >>>>>> optimizations.
> >>>>>
> >>>>> Some following optimization can be done, such as removing
> >>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> >>>>
> >>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> >>>> that it is possible. But iteration startup cost is a problem in a lot of
> >>>> spots, and a split fast path will only help a bit for that specific
> >>>> case.
> >>>>
> >>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> >>>> to really stress how big of a deal that is. It's enough to make me
> >>>> consider just reverting it again, which sucks, but I don't feel great
> >>>> shipping something that is known that much slower.
> >>>>
> >>>> Suggestions?
> >>>
> >>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> >>> shouldn't call into nth_page(). I will look into it first.
> >>
> >> I'll check on the test box tomorrow, I lost connectivity before. I'll
> >> double check in the morning.
> >>
> >> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> >> consumer.
> > 
> > Hi Jens,
> > 
> > Could you test the following patch which may improve on the 4k randio
> > test case?
> 
> A bit, it's up 1% with this patch. I'm going to try without the
> get_page/put_page that we had earlier, to see where we are in regards to
> the old baseline.

OK, today I will test io_uring over null_blk on one real machine and see
if something can be improved.

Thanks,
Ming
Jens Axboe Feb. 27, 2019, 4:05 a.m. UTC | #18
On 2/26/19 8:44 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
>> On 2/26/19 8:09 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reproducer:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Crash:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>>>>>
>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>>>>>> branch.
>>>>>>>>>>>
>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>
>>>>>>>>>>> I saw the following change is added:
>>>>>>>>>>>
>>>>>>>>>>> + if (size == len) {
>>>>>>>>>>> + /*
>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>>>>>> + * reference and then not have to put them again when IO
>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>>>>>> + * get rid of the get here and the need to call
>>>>>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>>>>>> + */
>>>>>>>>>>> + get_page(bv->bv_page);
>>>>>>>>>>>
>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>>>>>> needed:
>>>>>>>>>>>
>>>>>>>>>>> int i;
>>>>>>>>>>> struct bvec_iter_all iter_all;
>>>>>>>>>>> struct bio_vec *tmp;
>>>>>>>>>>>
>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>>>>>       get_page(tmp->bv_page);
>>>>>>>>>>
>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>>>>>> than one page in there. I'll fix it up.
>>>>>>>>>
>>>>>>>>> It is easy to see multipage bvec from loop, :-)
>>>>>>>>
>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>>>>>> the (much) fatter setup around iteration, which is related to this very
>>>>>>>> topic too.
>>>>>>>>
>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>>>>>> nth_page() is MUCH slower than before.
>>>>>>>
>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>>>>>> needs that. However, bio_for_each_segment() isn't called from
>>>>>>> blk_queue_split() and blk_rq_map_sg().
>>>>>>>
>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>>>>>> for your test.
>>>>>>
>>>>>> Probably won't make a difference for my test case...
>>>>>>
>>>>>>>> We need to do something about this, it's like tossing out months of
>>>>>>>> optimizations.
>>>>>>>
>>>>>>> Some following optimization can be done, such as removing
>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>>>>>
>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
>>>>>> spots, and a split fast path will only help a bit for that specific
>>>>>> case.
>>>>>>
>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>>>>>> to really stress how big of a deal that is. It's enough to make me
>>>>>> consider just reverting it again, which sucks, but I don't feel great
>>>>>> shipping something that is known that much slower.
>>>>>>
>>>>>> Suggestions?
>>>>>
>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
>>>>> shouldn't call into nth_page(). I will look into it first.
>>>>
>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
>>>> double check in the morning.
>>>>
>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
>>>> consumer.
>>>
>>> Hi Jens,
>>>
>>> Could you test the following patch which may improve on the 4k randio
>>> test case?
>>
>> A bit, it's up 1% with this patch. I'm going to try without the
>> get_page/put_page that we had earlier, to see where we are in regards to
>> the old baseline.
> 
> OK, today I will test io_uring over null_blk on one real machine and see
> if something can be improved.

For reference, I'm running the default t/io_uring from fio, which is
QD=128, fixed files/buffers, and polled. Running it on two devices to
max out the CPU core:

sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1

since nvme1n1 tops out at 1164K IOPS of 4k rand reads (hardware limit).
Just tried with null_blk, since I haven't done that before, and I get
about 1875K IOPS from a single device with the same test case. Using 2
devices yields the same result, so we're CPU core bound at that point. We
don't get the sg walk with null_blk though, but I do see about 4% of the
time in blk_queue_split() with that.
Jens Axboe Feb. 27, 2019, 4:06 a.m. UTC | #19
On 2/26/19 9:05 PM, Jens Axboe wrote:
> On 2/26/19 8:44 PM, Ming Lei wrote:
>> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
>>> On 2/26/19 8:09 PM, Ming Lei wrote:
>>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
>>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
>>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reproducer:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Crash:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>>>>>>> branch.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>
>>>>>>>>>>>> I saw the following change is added:
>>>>>>>>>>>>
>>>>>>>>>>>> + if (size == len) {
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>>>>>>> + * reference and then not have to put them again when IO
>>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>>>>>>> + * get rid of the get here and the need to call
>>>>>>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + get_page(bv->bv_page);
>>>>>>>>>>>>
>>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>>>>>>> needed:
>>>>>>>>>>>>
>>>>>>>>>>>> int i;
>>>>>>>>>>>> struct bvec_iter_all iter_all;
>>>>>>>>>>>> struct bio_vec *tmp;
>>>>>>>>>>>>
>>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>>>>>>       get_page(tmp->bv_page);
>>>>>>>>>>>
>>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>>>>>>> than one page in there. I'll fix it up.
>>>>>>>>>>
>>>>>>>>>> It is easy to see multipage bvec from loop, :-)
>>>>>>>>>
>>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>>>>>>> the (much) fatter setup around iteration, which is related to this very
>>>>>>>>> topic too.
>>>>>>>>>
>>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>>>>>>> nth_page() is MUCH slower than before.
>>>>>>>>
>>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>>>>>>> needs that. However, bio_for_each_segment() isn't called from
>>>>>>>> blk_queue_split() and blk_rq_map_sg().
>>>>>>>>
>>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
>>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>>>>>>> for your test.
>>>>>>>
>>>>>>> Probably won't make a difference for my test case...
>>>>>>>
>>>>>>>>> We need to do something about this, it's like tossing out months of
>>>>>>>>> optimizations.
>>>>>>>>
>>>>>>>> Some following optimization can be done, such as removing
>>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>>>>>>
>>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
>>>>>>> spots, and a split fast path will only help a bit for that specific
>>>>>>> case.
>>>>>>>
>>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>>>>>>> to really stress how big of a deal that is. It's enough to make me
>>>>>>> consider just reverting it again, which sucks, but I don't feel great
>>>>>>> shipping something that is known that much slower.
>>>>>>>
>>>>>>> Suggestions?
>>>>>>
>>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
>>>>>> shouldn't call into nth_page(). I will look into it first.
>>>>>
>>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
>>>>> double check in the morning.
>>>>>
>>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
>>>>> consumer.
>>>>
>>>> Hi Jens,
>>>>
>>>> Could you test the following patch which may improve on the 4k randio
>>>> test case?
>>>
>>> A bit, it's up 1% with this patch. I'm going to try without the
>>> get_page/put_page that we had earlier, to see where we are in regards to
>>> the old baseline.
>>
>> OK, today I will test io_uring over null_blk on one real machine and see
>> if something can be improved.
> 
> For reference, I'm running the default t/io_uring from fio, which is
> QD=128, fixed files/buffers, and polled. Running it on two devices to
> max out the CPU core:
> 
> sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1

Forgot to mention, this is loading nvme with 12 poll queues, which is of
course important to get good performance on this test case.
Christoph Hellwig Feb. 27, 2019, 7:42 p.m. UTC | #20
On Tue, Feb 26, 2019 at 09:06:23PM -0700, Jens Axboe wrote:
> On 2/26/19 9:05 PM, Jens Axboe wrote:
> > On 2/26/19 8:44 PM, Ming Lei wrote:
> >> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> >>> On 2/26/19 8:09 PM, Ming Lei wrote:
> >>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> >>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
> >>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> >>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> >>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>>>>>>>>> + size_t size;
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Reproducer:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> #define _GNU_SOURCE
> >>>>>>>>>>>>>>> #include <fcntl.h>
> >>>>>>>>>>>>>>> #include <linux/loop.h>
> >>>>>>>>>>>>>>> #include <sys/ioctl.h>
> >>>>>>>>>>>>>>> #include <sys/sendfile.h>
> >>>>>>>>>>>>>>> #include <sys/syscall.h>
> >>>>>>>>>>>>>>> #include <unistd.h>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> int main(void)
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>         int memfd, loopfd;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Crash:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>>>>>>>>> flags: 0x100000000000000()
> >>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>>>>>>>>> branch.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I saw the following change is added:
> >>>>>>>>>>>>
> >>>>>>>>>>>> + if (size == len) {
> >>>>>>>>>>>> + /*
> >>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>>>>>>>>> + * reference and then not have to put them again when IO
> >>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
> >>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>>>>>>>>> + * get rid of the get here and the need to call
> >>>>>>>>>>>> + * bio_release_pages() at IO completion time.
> >>>>>>>>>>>> + */
> >>>>>>>>>>>> + get_page(bv->bv_page);
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>>>>>>>>> needed:
> >>>>>>>>>>>>
> >>>>>>>>>>>> int i;
> >>>>>>>>>>>> struct bvec_iter_all iter_all;
> >>>>>>>>>>>> struct bio_vec *tmp;
> >>>>>>>>>>>>
> >>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>>>>>>>>       get_page(tmp->bv_page);
> >>>>>>>>>>>
> >>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
> >>>>>>>>>>> than one page in there. I'll fix it up.
> >>>>>>>>>>
> >>>>>>>>>> It is easy to see multipage bvec from loop, :-)
> >>>>>>>>>
> >>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> >>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> >>>>>>>>> the (much) fatter setup around iteration, which is related to this very
> >>>>>>>>> topic too.
> >>>>>>>>>
> >>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> >>>>>>>>> nth_page() is MUCH slower than before.
> >>>>>>>>
> >>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> >>>>>>>> needs that. However, bio_for_each_segment() isn't called from
> >>>>>>>> blk_queue_split() and blk_rq_map_sg().
> >>>>>>>>
> >>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
> >>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> >>>>>>>> for your test.
> >>>>>>>
> >>>>>>> Probably won't make a difference for my test case...
> >>>>>>>
> >>>>>>>>> We need to do something about this, it's like tossing out months of
> >>>>>>>>> optimizations.
> >>>>>>>>
> >>>>>>>> Some following optimization can be done, such as removing
> >>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> >>>>>>>
> >>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> >>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
> >>>>>>> spots, and a split fast path will only help a bit for that specific
> >>>>>>> case.
> >>>>>>>
> >>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> >>>>>>> to really stress how big of a deal that is. It's enough to make me
> >>>>>>> consider just reverting it again, which sucks, but I don't feel great
> >>>>>>> shipping something that is known that much slower.
> >>>>>>>
> >>>>>>> Suggestions?
> >>>>>>
> >>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> >>>>>> shouldn't call into nth_page(). I will look into it first.
> >>>>>
> >>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
> >>>>> double check in the morning.
> >>>>>
> >>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> >>>>> consumer.
> >>>>
> >>>> Hi Jens,
> >>>>
> >>>> Could you test the following patch which may improve on the 4k randio
> >>>> test case?
> >>>
> >>> A bit, it's up 1% with this patch. I'm going to try without the
> >>> get_page/put_page that we had earlier, to see where we are in regards to
> >>> the old baseline.
> >>
> >> OK, today I will test io_uring over null_blk on one real machine and see
> >> if something can be improved.
> > 
> > For reference, I'm running the default t/io_uring from fio, which is
> > QD=128, fixed files/buffers, and polled. Running it on two devices to
> > max out the CPU core:
> > 
> > sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1
> 
> Forgot to mention, this is loading nvme with 12 poll queues, which is of
> course important to get good performance on this test case.

Btw, is your nvme device SGL capable?  There is some low hanging fruit
in that IFF a device has SGL support we can basically dumb down
blk_mq_map_sg to never split in this case ever because we don't have
any segment size limits.

PRPs only, unfortunately, are a little dumb and could lead to all kinds
of wacky splitting.
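
To make the PRP vs SGL difference concrete, here is a rough userspace
sketch (not driver code). It assumes a 4 KiB controller page size and that
one SGL data block descriptor can cover a whole physically contiguous
range; the names NVME_PAGE_SIZE, prp_entries() and sgl_descriptors() are
made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 4 KiB controller page size; real devices negotiate this value. */
#define NVME_PAGE_SIZE 4096u

/*
 * Hypothetical helper: number of PRP entries needed for one physically
 * contiguous buffer. Only the first entry may carry a non-zero offset into
 * its page; every later entry must point at a page boundary, so the buffer
 * is effectively cut at each page.
 */
static size_t prp_entries(uint64_t dma_addr, size_t len)
{
	size_t offset = (size_t)(dma_addr & (NVME_PAGE_SIZE - 1));

	if (len == 0)
		return 0;
	return (offset + len + NVME_PAGE_SIZE - 1) / NVME_PAGE_SIZE;
}

/*
 * Hypothetical helper: with SGLs, a descriptor carries address and length,
 * so (in this simplified model) one contiguous range needs one descriptor
 * regardless of alignment.
 */
static size_t sgl_descriptors(uint64_t dma_addr, size_t len)
{
	(void)dma_addr;	/* alignment does not matter in this model */
	assert(len <= UINT32_MAX);	/* model assumes a 32-bit length field */
	return len ? 1 : 0;
}
```

In this model a 64 KiB page-aligned buffer costs 16 PRP entries but a
single SGL descriptor, and a misaligned 4 KiB buffer already needs two PRP
entries, which is the kind of page-granular constraint that forces the
splitting mentioned above.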
Ming Lei Feb. 27, 2019, 11:35 p.m. UTC | #21
On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> On 2/26/19 7:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>> Hi Jens,
> >>>>>>>>
> >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>
> >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>
> >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>  }
> >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>
> >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>> + size_t size;
> >>>>>>>>>>> +
> >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>
> >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>
> >>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Jens Axboe
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>
> >>>>>>>> Reproducer:
> >>>>>>>>
> >>>>>>>> #define _GNU_SOURCE
> >>>>>>>> #include <fcntl.h>
> >>>>>>>> #include <linux/loop.h>
> >>>>>>>> #include <sys/ioctl.h>
> >>>>>>>> #include <sys/sendfile.h>
> >>>>>>>> #include <sys/syscall.h>
> >>>>>>>> #include <unistd.h>
> >>>>>>>>
> >>>>>>>> int main(void)
> >>>>>>>> {
> >>>>>>>>         int memfd, loopfd;
> >>>>>>>>
> >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>
> >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>
> >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>
> >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>
> >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Crash:
> >>>>>>>>
> >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>> flags: 0x100000000000000()
> >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>
> >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>
> >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>> branch.
> >>>>>
> >>>>> Hi Jens,
> >>>>>
> >>>>> I saw the following change is added:
> >>>>>
> >>>>> + if (size == len) {
> >>>>> + /*
> >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>> + * reference and then not have to put them again when IO
> >>>>> + * completes. But this breaks some in-kernel users, like
> >>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>> + * get rid of the get here and the need to call
> >>>>> + * bio_release_pages() at IO completion time.
> >>>>> + */
> >>>>> + get_page(bv->bv_page);
> >>>>>
> >>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>> needed:
> >>>>>
> >>>>> int i;
> >>>>> struct bvec_iter_all iter_all;
> >>>>> struct bio_vec *tmp;
> >>>>>
> >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>       get_page(tmp->bv_page);
> >>>>
> >>>> I guess that would be the safest, even if we don't currently have more
> >>>> than one page in there. I'll fix it up.
> >>>
> >>> It is easy to see multipage bvec from loop, :-)
> >>
> >> Speaking of this, I took a quick look at why we've now regressed a lot
> >> on IOPS perf with the multipage work. It looks like it's all related to
> >> the (much) fatter setup around iteration, which is related to this very
> >> topic too.
> >>
> >> Basically setup of things like bio_for_each_bvec() and indexing through
> >> nth_page() is MUCH slower than before.
> > 
> > But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> > needs that. However, bio_for_each_segment() isn't called from
> > blk_queue_split() and blk_rq_map_sg().
> > 
> > One issue is that bio_for_each_bvec() still advances by page size
> > instead of bvec->len, I guess that is the problem, will cook a patch
> > for your test.
> 
> Probably won't make a difference for my test case...

The thing is that bvec_iter_len() has become much slower than before;
I will work out a patch for you soon.
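
As a rough illustration of where the extra cost could come from, here is
a simplified userspace model, not the kernel macros themselves. It assumes
the multi-page version has to clamp the length to the end of the current
page, which adds a modulo and a second min() to every call; the struct and
function names (model_bvec, model_iter, len_single_page, len_multi_page)
and the 4 KiB page size are made up for the sketch.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for the model */

/* Hypothetical, simplified stand-ins for struct bio_vec / struct bvec_iter. */
struct model_bvec {
	unsigned int len;	/* total bytes covered by this bvec */
	unsigned int offset;	/* byte offset of the data in the first page */
};

struct model_iter {
	unsigned int size;	/* bytes left in the whole iteration */
	unsigned int done;	/* bytes already consumed from this bvec */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Single-page world: bytes left in the bvec, capped by the iterator. */
static unsigned int len_single_page(const struct model_bvec *bv,
				    const struct model_iter *it)
{
	return min_u(it->size, bv->len - it->done);
}

/* Multi-page world: same value, then clamped to the end of the current page. */
static unsigned int len_multi_page(const struct model_bvec *bv,
				   const struct model_iter *it)
{
	unsigned int in_page = (bv->offset + it->done) % MODEL_PAGE_SIZE;

	return min_u(len_single_page(bv, it), MODEL_PAGE_SIZE - in_page);
}
```

For a bvec that fits in one page both helpers return the same value, so
in this model the extra arithmetic is pure overhead for the 4k case being
benchmarked.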

Thanks,
Ming
Ming Lei Feb. 28, 2019, 8:37 a.m. UTC | #22
On Wed, Feb 27, 2019 at 11:42:41AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 26, 2019 at 09:06:23PM -0700, Jens Axboe wrote:
> > On 2/26/19 9:05 PM, Jens Axboe wrote:
> > > On 2/26/19 8:44 PM, Ming Lei wrote:
> > >> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> > >>> On 2/26/19 8:09 PM, Ming Lei wrote:
> > >>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> > >>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
> > >>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> > >>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> > >>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> > >>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> > >>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> > >>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> > >>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> > >>>>>>>>>>>>>>> Hi Jens,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> > >>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> > >>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> > >>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> > >>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> > >>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> > >>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> > >>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> > >>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >>>>>>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> > >>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> > >>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> > >>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> > >>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> > >>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> > >>>>>>>>>>>>>>>>>> --- a/block/bio.c
> > >>>>>>>>>>>>>>>>>> +++ b/block/bio.c
> > >>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> > >>>>>>>>>>>>>>>>>>  }
> > >>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> > >>>>>>>>>>>>>>>>>> +{
> > >>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> > >>>>>>>>>>>>>>>>>> + unsigned int len;
> > >>>>>>>>>>>>>>>>>> + size_t size;
> > >>>>>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> > >>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> > >>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> > >>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> > >>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks, I folded this in.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>>> Jens Axboe
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> > >>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> > >>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> > >>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Reproducer:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> #define _GNU_SOURCE
> > >>>>>>>>>>>>>>> #include <fcntl.h>
> > >>>>>>>>>>>>>>> #include <linux/loop.h>
> > >>>>>>>>>>>>>>> #include <sys/ioctl.h>
> > >>>>>>>>>>>>>>> #include <sys/sendfile.h>
> > >>>>>>>>>>>>>>> #include <sys/syscall.h>
> > >>>>>>>>>>>>>>> #include <unistd.h>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> int main(void)
> > >>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>         int memfd, loopfd;
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Crash:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> > >>>>>>>>>>>>>>> flags: 0x100000000000000()
> > >>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> > >>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> > >>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> > >>>>>>>>>>>>> branch.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Jens,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I saw the following change is added:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> + if (size == len) {
> > >>>>>>>>>>>> + /*
> > >>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> > >>>>>>>>>>>> + * reference and then not have to put them again when IO
> > >>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
> > >>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> > >>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> > >>>>>>>>>>>> + * get rid of the get here and the need to call
> > >>>>>>>>>>>> + * bio_release_pages() at IO completion time.
> > >>>>>>>>>>>> + */
> > >>>>>>>>>>>> + get_page(bv->bv_page);
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> > >>>>>>>>>>>> needed:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> int i;
> > >>>>>>>>>>>> struct bvec_iter_all iter_all;
> > >>>>>>>>>>>> struct bio_vec *tmp;
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> > >>>>>>>>>>>>       get_page(tmp->bv_page);
> > >>>>>>>>>>>
> > >>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
> > >>>>>>>>>>> than one page in there. I'll fix it up.
> > >>>>>>>>>>
> > >>>>>>>>>> It is easy to see multipage bvec from loop, :-)
> > >>>>>>>>>
> > >>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> > >>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> > >>>>>>>>> the (much) fatter setup around iteration, which is related to this very
> > >>>>>>>>> topic too.
> > >>>>>>>>>
> > >>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> > >>>>>>>>> nth_page() is MUCH slower than before.
> > >>>>>>>>
> > >>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> > >>>>>>>> needs that. However, bio_for_each_segment() isn't called from
> > >>>>>>>> blk_queue_split() and blk_rq_map_sg().
> > >>>>>>>>
> > >>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
> > >>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> > >>>>>>>> for your test.
> > >>>>>>>
> > >>>>>>> Probably won't make a difference for my test case...
> > >>>>>>>
> > >>>>>>>>> We need to do something about this, it's like tossing out months of
> > >>>>>>>>> optimizations.
> > >>>>>>>>
> > >>>>>>>> Some following optimization can be done, such as removing
> > >>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> > >>>>>>>
> > >>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> > >>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
> > >>>>>>> spots, and a split fast path will only help a bit for that specific
> > >>>>>>> case.
> > >>>>>>>
> > >>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> > >>>>>>> to really stress how big of a deal that is. It's enough to make me
> > >>>>>>> consider just reverting it again, which sucks, but I don't feel great
> > >>>>>>> shipping something that is known that much slower.
> > >>>>>>>
> > >>>>>>> Suggestions?
> > >>>>>>
> > >>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> > >>>>>> shouldn't call into nth_page(). I will look into it first.
> > >>>>>
> > >>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
> > >>>>> double check in the morning.
> > >>>>>
> > >>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> > >>>>> consumer.
> > >>>>
> > >>>> Hi Jens,
> > >>>>
> > >>>> Could you test the following patch which may improve on the 4k randio
> > >>>> test case?
> > >>>
> > >>> A bit, it's up 1% with this patch. I'm going to try without the
> > >>> get_page/put_page that we had earlier, to see where we are in regards to
> > >>> the old baseline.
> > >>
> > >> OK, today I will test io_uring over null_blk on one real machine and see
> > >> if something can be improved.
> > > 
> > > For reference, I'm running the default t/io_uring from fio, which is
> > > QD=128, fixed files/buffers, and polled. Running it on two devices to
> > > max out the CPU core:
> > > 
> > > sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1
> > 
> > Forgot to mention, this is loading nvme with 12 poll queues, which is of
> > course important to get good performance on this test case.
> 
> Btw, is your nvme device SGL capable?  There is some low hanging fruit
> in that IFF a device has SGL support we can basically dumb down
> blk_mq_map_sg to never split in this case ever because we don't have
> any segment size limits.

Indeed.

In the SGL case, a big sg list may not be needed, and blk_rq_map_sg() can
be skipped if a proper DMA mapping interface returns the DMA address
for each segment. That could be one big improvement.

Thanks,
Ming
Christoph Hellwig March 8, 2019, 7:55 a.m. UTC | #23
On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> On 2/26/19 7:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>> Hi Jens,
> >>>>>>>>
> >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>
> >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>
> >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>  }
> >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>
> >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>> + size_t size;
> >>>>>>>>>>> +
> >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>
> >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>
> >>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Jens Axboe
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>
> >>>>>>>> Reproducer:
> >>>>>>>>
> >>>>>>>> #define _GNU_SOURCE
> >>>>>>>> #include <fcntl.h>
> >>>>>>>> #include <linux/loop.h>
> >>>>>>>> #include <sys/ioctl.h>
> >>>>>>>> #include <sys/sendfile.h>
> >>>>>>>> #include <sys/syscall.h>
> >>>>>>>> #include <unistd.h>
> >>>>>>>>
> >>>>>>>> int main(void)
> >>>>>>>> {
> >>>>>>>>         int memfd, loopfd;
> >>>>>>>>
> >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>
> >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>
> >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>
> >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>
> >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Crash:
> >>>>>>>>
> >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>> flags: 0x100000000000000()
> >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>
> >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>
> >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>> branch.
> >>>>>
> >>>>> Hi Jens,
> >>>>>
> >>>>> I saw the following change is added:
> >>>>>
> >>>>> + if (size == len) {
> >>>>> + /*
> >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>> + * reference and then not have to put them again when IO
> >>>>> + * completes. But this breaks some in-kernel users, like
> >>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>> + * get rid of the get here and the need to call
> >>>>> + * bio_release_pages() at IO completion time.
> >>>>> + */
> >>>>> + get_page(bv->bv_page);
> >>>>>
> >>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>> needed:
> >>>>>
> >>>>> int i;
> >>>>> struct bvec_iter_all iter_all;
> >>>>> struct bio_vec *tmp;
> >>>>>
> >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>       get_page(tmp->bv_page);
> >>>>
> > Some following optimization can be done, such as removing
> > biovec_phys_mergeable() from blk_bio_segment_split().
> 
> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> that it is possible. But iteration startup cost is a problem in a lot of
> spots, and a split fast path will only help a bit for that specific
> case.

FYI, I've got a nice fast path for the driver side in nvme here, but
I'll need to do some more testing before submitting it:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-optimize-single-segment-io

But in the block layer I think one major issue is all the phys_segments
crap.  What we really should do is to remove bi_phys_segments and all
the front/back segment crap and only do the calculation of the actual
per-bio segments once, just before adding the bio to the segment.

And don't bother with it at all unless the driver has weird segment
size or boundary limitations.
Christoph Hellwig March 8, 2019, 8:18 a.m. UTC | #24
On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> Speaking of this, I took a quick look at why we've now regressed a lot
> on IOPS perf with the multipage work. It looks like it's all related to
> the (much) fatter setup around iteration, which is related to this very
> topic too.

> Basically setup of things like bio_for_each_bvec() and indexing through
> nth_page() is MUCH slower than before.

I haven't quite figured out what the point of nth_page is.  If we
physically merge, the page structures should also be consecutive
in memory in general.  The only case where this could theoretically
not be the case is with CONFIG_DISCONTIGMEM, but in that case we should
check this once in biovec_phys_mergeable, and only for that case.

Does this patch make a difference for you on x86?

--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -53,7 +53,7 @@ struct bvec_iter_all {
 
 static inline struct page *bvec_nth_page(struct page *page, int idx)
 {
-	return idx == 0 ? page : nth_page(page, idx);
+	return page + idx;
 }
 
 /*
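
For illustration only, a small userspace model of the argument above. It
assumes all page structures live in one contiguous array, so going through
the pfn and back gives the same pointer as plain pointer arithmetic. The
model_* names are made up for the sketch and are not the kernel
definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct model_page {
	unsigned long flags;
};

/* Assumed flat layout: one contiguous array of page structures. */
static struct model_page model_mem_map[16];

static unsigned long model_page_to_pfn(const struct model_page *page)
{
	return (unsigned long)(page - model_mem_map);
}

static struct model_page *model_pfn_to_page(unsigned long pfn)
{
	return &model_mem_map[pfn];
}

/* nth_page()-style lookup: convert to a pfn, add, convert back. */
static struct model_page *model_nth_page(struct model_page *page, int n)
{
	return model_pfn_to_page(model_page_to_pfn(page) + n);
}

/* What the patch does instead: plain pointer arithmetic. */
static struct model_page *model_page_plus_idx(struct model_page *page, int idx)
{
	return page + idx;
}
```

With this layout both helpers always return the same pointer. Only a
memory model where the page structures of physically adjacent pages are
not adjacent themselves could make them differ, which is the case that
would need the one-time check.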
Ming Lei March 8, 2019, 9:12 a.m. UTC | #25
On Fri, Mar 08, 2019 at 08:55:22AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> > On 2/26/19 7:21 PM, Ming Lei wrote:
> > > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> > >> On 2/26/19 6:53 PM, Ming Lei wrote:
> > >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> > >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> > >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>
> > >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> > >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> > >>>>>>>> Hi Jens,
> > >>>>>>>>
> > >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> > >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> > >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> > >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> > >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> > >>>>>>>>>>> check if they need to release pages on completion. This makes them
> > >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> > >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> > >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >>>>>>>>>>> ---
> > >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> > >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> > >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> > >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> > >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> > >>>>>>>>>>>
> > >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> > >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> > >>>>>>>>>>> --- a/block/bio.c
> > >>>>>>>>>>> +++ b/block/bio.c
> > >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> > >>>>>>>>>>>  }
> > >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> > >>>>>>>>>>>
> > >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> > >>>>>>>>>>> +{
> > >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> > >>>>>>>>>>> + unsigned int len;
> > >>>>>>>>>>> + size_t size;
> > >>>>>>>>>>> +
> > >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> > >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> > >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> > >>>>>>>>>>
> > >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> > >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> > >>>>>>>>>> can be observed when running xfstests over loop/dio.
> > >>>>>>>>>
> > >>>>>>>>> Thanks, I folded this in.
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Jens Axboe
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> > >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> > >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> > >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> > >>>>>>>>
> > >>>>>>>> Reproducer:
> > >>>>>>>>
> > >>>>>>>> #define _GNU_SOURCE
> > >>>>>>>> #include <fcntl.h>
> > >>>>>>>> #include <linux/loop.h>
> > >>>>>>>> #include <sys/ioctl.h>
> > >>>>>>>> #include <sys/sendfile.h>
> > >>>>>>>> #include <sys/syscall.h>
> > >>>>>>>> #include <unistd.h>
> > >>>>>>>>
> > >>>>>>>> int main(void)
> > >>>>>>>> {
> > >>>>>>>>         int memfd, loopfd;
> > >>>>>>>>
> > >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> > >>>>>>>>
> > >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> > >>>>>>>>
> > >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> > >>>>>>>>
> > >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> > >>>>>>>>
> > >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> > >>>>>>>> }
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Crash:
> > >>>>>>>>
> > >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> > >>>>>>>> flags: 0x100000000000000()
> > >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> > >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> > >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > >>>>>>>
> > >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> > >>>>>>
> > >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> > >>>>>> branch.
> > >>>>>
> > >>>>> Hi Jens,
> > >>>>>
> > >>>>> I saw the following change is added:
> > >>>>>
> > >>>>> + if (size == len) {
> > >>>>> + /*
> > >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> > >>>>> + * reference and then not have to put them again when IO
> > >>>>> + * completes. But this breaks some in-kernel users, like
> > >>>>> + * splicing to/from a loop device, where we release the pipe
> > >>>>> + * pages unconditionally. If we can fix that case, we can
> > >>>>> + * get rid of the get here and the need to call
> > >>>>> + * bio_release_pages() at IO completion time.
> > >>>>> + */
> > >>>>> + get_page(bv->bv_page);
> > >>>>>
> > >>>>> Now the 'bv' may point to more than one page, so something like the
> > >>>>> following may be needed:
> > >>>>>
> > >>>>> int i;
> > >>>>> struct bvec_iter_all iter_all;
> > >>>>> struct bio_vec *tmp;
> > >>>>>
> > >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> > >>>>>       get_page(tmp->bv_page);
> > >>>>
> > > Some follow-up optimizations can be done, such as removing
> > > biovec_phys_mergeable() from blk_bio_segment_split().
> > 
> > I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> > that it is possible. But iteration startup cost is a problem in a lot of
> > spots, and a split fast path will only help a bit for that specific
> > case.
> 
> FYI, I've got a nice fast path for the driver side in nvme here, but
> I'll need to do some more testing before submitting it:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-optimize-single-segment-io
> 
> But in the block layer I think one major issue is all the phys_segments
> crap.  What we really should do is to remove bi_phys_segments and all
> the front/back segment crap and only do the calculation of the actual
> per-bio segments once, just before adding the bio to the segment.

I have enabled multi-page bvec for passthrough IO in the following:

https://github.com/ming1/linux/commits/v5.0-blk-for-blk_post_mp

in which .bi_phys_segments becomes the same as .bi_vcnt for passthrough bios.

Also, intra-bvec merging within one bio has been killed, so only merging
between bios is required. It seems we still need the front/back segment size,
especially since some use cases (such as mkfs) may create lots of small
mergeable bios.

> 
> And don't bother with it at all unless the driver has weird segment
> size or boundary limitations.

It should be easy to observe that .bv_len is bigger than the max segment
size.

Thanks,
Ming
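
The length problem raised in the first review comment on
__bio_iov_bvec_add_pages() can be illustrated outside the kernel. The
following is a minimal userspace sketch, not the delta change[1] itself:
struct model_bvec, struct model_iter and the three helpers are hypothetical
stand-ins that only mirror the fields used in the patch. It assumes
iov_offset <= bv_len. When iov_offset is nonzero, the length computed as in
the posted patch makes the range passed to bio_add_page() run past the end of
the bvec, while subtracting iov_offset keeps it inside.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct model_bvec {
	unsigned int bv_len;	/* bytes described by this bvec */
	unsigned int bv_offset;	/* offset of the data within the page */
};

struct model_iter {
	size_t iov_offset;	/* bytes of the current bvec already consumed */
	size_t count;		/* bytes left in the whole iterator */
};

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Length as computed in the posted patch: iov_offset is ignored. */
static size_t len_as_posted(const struct model_bvec *bv,
			    const struct model_iter *it)
{
	return min_size(bv->bv_len, it->count);
}

/* Length with iov_offset subtracted, as the review comment asks for. */
static size_t len_with_offset(const struct model_bvec *bv,
			      const struct model_iter *it)
{
	return min_size(bv->bv_len - it->iov_offset, it->count);
}

/*
 * Number of bytes by which [bv_offset + iov_offset, ... + len) extends
 * past the end of the bvec; 0 means the range stays inside it.
 */
static size_t overrun(const struct model_bvec *bv,
		      const struct model_iter *it, size_t len)
{
	size_t start = bv->bv_offset + it->iov_offset;
	size_t end = (size_t)bv->bv_offset + bv->bv_len;

	return start + len > end ? start + len - end : 0;
}
```

With a 4096-byte bvec, iov_offset = 512 and count = 8192, the posted
calculation overruns the bvec by exactly iov_offset bytes, which is consistent
with the memory corruption reported for loop/dio.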

Patch

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..330df572cfb8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -828,6 +828,23 @@  int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	len = min_t(size_t, bv->bv_len, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
@@ -876,23 +893,43 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 
 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
+ *
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. If we're adding kernel pages, we just have to add the pages to the
+ * bio directly. We don't grab an extra reference to those pages (the user
+ * should already have that), and we don't put the page on IO completion.
+ * The caller needs to check if the bio is flagged BIO_NO_PAGE_REF on IO
+ * completion. If it isn't, then pages should be released.
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
  * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;
 
+	/*
+	 * If this is a BVEC iter, then the pages are kernel pages. Don't
+	 * release them on IO completion.
+	 */
+	if (is_bvec)
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+
 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);
 
 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
@@ -1634,7 +1671,8 @@  static void bio_dirty_fn(struct work_struct *work)
 		next = bio->bi_private;
 
 		bio_set_pages_dirty(bio);
-		bio_release_pages(bio);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_release_pages(bio);
 		bio_put(bio);
 	}
 }
@@ -1650,7 +1688,8 @@  void bio_check_pages_dirty(struct bio *bio)
 			goto defer;
 	}
 
-	bio_release_pages(bio);
+	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+		bio_release_pages(bio);
 	bio_put(bio);
 	return;
 defer:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 392e2bfb636f..051ab41d1c61 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -338,8 +338,9 @@  static void blkdev_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 2ac9eb746d44..9389cf0a1c6f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1591,8 +1591,9 @@  static void iomap_dio_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..791fee35df88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -215,6 +215,7 @@  struct bio {
 /*
  * bio flags
  */
+#define BIO_NO_PAGE_REF	0	/* don't put/release bvec pages */
 #define BIO_SEG_VALID	1	/* bi_phys_segments valid */
 #define BIO_CLONED	2	/* doesn't own data */
 #define BIO_BOUNCED	3	/* bio is a bounce bio */
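
The reference-counting contract this patch introduces can be sketched as a
small userspace model. Everything named model_* below is hypothetical and
only mimics the shape of the patch as posted: the BVEC path sets the flag and
takes no page reference, the user-memory path takes one reference per page,
and the completion handler drops references only when the flag is clear. It
does not reflect the later fix discussed in the thread, which takes a
reference in the BVEC path as well.

```c
#include <stdbool.h>

#define MODEL_BIO_NO_PAGE_REF	0	/* bit number, as in the patch */
#define MODEL_MAX_VECS		4

struct model_page {
	int refcount;
};

struct model_bio {
	unsigned int flags;
	struct model_page *pages[MODEL_MAX_VECS];
	int vcnt;
};

static void model_set_flag(struct model_bio *bio, unsigned int bit)
{
	bio->flags |= 1U << bit;
}

static bool model_flagged(const struct model_bio *bio, unsigned int bit)
{
	return (bio->flags & (1U << bit)) != 0;
}

/*
 * Add one page to the bio. For a BVEC source the caller already holds a
 * reference, so only the flag is set; for user memory a reference is
 * taken. Returns 0 on success, -1 if the bio is full.
 */
static int model_add_page(struct model_bio *bio, struct model_page *page,
			  bool is_bvec)
{
	if (bio->vcnt >= MODEL_MAX_VECS)
		return -1;
	if (is_bvec)
		model_set_flag(bio, MODEL_BIO_NO_PAGE_REF);
	else
		page->refcount++;
	bio->pages[bio->vcnt++] = page;
	return 0;
}

/* Completion: drop page references only if this bio took them. */
static void model_end_io(struct model_bio *bio)
{
	int i;

	if (model_flagged(bio, MODEL_BIO_NO_PAGE_REF))
		return;
	for (i = 0; i < bio->vcnt; i++)
		bio->pages[i]->refcount--;
}
```

In both paths the page's reference count after completion equals its value
before the page was added, which is the invariant the BIO_NO_PAGE_REF checks
in the completion handlers are meant to preserve.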