
block: Fix __blkdev_direct_IO()

Message ID 20190801102151.7846-1-damien.lemoal@wdc.com (mailing list archive)
State New, archived

Commit Message

Damien Le Moal Aug. 1, 2019, 10:21 a.m. UTC
The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
(patch 6a43074e2f46) introduced two problems with BIO fragment handling
for direct IOs:
1) The dio size processed is calculated by incrementing the ret variable
by the size of the bio fragment issued for the dio. However, this size
is obtained directly from bio->bi_iter.bi_size AFTER the bio submission,
which may result in referencing the bi_size value after the bio has
completed, resulting in an incorrect value being used.
2) The ret variable is not incremented by the size of the last bio
fragment issued for the bio, leading to an invalid IO size being
returned to the user.

Fix both problems by using dio->size (which is incremented before the bio
submission) to update the value of ret after bio submissions, including
for the last bio fragment issued.

Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 fs/block_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jens Axboe Aug. 1, 2019, 7:51 p.m. UTC | #1
On 8/1/19 4:21 AM, Damien Le Moal wrote:
> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
> for direct IOs:
> 1) The dio size processed is calculated by incrementing the ret variable
> by the size of the bio fragment issued for the dio. However, this size
> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
> which may result in referencing the bi_size value after the bio
> completed, resulting in an incorrect value use.
> 2) The ret variable is not incremented by the size of the last bio
> fragment issued for the bio, leading to an invalid IO size being
> returned to the user.
> 
> Fix both problems by using dio->size (which is incremented before the bio
> submission) to update the value of ret after bio submissions, including
> for the last bio fragment issued.

Thanks, applied. Do you have a test case? I ran this through the usual
xfstests and block bits, but didn't catch anything.
Damien Le Moal Aug. 2, 2019, 1:05 a.m. UTC | #2
On 2019/08/02 4:51, Jens Axboe wrote:
> On 8/1/19 4:21 AM, Damien Le Moal wrote:
>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
>> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
>> for direct IOs:
>> 1) The dio size processed is calculated by incrementing the ret variable
>> by the size of the bio fragment issued for the dio. However, this size
>> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
>> which may result in referencing the bi_size value after the bio
>> completed, resulting in an incorrect value use.
>> 2) The ret variable is not incremented by the size of the last bio
>> fragment issued for the bio, leading to an invalid IO size being
>> returned to the user.
>>
>> Fix both problems by using dio->size (which is incremented before the bio
>> submission) to update the value of ret after bio submissions, including
>> for the last bio fragment issued.
> 
> Thanks, applied. Do you have a test case? I ran this through the usual
> xfstests and block bits, but didn't catch anything.
> 

The problem was detected with our weekly RC test runs for zoned block devices.
RC1 last week was OK, but failures happened on RC2 Monday. We never hit an oops
for the BIO reference after submission, but we were getting a lot of unaligned
write errors for all types of zoned drives tested (real SMR disks, tcmu-runner
ZBC handler disks and nullblk in zoned mode) using various applications (fio,
dd, ...).

Masato isolated the problem to very large direct writes and could reliably
recreate it with a dd doing a single 8MB write to a sequential zone. In this
case, blktrace showed that the 8MB write was split into multiple BIOs
(expected) and the whole 8MB was cleanly written sequentially. But this was
followed by a stream of small 4K writes starting around the end of the 8MB
chunk, but within it, causing unaligned write errors (overwriting within
sequential zones is not possible).

dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1

Running this on a nullblk disk in zoned mode should recreate the problem, with
blktrace revealing that dd sees a short write for the 8M IO and issues the
remainder as 4K requests.

Using a regular disk, this however does not generate any error at all, which
may explain why you did not see any problem. I think we can create a blktests
case for this using nullblk in zoned mode. Would you like us to send one?

Best regards.
Jens Axboe Aug. 2, 2019, 3:57 a.m. UTC | #3
On 8/1/19 7:05 PM, Damien Le Moal wrote:
> On 2019/08/02 4:51, Jens Axboe wrote:
>> On 8/1/19 4:21 AM, Damien Le Moal wrote:
>>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
>>> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
>>> for direct IOs:
>>> 1) The dio size processed is calculated by incrementing the ret variable
>>> by the size of the bio fragment issued for the dio. However, this size
>>> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
>>> which may result in referencing the bi_size value after the bio
>>> completed, resulting in an incorrect value use.
>>> 2) The ret variable is not incremented by the size of the last bio
>>> fragment issued for the bio, leading to an invalid IO size being
>>> returned to the user.
>>>
>>> Fix both problems by using dio->size (which is incremented before the bio
>>> submission) to update the value of ret after bio submissions, including
>>> for the last bio fragment issued.
>>
>> Thanks, applied. Do you have a test case? I ran this through the usual
>> xfstests and block bits, but didn't catch anything.
>>
> 
> The problem was detected with our weekly RC test runs for zoned block
> devices.  RC1 last week was OK, but failures happened on RC2 Monday. We
> never hit an oops for the BIO reference after submission, but we were
> getting a lot of unaligned write errors for all types of zoned drives
> tested (real SMR disks, tcmu-runner ZBC handler disks and nullblk in
> zoned mode) using various applications (fio, dd, ...)
> 
> Masato isolated the problem to very large direct writes and could
> reliably recreate the problem with a dd doing a single 8MB write to a
> sequential zone.  In this case, blktrace showed that the 8MB write
> was split into multiple BIOs (expected) and the whole 8MB being
> cleanly written sequentially. But this was followed by a stream of
> small 4K writes starting around the end of the 8MB chunk, but within
> it, causing unaligned write errors (overwrite in sequential zones not
> being possible).
> 
> dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1
> 
> Running this on a nullblk disk in zoned mode should recreate the problem,
> with blktrace revealing that dd sees a short write for the 8M IO and
> issues the remainder as 4K requests.
> 
> Using a regular disk, this however does not generate any error at all,
> which may explain why you did not see any problem. I think we can
> create a blktests case for this using nullblk in zoned mode. Would you
> like us to send one?

Thanks for the detailed explanation. I think we should absolutely have a
blktests case for this, even if O_DIRECT isn't strictly block level,
it's definitely in a grey zone that I think we should cover.
Damien Le Moal Aug. 2, 2019, 4:11 a.m. UTC | #4
On 2019/08/02 12:57, Jens Axboe wrote:
> On 8/1/19 7:05 PM, Damien Le Moal wrote:
>> On 2019/08/02 4:51, Jens Axboe wrote:
>>> On 8/1/19 4:21 AM, Damien Le Moal wrote:
>>>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
>>>> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
>>>> for direct IOs:
>>>> 1) The dio size processed is calculated by incrementing the ret variable
>>>> by the size of the bio fragment issued for the dio. However, this size
>>>> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
>>>> which may result in referencing the bi_size value after the bio
>>>> completed, resulting in an incorrect value use.
>>>> 2) The ret variable is not incremented by the size of the last bio
>>>> fragment issued for the bio, leading to an invalid IO size being
>>>> returned to the user.
>>>>
>>>> Fix both problems by using dio->size (which is incremented before the bio
>>>> submission) to update the value of ret after bio submissions, including
>>>> for the last bio fragment issued.
>>>
>>> Thanks, applied. Do you have a test case? I ran this through the usual
>>> xfstests and block bits, but didn't catch anything.
>>>
>>
>> The problem was detected with our weekly RC test runs for zoned block
>> devices.  RC1 last week was OK, but failures happened on RC2 Monday. We
>> never hit an oops for the BIO reference after submission, but we were
>> getting a lot of unaligned write errors for all types of zoned drives
>> tested (real SMR disks, tcmu-runner ZBC handler disks and nullblk in
>> zoned mode) using various applications (fio, dd, ...)
>>
>> Masato isolated the problem to very large direct writes and could
>> reliably recreate the problem with a dd doing a single 8MB write to a
>> sequential zone.  In this case, blktrace showed that the 8MB write
>> was split into multiple BIOs (expected) and the whole 8MB being
>> cleanly written sequentially. But this was followed by a stream of
>> small 4K writes starting around the end of the 8MB chunk, but within
>> it, causing unaligned write errors (overwrite in sequential zones not
>> being possible).
>>
>> dd if=/dev/zero of=/dev/nullb0 bs=8M oflag=direct count=1
>>
>> Running this on a nullblk disk in zoned mode should recreate the problem,
>> with blktrace revealing that dd sees a short write for the 8M IO and
>> issues the remainder as 4K requests.
>>
>> Using a regular disk, this however does not generate any error at all,
>> which may explain why you did not see any problem. I think we can
>> create a blktests case for this using nullblk in zoned mode. Would you
>> like us to send one?
> 
> Thanks for the detailed explanation. I think we should absolutely have a
> blktests case for this, even if O_DIRECT isn't strictly block level,
> it's definitely in a grey zone that I think we should cover.
> 

OK. We will work on a test case and post it.
Bart Van Assche Aug. 5, 2019, 5:11 p.m. UTC | #5
On 8/1/19 3:21 AM, Damien Le Moal wrote:
> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
> for direct IOs:
> 1) The dio size processed is calculated by incrementing the ret variable
> by the size of the bio fragment issued for the dio. However, this size
> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
> which may result in referencing the bi_size value after the bio
> completed, resulting in an incorrect value use.
> 2) The ret variable is not incremented by the size of the last bio
> fragment issued for the bio, leading to an invalid IO size being
> returned to the user.
> 
> Fix both problems by using dio->size (which is incremented before the bio
> submission) to update the value of ret after bio submissions, including
> for the last bio fragment issued.
> 
> Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
> Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>   fs/block_dev.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c2a85b587922..75cc7f424b3a 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -439,6 +439,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>   					ret = -EAGAIN;
>   				goto error;
>   			}
> +			ret = dio->size;
>   
>   			if (polled)
>   				WRITE_ONCE(iocb->ki_cookie, qc);
> @@ -465,7 +466,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>   				ret = -EAGAIN;
>   			goto error;
>   		}
> -		ret += bio->bi_iter.bi_size;
> +		ret = dio->size;
>   
>   		bio = bio_alloc(gfp, nr_pages);
>   		if (!bio) {

Hi Damien,

Had you verified this patch with blktests and KASAN enabled? I think the
above patch introduced the following KASAN complaint:

==================================================================
BUG: KASAN: use-after-free in blkdev_direct_IO+0x9d5/0xa20
Read of size 8 at addr ffff888114d872c8 by task fio/12456

CPU: 3 PID: 12456 Comm: fio Not tainted 5.3.0-rc3-dbg+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
  dump_stack+0x86/0xca
  print_address_description+0x74/0x32d
  __kasan_report.cold.6+0x1b/0x36
  kasan_report+0x12/0x17
  __asan_load8+0x54/0x90
  blkdev_direct_IO+0x9d5/0xa20
  generic_file_direct_write+0x10e/0x200
  __generic_file_write_iter+0x120/0x2a0
  blkdev_write_iter+0x13c/0x220
  aio_write+0x1bb/0x2d0
  io_submit_one+0xe30/0x18e0
  __x64_sys_io_submit+0x117/0x350
  do_syscall_64+0x71/0x270
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Allocated by task 12456:
  save_stack+0x21/0x90
  __kasan_kmalloc.constprop.9+0xc7/0xd0
  kasan_slab_alloc+0x11/0x20
  kmem_cache_alloc+0xf9/0x3a0
  mempool_alloc_slab+0x15/0x20
  mempool_alloc+0xef/0x260
  bio_alloc_bioset+0x148/0x310
  blkdev_direct_IO+0x223/0xa20
  generic_file_direct_write+0x10e/0x200
  __generic_file_write_iter+0x120/0x2a0
  blkdev_write_iter+0x13c/0x220
  aio_write+0x1bb/0x2d0
  io_submit_one+0xe30/0x18e0
  __x64_sys_io_submit+0x117/0x350
  do_syscall_64+0x71/0x270
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 26:
  save_stack+0x21/0x90
  __kasan_slab_free+0x139/0x190
  kasan_slab_free+0xe/0x10
  slab_free_freelist_hook+0x67/0x1e0
  kmem_cache_free+0x102/0x320
  mempool_free_slab+0x17/0x20
  mempool_free+0x63/0x160
  bio_free+0x144/0x210
  bio_put+0x59/0x60
  blkdev_bio_end_io+0x172/0x1f0
  bio_endio+0x2b8/0x4c0
  blk_update_request+0x180/0x4f0
  end_clone_bio+0xbb/0xd0
  bio_endio+0x2b8/0x4c0
  blk_update_request+0x180/0x4f0
  blk_mq_end_request+0x33/0x200
  nvme_complete_rq+0xff/0x420 [nvme_core]
  nvme_rdma_complete_rq+0xba/0x100 [nvme_rdma]
  blk_mq_complete_request+0x174/0x250
  nvme_rdma_recv_done+0x330/0x5a0 [nvme_rdma]
  __ib_process_cq+0x8c/0x100 [ib_core]
  ib_poll_handler+0x47/0xb0 [ib_core]
  irq_poll_softirq+0xf5/0x270
  __do_softirq+0x128/0x60f

(gdb) list *(blkdev_direct_IO+0x9d5)
0x49c5 is in blkdev_direct_IO (fs/block_dev.c:442).
437                             if (qc == BLK_QC_T_EAGAIN) {
438                                     if (!ret)
439                                             ret = -EAGAIN;
440                                     goto error;
441                             }
442                             ret = dio->size;
443
444                             if (polled)
445                                     WRITE_ONCE(iocb->ki_cookie, qc);
446                             break;
Jens Axboe Aug. 6, 2019, 7:53 p.m. UTC | #6
On 8/5/19 10:11 AM, Bart Van Assche wrote:
> On 8/1/19 3:21 AM, Damien Le Moal wrote:
>> The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
>> (patch 6a43074e2f46) introduced two problems with BIO fragment handling
>> for direct IOs:
>> 1) The dio size processed is calculated by incrementing the ret variable
>> by the size of the bio fragment issued for the dio. However, this size
>> is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
>> which may result in referencing the bi_size value after the bio
>> completed, resulting in an incorrect value use.
>> 2) The ret variable is not incremented by the size of the last bio
>> fragment issued for the bio, leading to an invalid IO size being
>> returned to the user.
>>
>> Fix both problems by using dio->size (which is incremented before the bio
>> submission) to update the value of ret after bio submissions, including
>> for the last bio fragment issued.
>>
>> Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
>> Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> ---
>>    fs/block_dev.c | 3 ++-
>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index c2a85b587922..75cc7f424b3a 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -439,6 +439,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>    					ret = -EAGAIN;
>>    				goto error;
>>    			}
>> +			ret = dio->size;
>>    
>>    			if (polled)
>>    				WRITE_ONCE(iocb->ki_cookie, qc);
>> @@ -465,7 +466,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>    				ret = -EAGAIN;
>>    			goto error;
>>    		}
>> -		ret += bio->bi_iter.bi_size;
>> +		ret = dio->size;
>>    
>>    		bio = bio_alloc(gfp, nr_pages);
>>    		if (!bio) {
> 
> Hi Damien,
> 
> Had you verified this patch with blktests and KASAN enabled? I think the
> above patch introduced the following KASAN complaint:

I posted this in another thread, can you try?


diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..131e2e0582a6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	loff_t pos = iocb->ki_pos;
 	blk_qc_t qc = BLK_QC_T_NONE;
 	gfp_t gfp;
-	ssize_t ret;
+	int ret;
 
 	if ((pos | iov_iter_alignment(iter)) &
 	    (bdev_logical_block_size(bdev) - 1))
@@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 	ret = 0;
 	for (;;) {
-		int err;
-
 		bio_set_dev(bio, bdev);
 		bio->bi_iter.bi_sector = pos >> 9;
 		bio->bi_write_hint = iocb->ki_hint;
@@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
 
-		err = bio_iov_iter_get_pages(bio, iter);
-		if (unlikely(err)) {
-			if (!ret)
-				ret = err;
+		ret = bio_iov_iter_get_pages(bio, iter);
+		if (unlikely(ret)) {
 			bio->bi_status = BLK_STS_IOERR;
 			bio_endio(bio);
 			break;
@@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		if (nowait)
 			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
 
-		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
 
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
@@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				polled = true;
 			}
 
+			dio->size += bio->bi_iter.bi_size;
 			qc = submit_bio(bio);
 			if (qc == BLK_QC_T_EAGAIN) {
-				if (!ret)
-					ret = -EAGAIN;
+				dio->size -= bio->bi_iter.bi_size;
+				ret = -EAGAIN;
 				goto error;
 			}
-			ret = dio->size;
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -460,18 +455,17 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			atomic_inc(&dio->ref);
 		}
 
+		dio->size += bio->bi_iter.bi_size;
 		qc = submit_bio(bio);
 		if (qc == BLK_QC_T_EAGAIN) {
-			if (!ret)
-				ret = -EAGAIN;
+			dio->size -= bio->bi_iter.bi_size;
+			ret = -EAGAIN;
 			goto error;
 		}
-		ret = dio->size;
 
 		bio = bio_alloc(gfp, nr_pages);
 		if (!bio) {
-			if (!ret)
-				ret = -EAGAIN;
+			ret = -EAGAIN;
 			goto error;
 		}
 	}
@@ -496,6 +490,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 out:
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
+	if (likely(!ret))
+		ret = dio->size;
 
 	bio_put(&dio->bio);
 	return ret;
Damien Le Moal Aug. 9, 2019, 1:01 p.m. UTC | #7
On 2019/08/07 4:53, Jens Axboe wrote:
>> Hi Damien,
>>
>> Had you verified this patch with blktests and KASAN enabled? I think the
>> above patch introduced the following KASAN complaint:
> 
> I posted this in another thread, can you try?
> 
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index a6f7c892cb4a..131e2e0582a6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  	loff_t pos = iocb->ki_pos;
>  	blk_qc_t qc = BLK_QC_T_NONE;
>  	gfp_t gfp;
> -	ssize_t ret;
> +	int ret;
>  
>  	if ((pos | iov_iter_alignment(iter)) &
>  	    (bdev_logical_block_size(bdev) - 1))
> @@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  
>  	ret = 0;
>  	for (;;) {
> -		int err;
> -
>  		bio_set_dev(bio, bdev);
>  		bio->bi_iter.bi_sector = pos >> 9;
>  		bio->bi_write_hint = iocb->ki_hint;
> @@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		bio->bi_end_io = blkdev_bio_end_io;
>  		bio->bi_ioprio = iocb->ki_ioprio;
>  
> -		err = bio_iov_iter_get_pages(bio, iter);
> -		if (unlikely(err)) {
> -			if (!ret)
> -				ret = err;
> +		ret = bio_iov_iter_get_pages(bio, iter);
> +		if (unlikely(ret)) {
>  			bio->bi_status = BLK_STS_IOERR;
>  			bio_endio(bio);
>  			break;
> @@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		if (nowait)
>  			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
>  
> -		dio->size += bio->bi_iter.bi_size;
>  		pos += bio->bi_iter.bi_size;
>  
>  		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> @@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  				polled = true;
>  			}
>  
> +			dio->size += bio->bi_iter.bi_size;
>  			qc = submit_bio(bio);
>  			if (qc == BLK_QC_T_EAGAIN) {
> -				if (!ret)
> -					ret = -EAGAIN;
> +				dio->size -= bio->bi_iter.bi_size;
> +				ret = -EAGAIN;
>  				goto error;
>  			}
> -			ret = dio->size;
>  
>  			if (polled)
>  				WRITE_ONCE(iocb->ki_cookie, qc);
> @@ -460,18 +455,17 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  			atomic_inc(&dio->ref);
>  		}
>  
> +		dio->size += bio->bi_iter.bi_size;
>  		qc = submit_bio(bio);
>  		if (qc == BLK_QC_T_EAGAIN) {
> -			if (!ret)
> -				ret = -EAGAIN;
> +			dio->size -= bio->bi_iter.bi_size;
> +			ret = -EAGAIN;
>  			goto error;
>  		}
> -		ret = dio->size;
>  
>  		bio = bio_alloc(gfp, nr_pages);
>  		if (!bio) {
> -			if (!ret)
> -				ret = -EAGAIN;
> +			ret = -EAGAIN;
>  			goto error;
>  		}
>  	}
> @@ -496,6 +490,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  out:
>  	if (!ret)
>  		ret = blk_status_to_errno(dio->bio.bi_status);
> +	if (likely(!ret))
> +		ret = dio->size;
>  
>  	bio_put(&dio->bio);
>  	return ret;
> 

Jens,

I tested a slightly modified version of your patch. I think it is 100%
equivalent, but a little cleaner in my opinion.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index a6f7c892cb4a..343cd716c53b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
        loff_t pos = iocb->ki_pos;
        blk_qc_t qc = BLK_QC_T_NONE;
        gfp_t gfp;
-       ssize_t ret;
+       ssize_t ret = 0;

        if ((pos | iov_iter_alignment(iter)) &
            (bdev_logical_block_size(bdev) - 1))
@@ -386,7 +386,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

        ret = 0;
        for (;;) {
-               int err;
+               unsigned int bio_size;

                bio_set_dev(bio, bdev);
                bio->bi_iter.bi_sector = pos >> 9;
@@ -395,10 +395,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                bio->bi_end_io = blkdev_bio_end_io;
                bio->bi_ioprio = iocb->ki_ioprio;

-               err = bio_iov_iter_get_pages(bio, iter);
-               if (unlikely(err)) {
-                       if (!ret)
-                               ret = err;
+               ret = bio_iov_iter_get_pages(bio, iter);
+               if (unlikely(ret)) {
                        bio->bi_status = BLK_STS_IOERR;
                        bio_endio(bio);
                        break;
@@ -421,7 +419,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                if (nowait)
                        bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);

-               dio->size += bio->bi_iter.bi_size;
+               bio_size = bio->bi_iter.bi_size;
+               dio->size += bio_size;
                pos += bio->bi_iter.bi_size;

                nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
@@ -435,11 +434,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                        qc = submit_bio(bio);
                        if (qc == BLK_QC_T_EAGAIN) {
-                               if (!ret)
+                               dio->size -= bio_size;
+                               if (!dio->size)
                                        ret = -EAGAIN;
                                goto error;
                        }
-                       ret = dio->size;

                        if (polled)
                                WRITE_ONCE(iocb->ki_cookie, qc);
@@ -462,18 +461,15 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                qc = submit_bio(bio);
                if (qc == BLK_QC_T_EAGAIN) {
-                       if (!ret)
+                       dio->size -= bio_size;
+                       if (!dio->size)
                                ret = -EAGAIN;
                        goto error;
                }
-               ret = dio->size;

                bio = bio_alloc(gfp, nr_pages);
-               if (!bio) {
-                       if (!ret)
-                               ret = -EAGAIN;
+               if (!bio)
                        goto error;
-               }
        }

        if (!is_poll)
@@ -496,6 +492,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 out:
        if (!ret)
                ret = blk_status_to_errno(dio->bio.bi_status);
+       if (likely(!ret))
+               ret = dio->size;

        bio_put(&dio->bio);
        return ret;

To test and make sure that I can consistently create EAGAIN errors with
RWF_NOWAIT, I set an HDD QD to 1, its nr_requests to 4 (minimum), enabled
"nomerges" and set the disk max_sectors_kb to 512 KB. With this, at most 4
512KB requests can be issued at any time, meaning that for a DIO larger than
512KB x 4 = 2MB, BIO fragment submission will likely hit EAGAIN due to the
lack of tags.

With this, here is what I get for tests with 4MB direct write IOs, with all
cases (sync/async, nowait or not).

## sync, wait case
4194304 4194304			<- requested size, size return by pwritev2()
0 KB read, 4096 KB written	<- from iostat, KB amount read/written

regular sync+wait case is OK.

## sync, nowait case
4194304 4194304			<- KASAN use-after-free warning
0 KB read, 4608 KB written	<- One more 512KB BIO issued than reported

With sync+nowait, sizes are correct but iostat shows too many writes. Also of
note is that some fragments failed submission with EAGAIN, and despite that,
the size returned matches the size asked for instead of a smaller size.

## async, wait case
4194304 4194304
0 KB read, 4096 KB written

## async, nowait case
4194304 4194304
0 KB read, 4096 KB written

The 2 async cases are OK.

Something is wrong with the sync+nowait case, but I cannot quite figure out
what yet. What I gather so far is this:
- BIO_MAX_PAGES being 256, the BIO fragments are 1MB. But the disk
max_sectors_kb being 512KB, the BIO fragments get split into 2 512KB requests.
It looks like one half of the split is failed with EAGAIN, but the other half
makes it to the disk, showing an excess 512KB. Changing the disk
max_sectors_kb to 256 or 128, I see the excess being exactly that every time.
That said, the total size should still be lower than the requested 4MB since
BIO fragment submission stopped before the entire dio was processed.

I have been playing around with the following patch on top of the previous
one, to make sure that BIOs are freed (I believe the current code leaks them
on EAGAIN) and that the dio is completed.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 343cd716c53b..bbb694c4a8c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -295,6 +295,31 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
        return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
 }

+static void blkdev_dio_end(struct blkdev_dio *dio)
+{
+       pr_info("--- DIO done %zu B\n", dio->size);
+       if (!dio->is_sync) {
+               struct kiocb *iocb = dio->iocb;
+               ssize_t ret;
+
+               if (likely(!dio->bio.bi_status)) {
+                       ret = dio->size;
+                       iocb->ki_pos += ret;
+               } else {
+                       ret = blk_status_to_errno(dio->bio.bi_status);
+               }
+
+               dio->iocb->ki_complete(iocb, ret, 0);
+               if (dio->multi_bio)
+                       bio_put(&dio->bio);
+       } else {
+               struct task_struct *waiter = dio->waiter;
+
+               WRITE_ONCE(dio->waiter, NULL);
+               blk_wake_io_task(waiter);
+       }
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
        struct blkdev_dio *dio = bio->bi_private;
@@ -303,28 +328,9 @@ static void blkdev_bio_end_io(struct bio *bio)
        if (bio->bi_status && !dio->bio.bi_status)
                dio->bio.bi_status = bio->bi_status;

-       if (!dio->multi_bio || atomic_dec_and_test(&dio->ref)) {
-               if (!dio->is_sync) {
-                       struct kiocb *iocb = dio->iocb;
-                       ssize_t ret;
-
-                       if (likely(!dio->bio.bi_status)) {
-                               ret = dio->size;
-                               iocb->ki_pos += ret;
-                       } else {
-                               ret = blk_status_to_errno(dio->bio.bi_status);
-                       }
-
-                       dio->iocb->ki_complete(iocb, ret, 0);
-                       if (dio->multi_bio)
-                               bio_put(&dio->bio);
-               } else {
-                       struct task_struct *waiter = dio->waiter;
-
-                       WRITE_ONCE(dio->waiter, NULL);
-                       blk_wake_io_task(waiter);
-               }
-       }
+       pr_info("--- BIO done, dio %zu B\n", dio->size);
+       if (!dio->multi_bio || atomic_dec_and_test(&dio->ref))
+               blkdev_dio_end(dio);

        if (should_dirty) {
                bio_check_pages_dirty(bio);
@@ -351,6 +357,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
        gfp_t gfp;
        ssize_t ret = 0;

+       pr_info("====== __blkdev_direct_IO ======\n");
+
        if ((pos | iov_iter_alignment(iter)) &
            (bdev_logical_block_size(bdev) - 1))
                return -EINVAL;
@@ -421,7 +429,9 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

                bio_size = bio->bi_iter.bi_size;
                dio->size += bio_size;
-               pos += bio->bi_iter.bi_size;
+               pos += bio_size;
+
+               pr_info("+++ BIO frag %u B\n", bio_size);

                nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
                if (!nr_pages) {
@@ -435,9 +445,9 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                        qc = submit_bio(bio);
                        if (qc == BLK_QC_T_EAGAIN) {
                                dio->size -= bio_size;
-                               if (!dio->size)
-                                       ret = -EAGAIN;
-                               goto error;
+                               ret = -EAGAIN;
+                               pr_info("last BIO submit eagain, dio %zu B\n", dio->size);
+                               break;
                        }

                        if (polled)
@@ -462,19 +472,53 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
                qc = submit_bio(bio);
                if (qc == BLK_QC_T_EAGAIN) {
                        dio->size -= bio_size;
-                       if (!dio->size)
-                               ret = -EAGAIN;
-                       goto error;
+                       pr_info("BIO submit eagain, dio %zu B\n", dio->size);
+                       /*
+                        * The reference on dio taken above is for the next
+                        * fragment, not this one.
+                        */
+                       atomic_dec(&dio->ref);
+                       ret = -EAGAIN;
+                       break;
                }

                bio = bio_alloc(gfp, nr_pages);
-               if (!bio)
-                       goto error;
+               if (!bio) {
+                       pr_info("BIO alloc eagain, dio %zu B\n", dio->size);
+                       ret = -EAGAIN;
+                       break;
+               }
        }

        if (!is_poll)
                blk_finish_plug(&plug);

+       if (ret == -EAGAIN) {
+               WARN_ON(!nowait);
+
+               /*
+                * If no BIO could be issued in the nowait case, fail everything
+                * with EAGAIN.
+                */
+               if (!dio->size)
+                       goto out;
+
+               /*
+                * A fragment of a multi-bio dio could not be issued in the
+                * nowait case: allow issued fragments to proceed (resulting in
+                * a short dio) and ensure completion of the dio by completing
+                * the failed fragment to decrement the extra reference taken
+                * on the dio and release the bio pages.
+                */
+               pr_info("DIO partial %zu B\n", dio->size);
+               WARN_ON(bio == &dio->bio);
+               if (bio) {
+                       bio_uninit(bio);
+                       blkdev_bio_end_io(bio);
+               }
+               ret = 0;
+       }
+
        if (!is_sync)
                return -EIOCBQUEUED;

@@ -497,10 +541,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)

        bio_put(&dio->bio);
        return ret;
-error:
-       if (!is_poll)
-               blk_finish_plug(&plug);
-       goto out;
 }

 static ssize_t

With this patch, I get a different behavior:

## sync, wait
4194304 4194304
0 KB read, 4096 KB writen

No change here.

## sync, nowait
... KASAN warning and hang; no output from the test application.
The debug messages show this:

[   90.142525] ====== __blkdev_direct_IO ======
[   90.147503] +++ BIO frag 1048576 B
[   90.151729] +++ BIO frag 1048576 B
[   90.155256] BIO submit eagain, dio 1048576 B
[   90.159571] DIO partial 1048576 B
[   90.162914] --- BIO done, dio 1048576 B
[   90.167345] --- BIO done, dio 1048576 B
[   90.171237] --- DIO done 1048576 B

That is, the first 1MB fragment is issued, the second 1MB fragment hits -EAGAIN
on submission, and the DIO is completed as a partial 1MB write. But two fragment
completions are shown being processed by blkdev_bio_end_io(), while only one
fragment was issued. Right after these messages, KASAN screams:

==================================================================
[   90.182106] BUG: KASAN: use-after-free in blk_rq_map_sg+0x1227/0x1660
[   90.188561] Read of size 8 at addr ffff8890079b1100 by task kworker/31:1H/932
[   90.195725]
[   90.197235] CPU: 31 PID: 932 Comm: kworker/31:1H Not tainted 5.3.0-rc3+ #211

Clearly, a BIO is being freed while still being executed, likely one of the
splits of the large 1MB BIO that failed with -EAGAIN.

## async, wait
4194304 4194304
0 KB read, 4096 KB writen

No problem here, same as sync+wait case.

## async, nowait
4194304 3145728
0 KB read, 3072 KB writen

Here, the application returns and shows a partial short write of 3MB. The debug
messages show that the last fragment failed submission with -EAGAIN, leading to
the correctly reported 3MB short write.

[  190.437652] ====== __blkdev_direct_IO ======
[  190.442590] +++ BIO frag 1048576 B
[  190.446688] +++ BIO frag 1048576 B
[  190.450795] +++ BIO frag 1048576 B
[  190.454237] --- BIO done, dio 3145728 B
[  190.458790] +++ BIO frag 1048576 B
[  190.462297] last BIO submit eagain, dio 3145728 B
[  190.467045] DIO partial 3145728 B
[  190.470388] --- BIO done, dio 3145728 B
[  190.474288] --- BIO done, dio 3145728 B
[  190.480198] --- BIO done, dio 3145728 B
[  190.484092] --- DIO done 3145728 B

But I am still getting a KASAN warning and a hang. The warning is the same as
in the sync+nowait case:

==================================================================
[  190.494974] BUG: KASAN: use-after-free in blk_rq_map_sg+0x1227/0x1660
[  190.501433] Read of size 8 at addr ffff889050965500 by task kworker/31:1H/1125

And the hang is also the same: an oops with an infinite stream of the
following messages.

[  190.917591] swiotlb_tbl_map_single: 4 callbacks suppressed
[  190.917596] mpt3sas 0000:d8:00.0: swiotlb buffer is full (sz: 4096 bytes),
total 0 (slots), used 0 (slots)
[  190.932770] mpt3sas 0000:d8:00.0: overflow 0xffe7a3fcc18c0000+4096 of DMA
mask 7fffffffffffffff bus mask 0
[  190.942450] WARNING: CPU: 31 PID: 1125 at kernel/dma/direct.c:35
report_addr+0xc5/0x170
[  190.950460] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver
nfs lockd grace fscache nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT
ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip6table_mangle
ip6table_raw ip6table_security iptable_mangle iptable_raw iptable_security
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink
ip6table_filter ip6_tables sunrpc x86_pkg_temp_thermal coretemp kvm_intel kvm
crc32_pclmul acpi_power_meter mpt3sas i40e nvme raid_class nvme_core skd
[  190.996072] CPU: 31 PID: 1125 Comm: kworker/31:1H Tainted: G    B
5.3.0-rc3+ #211
[  191.004598] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.0b 03/04/2019
[  191.012088] Workqueue: kblockd blk_mq_run_work_fn
[  191.016805] RIP: 0010:report_addr+0xc5/0x170
[  191.021086] Code: e1 48 c1 e9 03 80 3c 01 00 0f 85 97 00 00 00 4d 8b 24 24 b8
fe ff ff ff 49 39 c4 76 3a 80 3d 39 57 81 02 00 0f 84 a0 00 00 00 <0f> 0b 48 b8
00 00 00 00 00 fc ff df 48 c7 04 03 00 00 00 00 48 8b
[  191.039855] RSP: 0018:ffff88904f8776e8 EFLAGS: 00010282
[  191.045091] RAX: 0000000000000000 RBX: 1ffff11209f0eede RCX: 0000000000000000
[  191.052230] RDX: ffffed1209f0ee84 RSI: 0000000000000000 RDI: ffffed1209f0eed0
[  191.059373] RBP: ffff888817606f28 R08: 000000000000005e R09: ffffed120b2f7689
[  191.066513] R10: ffffed120b2f7688 R11: ffff8890597bb447 R12: 7fffffffffffffff
[  191.073656] R13: 7fffffffffffffff R14: ffe7a3fcc18c0fff R15: ffff888620280abc
[  191.080795] FS:  0000000000000000(0000) GS:ffff889059780000(0000)
knlGS:0000000000000000
[  191.088889] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  191.094646] CR2: 00007fafc9cea000 CR3: 000000107d614003 CR4: 00000000005606e0
[  191.101788] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  191.108929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  191.116068] PKRU: 55555554
[  191.118781] Call Trace:
[  191.121239]  ? dma_direct_unmap_sg+0xb0/0xb0
[  191.125516]  ? swiotlb_map+0xd7/0x490
[  191.129192]  ? _raw_spin_unlock_irqrestore+0x45/0x50
[  191.134169]  dma_direct_map_page+0x23b/0x2f0
[  191.138447]  ? __kasan_report.cold+0xd/0x3e
[  191.142643]  ? report_addr+0x170/0x170
[  191.146403]  ? blk_rq_map_sg+0x1227/0x1660
[  191.150510]  ? blk_rq_map_sg+0x1118/0x1660
[  191.154623]  dma_direct_map_sg+0x108/0x1e0
[  191.158731]  scsi_dma_map+0x18e/0x290
[  191.162412]  _base_build_sg_scmd_ieee+0x134/0x1870 [mpt3sas]
[  191.168079]  ? memset+0x20/0x40
[  191.171241]  scsih_qcmd+0xfb7/0x1a20 [mpt3sas]
[  191.175701]  scsi_queue_rq+0xdb4/0x2ce0
[  191.179547]  blk_mq_dispatch_rq_list+0xbbc/0x1560
[  191.184261]  ? elv_rb_del+0x3b/0x80
[  191.187762]  ? blk_mq_get_driver_tag+0x4c0/0x4c0
[  191.192388]  ? dd_dispatch_request+0x210/0x930
[  191.196847]  blk_mq_do_dispatch_sched+0x16c/0x380
[  191.201560]  ? blk_mq_sched_mark_restart_hctx+0x60/0x60
[  191.206793]  ? __lock_acquire+0xd55/0x4560
[  191.210901]  blk_mq_sched_dispatch_requests+0x33c/0x530
[  191.216134]  ? lock_acquire+0x11d/0x2f0
[  191.219984]  ? blk_mq_sched_restart+0x70/0x70
[  191.224351]  ? process_one_work+0x677/0x1270
[  191.228635]  __blk_mq_run_hw_queue+0xeb/0x210
[  191.232999]  ? hctx_lock+0x160/0x160
[  191.236589]  ? process_one_work+0x693/0x1270
[  191.240872]  process_one_work+0x73a/0x1270
[  191.244981]  ? pwq_dec_nr_in_flight+0x2d0/0x2d0
[  191.249518]  ? do_raw_spin_lock+0x120/0x280
[  191.253718]  worker_thread+0x94/0xc70
[  191.257391]  ? __kthread_parkme+0xbd/0x1a0
[  191.261498]  ? process_one_work+0x1270/0x1270
[  191.265864]  kthread+0x2ed/0x3f0
[  191.269104]  ? kthread_park+0x130/0x130
[  191.272954]  ret_from_fork+0x24/0x30
[  191.276547] irq event stamp: 90
[  191.279695] hardirqs last  enabled at (89): [<ffffffff812bb154>]
ktime_get+0x104/0x130
[  191.287616] hardirqs last disabled at (90): [<ffffffff8282f1b4>]
_raw_spin_lock_irqsave+0x14/0x50
[  191.296491] softirqs last  enabled at (52): [<ffffffff8232df68>]
flush_backlog+0x378/0x5a0
[  191.304758] softirqs last disabled at (48): [<ffffffff8232dbf5>]
flush_backlog+0x5/0x5a0
[  191.312853] ---[ end trace f878467310c76256 ]---
[  191.317491] sd 10:0:0:0: scsi_dma_map failed: request for 524288 bytes!
[  191.343446] mpt3sas 0000:d8:00.0: swiotlb buffer is full (sz: 4096 bytes),
total 0 (slots), used 0 (slots)
[  191.353128] sd 10:0:0:0: scsi_dma_map failed: request for 524288 bytes!
[  191.387552] mpt3sas 0000:d8:00.0: swiotlb buffer is full (sz: 4096 bytes),
total 0 (slots), used 0 (slots)
[  191.397268] sd 10:0:0:0: scsi_dma_map failed: request for 524288 bytes!
[  191.431556] mpt3sas 0000:d8:00.0: swiotlb buffer is full (sz: 4096 bytes),
total 0 (slots), used 0 (slots)
[  191.441264] sd 10:0:0:0: scsi_dma_map failed: request for 524288 bytes!
[  191.475552] mpt3sas 0000:d8:00.0: swiotlb buffer is full (sz: 4096 bytes),
total 0 (slots), used 0 (slots)
[  191.485263] sd 10:0:0:0: scsi_dma_map failed: request for 524288 bytes!

I am still digging into all this, but I strongly suspect that this is related
to the NOWAIT flag not being handled correctly when a BIO gets split into
smaller fragments and some of those fragments fail submission with -EAGAIN.
Not 100% sure yet though.

Best regards.

Jens Axboe Aug. 9, 2019, 3:52 p.m. UTC | #8
On 8/9/19 6:01 AM, Damien Le Moal wrote:
> On 2019/08/07 4:53, Jens Axboe wrote:
>>> Hi Damien,
>>>
>>> Had you verified this patch with blktests and KASAN enabled? I think the
>>> above patch introduced the following KASAN complaint:
>>
>> I posted this in another thread, can you try?
>>
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index a6f7c892cb4a..131e2e0582a6 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -349,7 +349,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   	loff_t pos = iocb->ki_pos;
>>   	blk_qc_t qc = BLK_QC_T_NONE;
>>   	gfp_t gfp;
>> -	ssize_t ret;
>> +	int ret;
>>   
>>   	if ((pos | iov_iter_alignment(iter)) &
>>   	    (bdev_logical_block_size(bdev) - 1))
>> @@ -386,8 +386,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   
>>   	ret = 0;
>>   	for (;;) {
>> -		int err;
>> -
>>   		bio_set_dev(bio, bdev);
>>   		bio->bi_iter.bi_sector = pos >> 9;
>>   		bio->bi_write_hint = iocb->ki_hint;
>> @@ -395,10 +393,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   		bio->bi_end_io = blkdev_bio_end_io;
>>   		bio->bi_ioprio = iocb->ki_ioprio;
>>   
>> -		err = bio_iov_iter_get_pages(bio, iter);
>> -		if (unlikely(err)) {
>> -			if (!ret)
>> -				ret = err;
>> +		ret = bio_iov_iter_get_pages(bio, iter);
>> +		if (unlikely(ret)) {
>>   			bio->bi_status = BLK_STS_IOERR;
>>   			bio_endio(bio);
>>   			break;
>> @@ -421,7 +417,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   		if (nowait)
>>   			bio->bi_opf |= (REQ_NOWAIT | REQ_NOWAIT_INLINE);
>>   
>> -		dio->size += bio->bi_iter.bi_size;
>>   		pos += bio->bi_iter.bi_size;
>>   
>>   		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>> @@ -433,13 +428,13 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   				polled = true;
>>   			}
>>   
>> +			dio->size += bio->bi_iter.bi_size;
>>   			qc = submit_bio(bio);
>>   			if (qc == BLK_QC_T_EAGAIN) {
>> -				if (!ret)
>> -					ret = -EAGAIN;
>> +				dio->size -= bio->bi_iter.bi_size;
>> +				ret = -EAGAIN;
>>   				goto error;
>>   			}
>> -			ret = dio->size;
>>   
>>   			if (polled)
>>   				WRITE_ONCE(iocb->ki_cookie, qc);
>> @@ -460,18 +455,17 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   			atomic_inc(&dio->ref);
>>   		}
>>   
>> +		dio->size += bio->bi_iter.bi_size;
>>   		qc = submit_bio(bio);
>>   		if (qc == BLK_QC_T_EAGAIN) {
>> -			if (!ret)
>> -				ret = -EAGAIN;
>> +			dio->size -= bio->bi_iter.bi_size;
>> +			ret = -EAGAIN;
>>   			goto error;
>>   		}
>> -		ret = dio->size;
>>   
>>   		bio = bio_alloc(gfp, nr_pages);
>>   		if (!bio) {
>> -			if (!ret)
>> -				ret = -EAGAIN;
>> +			ret = -EAGAIN;
>>   			goto error;
>>   		}
>>   	}
>> @@ -496,6 +490,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>   out:
>>   	if (!ret)
>>   		ret = blk_status_to_errno(dio->bio.bi_status);
>> +	if (likely(!ret))
>> +		ret = dio->size;
>>   
>>   	bio_put(&dio->bio);
>>   	return ret;
>>
> 
> Jens,
> 
> I tested a slightly modified version of your patch. I think it is 100%
> equivalent, but a little cleaner in my opinion.

[snip]

I already queued up the other one days ago, don't think there's much
difference between them. In any case, I've read your full email and I'll
take a look at the NOWAIT + sync case. I'm currently out though, so it
won't be until Monday. Because of that, I'll ship what I have today as
it fixes the most common case, then go over the sync+nowait on Monday to
see what's up there.
diff mbox series

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c2a85b587922..75cc7f424b3a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -439,6 +439,7 @@  __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 					ret = -EAGAIN;
 				goto error;
 			}
+			ret = dio->size;
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -465,7 +466,7 @@  __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				ret = -EAGAIN;
 			goto error;
 		}
-		ret += bio->bi_iter.bi_size;
+		ret = dio->size;
 
 		bio = bio_alloc(gfp, nr_pages);
 		if (!bio) {