
Silent data corruption in blkdev_direct_IO()

Message ID 3419a3ae-da82-9c20-26e1-7c9ed14ff8ed@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe July 12, 2018, 3:08 p.m. UTC
On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> Hi Jens, Christoph,
> 
> we're currently hunting down a silent data corruption occurring due to
> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> simplified bdev direct-io").
> 
> While the whole thing is still hazy on the details, the one thing we've
> found is that reverting that patch fixes the data corruption.
> 
> And looking closer, I've found this:
> 
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> 	int nr_pages;
> 
> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> 	if (!nr_pages)
> 		return 0;
> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> 
> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> }
> 
> When checking the call path
> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> 
> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> It's not that we can handle it in __blkdev_direct_IO() ...

The logic could be cleaned up like below, the sync part is really all
we care about. What is the test case for this? async or sync?

I also don't remember why it's BIO_MAX_PAGES + 1...

Comments

Martin Wilck July 12, 2018, 4:11 p.m. UTC | #1
On Thu, 2018-07-12 at 09:08 -0600, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > Hi Jens, Christoph,
> > 
> > we're currently hunting down a silent data corruption occurring due
> > to
> > commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io").
> > 
> > While the whole thing is still hazy on the details, the one thing
> > we've
> > found is that reverting that patch fixes the data corruption.
> > 
> > And looking closer, I've found this:
> > 
> > static ssize_t
> > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > 	int nr_pages;
> > 
> > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > 	if (!nr_pages)
> > 		return 0;
> > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > 
> > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > BIO_MAX_PAGES));
> > }
> > 
> > When checking the call path
> > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> > 
> > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > It's not that we can handle it in __blkdev_direct_IO() ...
> 
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?

It's sync, and the corruption is in the __blkdev_direct_IO_simple()
path. It starts to occur with your "block: support a full bio worth of
IO for simplified bdev direct-io" patch, which causes the "simple" path
to be taken for larger IO sizes.

Martin



> 
> I also don't remember why it's BIO_MAX_PAGES + 1...
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct
> iov_iter *iter)
>  {
>  	int nr_pages;
>  
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  	if (!nr_pages)
>  		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>  		return __blkdev_direct_IO_simple(iocb, iter,
> nr_pages);
>  
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)
>
Hannes Reinecke July 12, 2018, 4:14 p.m. UTC | #2
On 07/12/2018 05:08 PM, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>> Hi Jens, Christoph,
>>
>> we're currently hunting down a silent data corruption occurring due to
>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>> simplified bdev direct-io").
>>
>> While the whole thing is still hazy on the details, the one thing we've
>> found is that reverting that patch fixes the data corruption.
>>
>> And looking closer, I've found this:
>>
>> static ssize_t
>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>> {
>> 	int nr_pages;
>>
>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> 	if (!nr_pages)
>> 		return 0;
>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>
>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> }
>>
>> When checking the call path
>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>
>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>> It's not that we can handle it in __blkdev_direct_IO() ...
> 
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?
> 
> I also don't remember why it's BIO_MAX_PAGES + 1...
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>   {
>   	int nr_pages;
>   
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>   	if (!nr_pages)
>   		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>   
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>   }
>   
>   static __init int blkdev_init(void)
> 
Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
Another question (which probably shows my complete ignorance):
What happens if the iter holds >= BIO_MAX_PAGES?
'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of 
whether the iter contains more pages.
What happens to those left-over pages?
Will they ever be processed?

Cheers,

Hannes
Jens Axboe July 12, 2018, 4:20 p.m. UTC | #3
On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>> Hi Jens, Christoph,
>>>
>>> we're currently hunting down a silent data corruption occurring due to
>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>> simplified bdev direct-io").
>>>
>>> While the whole thing is still hazy on the details, the one thing we've
>>> found is that reverting that patch fixes the data corruption.
>>>
>>> And looking closer, I've found this:
>>>
>>> static ssize_t
>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> 	int nr_pages;
>>>
>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> 	if (!nr_pages)
>>> 		return 0;
>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>
>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> }
>>>
>>> When checking the call path
>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>
>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>
>> The logic could be cleaned up like below, the sync part is really all
>> we care about. What is the test case for this? async or sync?
>>
>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 0dd87aaeb39a..14ef3d71b55f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>   {
>>   	int nr_pages;
>>   
>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>   	if (!nr_pages)
>>   		return 0;
>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> +	if (is_sync_kiocb(iocb))
>>   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>   
>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>   }
>>   
>>   static __init int blkdev_init(void)
>>
> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.

It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
does simplify that part...

> Another question (which probably shows my complete ignorance):
> What happens if the iter holds >= BIO_MAX_PAGES?
> 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of 
> whether the iter contains more pages.
> What happens to those left-over pages?
> Will they ever be processed?

Short read or write, we rely on being called again for the remainder.
Jens Axboe July 12, 2018, 4:42 p.m. UTC | #4
On 7/12/18 10:20 AM, Jens Axboe wrote:
> On 7/12/18 10:14 AM, Hannes Reinecke wrote:
>> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>>> Hi Jens, Christoph,
>>>>
>>>> we're currently hunting down a silent data corruption occurring due to
>>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>>> simplified bdev direct-io").
>>>>
>>>> While the whole thing is still hazy on the details, the one thing we've
>>>> found is that reverting that patch fixes the data corruption.
>>>>
>>>> And looking closer, I've found this:
>>>>
>>>> static ssize_t
>>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> 	int nr_pages;
>>>>
>>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>>> 	if (!nr_pages)
>>>> 		return 0;
>>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>>
>>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>>> }
>>>>
>>>> When checking the call path
>>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>>
>>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>>
>>> The logic could be cleaned up like below, the sync part is really all
>>> we care about. What is the test case for this? async or sync?
>>>
>>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index 0dd87aaeb39a..14ef3d71b55f 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>   {
>>>   	int nr_pages;
>>>   
>>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>>   	if (!nr_pages)
>>>   		return 0;
>>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> +	if (is_sync_kiocb(iocb))
>>>   		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>   
>>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>>   }
>>>   
>>>   static __init int blkdev_init(void)
>>>
>> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
> 
> It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
> does simplify that part...

OK, now I remember. The +1 is just to check if there are actually more
pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
within that one bio. __blkdev_direct_IO() will loop just fine and
will finish any size, BIO_MAX_PAGES at a time.

Hence the patch I sent is wrong, the code actually looks fine. Which
means we're back to trying to figure out what is going on here. It'd
be great to have a test case...
Martin Wilck July 13, 2018, 6:47 a.m. UTC | #5
On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> On 7/12/18 10:20 AM, Jens Axboe wrote:
> > On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> > > On 07/12/2018 05:08 PM, Jens Axboe wrote:
> > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > > > > Hi Jens, Christoph,
> > > > > 
> > > > > we're currently hunting down a silent data corruption
> > > > > occurring due to
> > > > > commit 72ecad22d9f1 ("block: support a full bio worth of IO
> > > > > for
> > > > > simplified bdev direct-io").
> > > > > 
> > > > > While the whole thing is still hazy on the details, the one
> > > > > thing we've
> > > > > found is that reverting that patch fixes the data corruption.
> > > > > 
> > > > > And looking closer, I've found this:
> > > > > 
> > > > > static ssize_t
> > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > > > {
> > > > > 	int nr_pages;
> > > > > 
> > > > > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > > 	if (!nr_pages)
> > > > > 		return 0;
> > > > > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > > 		return __blkdev_direct_IO_simple(iocb, iter,
> > > > > nr_pages);
> > > > > 
> > > > > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > > > > BIO_MAX_PAGES));
> > > > > }
> > > > > 
> > > > > When checking the call path
> > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > > > > I found that bvec_alloc() will fail if nr_pages >
> > > > > BIO_MAX_PAGES.
> > > > > 
> > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > > > > It's not that we can handle it in __blkdev_direct_IO() ...
> > > > 
> > > > The logic could be cleaned up like below, the sync part is
> > > > really all
> > > > we care about. What is the test case for this? async or sync?
> > > > 
> > > > I also don't remember why it's BIO_MAX_PAGES + 1...
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aaeb39a..14ef3d71b55f 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb,
> > > > struct iov_iter *iter)
> > > >   {
> > > >   	int nr_pages;
> > > >   
> > > > -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> > > >   	if (!nr_pages)
> > > >   		return 0;
> > > > -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > +	if (is_sync_kiocb(iocb))
> > > >   		return __blkdev_direct_IO_simple(iocb, iter,
> > > > nr_pages);
> > > >   
> > > > -	return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> > > > BIO_MAX_PAGES));
> > > > +	return __blkdev_direct_IO(iocb, iter, nr_pages);
> > > >   }
> > > >   
> > > >   static __init int blkdev_init(void)
> > > > 
> > > 
> > > Hmm. We'll give it a go, but somehow I feel this won't solve our
> > > problem.
> > 
> > It probably won't, the only joker here is the BIO_MAX_PAGES + 1.
> > But it
> > does simplify that part...
> 
> OK, now I remember. The +1 is just to check if there are actually
> more
> pages. __blkdev_direct_IO_simple() only does one bio, so it has to
> fit
> within that one bio. __blkdev_direct_IO() will loop just fine and
> will finish any size, BIO_MAX_PAGES at a time.

Right. Hannes, I think we (myself at least) have been misled by
looking at outdated code. The key point which I missed is that
__blkdev_direct_IO() is called with min(nr_pages, BIO_MAX_PAGES), and
advances beyond that later in the loop.

> Hence the patch I sent is wrong, the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great to have a test case...

Unfortunately we have no reproducer just yet. Only the customer can
reproduce it. The scenario is a data base running on a KVM virtual
machine on top of a virtio-scsi volume backed by a multipath map, with
cache='none' in qemu.

My latest thinking is that if, for whatever reason I don't yet
understand, blkdev_direct_IO() resulted in a short write,
__generic_file_write_iter() would fall back to buffered writing, which
might explain the data corruption we observe. That's just speculation
at this stage.

Regards
Martin

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@  blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	int nr_pages;
 
-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 	if (!nr_pages)
 		return 0;
-	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+	if (is_sync_kiocb(iocb))
 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
 
-	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+	return __blkdev_direct_IO(iocb, iter, nr_pages);
 }
 
 static __init int blkdev_init(void)