Message ID | 3419a3ae-da82-9c20-26e1-7c9ed14ff8ed@kernel.dk (mailing list archive)
---|---
State | New, archived
On Thu, 2018-07-12 at 09:08 -0600, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > Hi Jens, Christoph,
> >
> > we're currently hunting down a silent data corruption occurring due to
> > commit 72ecad22d9f1 ("block: support a full bio worth of IO for
> > simplified bdev direct-io").
> >
> > While the whole thing is still hazy on the details, the one thing we've
> > found is that reverting that patch fixes the data corruption.
> >
> > And looking closer, I've found this:
> >
> > static ssize_t
> > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > 	int nr_pages;
> >
> > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > 	if (!nr_pages)
> > 		return 0;
> > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> >
> > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > }
> >
> > When checking the call path
> > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> >
> > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > It's not that we can handle it in __blkdev_direct_IO() ...
>
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?

It's sync, and the corruption is in the __blkdev_direct_IO_simple()
path. It starts to occur with your "block: support a full bio worth of
IO for simplified bdev direct-io" patch, which causes the "simple" path
to be taken for larger IO sizes.

Martin

>
> I also don't remember why it's BIO_MAX_PAGES + 1...
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	int nr_pages;
>  
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  	if (!nr_pages)
>  		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>  
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)
On 07/12/2018 05:08 PM, Jens Axboe wrote:
> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>> Hi Jens, Christoph,
>>
>> we're currently hunting down a silent data corruption occurring due to
>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>> simplified bdev direct-io").
>>
>> While the whole thing is still hazy on the details, the one thing we've
>> found is that reverting that patch fixes the data corruption.
>>
>> And looking closer, I've found this:
>>
>> static ssize_t
>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>> {
>> 	int nr_pages;
>>
>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> 	if (!nr_pages)
>> 		return 0;
>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>
>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> }
>>
>> When checking the call path
>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>
>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>> It's not that we can handle it in __blkdev_direct_IO() ...
>
> The logic could be cleaned up like below, the sync part is really all
> we care about. What is the test case for this? async or sync?
>
> I also don't remember why it's BIO_MAX_PAGES + 1...
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0dd87aaeb39a..14ef3d71b55f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	int nr_pages;
>  
> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  	if (!nr_pages)
>  		return 0;
> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> +	if (is_sync_kiocb(iocb))
>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>  
> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>  }
>  
>  static __init int blkdev_init(void)
>

Hmm. We'll give it a go, but somehow I feel this won't solve our problem.

Another question (which probably shows my complete ignorance):
What happens if the iter holds >= BIO_MAX_PAGES?
'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
whether the iter contains more pages.
What happens to those left-over pages?
Will they ever be processed?

Cheers,

Hannes
On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>> Hi Jens, Christoph,
>>>
>>> we're currently hunting down a silent data corruption occurring due to
>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>> simplified bdev direct-io").
>>>
>>> While the whole thing is still hazy on the details, the one thing we've
>>> found is that reverting that patch fixes the data corruption.
>>>
>>> And looking closer, I've found this:
>>>
>>> static ssize_t
>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> 	int nr_pages;
>>>
>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> 	if (!nr_pages)
>>> 		return 0;
>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>
>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> }
>>>
>>> When checking the call path
>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>
>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>
>> The logic could be cleaned up like below, the sync part is really all
>> we care about. What is the test case for this? async or sync?
>>
>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 0dd87aaeb39a..14ef3d71b55f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>  {
>>  	int nr_pages;
>>  
>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>  	if (!nr_pages)
>>  		return 0;
>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> +	if (is_sync_kiocb(iocb))
>>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>  
>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>  }
>>  
>>  static __init int blkdev_init(void)
>>
> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.

It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
does simplify that part...

> Another question (which probably shows my complete ignorance):
> What happens if the iter holds >= BIO_MAX_PAGES?
> 'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent of
> whether the iter contains more pages.
> What happens to those left-over pages?
> Will they ever be processed?

Short read or write, we rely on being called again for the remainder.
On 7/12/18 10:20 AM, Jens Axboe wrote:
> On 7/12/18 10:14 AM, Hannes Reinecke wrote:
>> On 07/12/2018 05:08 PM, Jens Axboe wrote:
>>> On 7/12/18 8:36 AM, Hannes Reinecke wrote:
>>>> Hi Jens, Christoph,
>>>>
>>>> we're currently hunting down a silent data corruption occurring due to
>>>> commit 72ecad22d9f1 ("block: support a full bio worth of IO for
>>>> simplified bdev direct-io").
>>>>
>>>> While the whole thing is still hazy on the details, the one thing we've
>>>> found is that reverting that patch fixes the data corruption.
>>>>
>>>> And looking closer, I've found this:
>>>>
>>>> static ssize_t
>>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> 	int nr_pages;
>>>>
>>>> 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>>> 	if (!nr_pages)
>>>> 		return 0;
>>>> 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>>> 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>>
>>>> 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>>> }
>>>>
>>>> When checking the call path
>>>> __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
>>>> I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
>>>>
>>>> So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
>>>> It's not that we can handle it in __blkdev_direct_IO() ...
>>>
>>> The logic could be cleaned up like below, the sync part is really all
>>> we care about. What is the test case for this? async or sync?
>>>
>>> I also don't remember why it's BIO_MAX_PAGES + 1...
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index 0dd87aaeb39a..14ef3d71b55f 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>  {
>>>  	int nr_pages;
>>>  
>>> -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
>>> +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>>  	if (!nr_pages)
>>>  		return 0;
>>> -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> +	if (is_sync_kiocb(iocb))
>>>  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>  
>>> -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
>>> +	return __blkdev_direct_IO(iocb, iter, nr_pages);
>>>  }
>>>  
>>>  static __init int blkdev_init(void)
>>>
>> Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
>
> It probably won't, the only joker here is the BIO_MAX_PAGES + 1. But it
> does simplify that part...

OK, now I remember. The +1 is just to check if there are actually more
pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
within that one bio. __blkdev_direct_IO() will loop just fine and will
finish any size, BIO_MAX_PAGES at a time.

Hence the patch I sent is wrong, the code actually looks fine. Which
means we're back to trying to figure out what is going on here. It'd
be great with a test case...
On Thu, 2018-07-12 at 10:42 -0600, Jens Axboe wrote:
> On 7/12/18 10:20 AM, Jens Axboe wrote:
> > On 7/12/18 10:14 AM, Hannes Reinecke wrote:
> > > On 07/12/2018 05:08 PM, Jens Axboe wrote:
> > > > On 7/12/18 8:36 AM, Hannes Reinecke wrote:
> > > > > Hi Jens, Christoph,
> > > > >
> > > > > we're currently hunting down a silent data corruption occurring
> > > > > due to commit 72ecad22d9f1 ("block: support a full bio worth of
> > > > > IO for simplified bdev direct-io").
> > > > >
> > > > > While the whole thing is still hazy on the details, the one
> > > > > thing we've found is that reverting that patch fixes the data
> > > > > corruption.
> > > > >
> > > > > And looking closer, I've found this:
> > > > >
> > > > > static ssize_t
> > > > > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > > > {
> > > > > 	int nr_pages;
> > > > >
> > > > > 	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > > 	if (!nr_pages)
> > > > > 		return 0;
> > > > > 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > > 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > > >
> > > > > 	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > > > > }
> > > > >
> > > > > When checking the call path
> > > > > __blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
> > > > > I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.
> > > > >
> > > > > So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
> > > > > It's not that we can handle it in __blkdev_direct_IO() ...
> > > >
> > > > The logic could be cleaned up like below, the sync part is
> > > > really all we care about. What is the test case for this?
> > > > async or sync?
> > > >
> > > > I also don't remember why it's BIO_MAX_PAGES + 1...
> > > >
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 0dd87aaeb39a..14ef3d71b55f 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> > > >  {
> > > >  	int nr_pages;
> > > >  
> > > > -	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
> > > > +	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
> > > >  	if (!nr_pages)
> > > >  		return 0;
> > > > -	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> > > > +	if (is_sync_kiocb(iocb))
> > > >  		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > >  
> > > > -	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
> > > > +	return __blkdev_direct_IO(iocb, iter, nr_pages);
> > > >  }
> > > >  
> > > >  static __init int blkdev_init(void)
> > > >
> > > Hmm. We'll give it a go, but somehow I feel this won't solve our
> > > problem.
> >
> > It probably won't, the only joker here is the BIO_MAX_PAGES + 1.
> > But it does simplify that part...
>
> OK, now I remember. The +1 is just to check if there are actually more
> pages. __blkdev_direct_IO_simple() only does one bio, so it has to fit
> within that one bio. __blkdev_direct_IO() will loop just fine and will
> finish any size, BIO_MAX_PAGES at a time.

Right. Hannes, I think we (or at least I) have been misled by looking at
outdated code. The key point which I missed is that __blkdev_direct_IO()
is called with min(nr_pages, BIO_MAX_PAGES), and advances beyond that
later in its loop.

> Hence the patch I sent is wrong, the code actually looks fine. Which
> means we're back to trying to figure out what is going on here. It'd
> be great with a test case...

Unfortunately we have no reproducer just yet; only the customer can
reproduce it. The scenario is a database running on a KVM virtual
machine on top of a virtio-scsi volume backed by a multipath map, with
cache='none' in qemu.

My late thinking is that if, for whatever reason I don't yet understand,
blkdev_direct_IO() resulted in a short write,
__generic_file_write_iter() would fall back to buffered writing, which
might be a possible explanation for the data corruption we observe.
That's just speculation at the current stage.

Regards
Martin
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	int nr_pages;
 
-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 	if (!nr_pages)
 		return 0;
-	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+	if (is_sync_kiocb(iocb))
 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
 
-	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+	return __blkdev_direct_IO(iocb, iter, nr_pages);
 }
 
 static __init int blkdev_init(void)