Message ID | 262a1e171d091626edbd23c637cb138ba9d84ed8.1478733376.git.osandov@fb.com (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > From: Omar Sandoval <osandov@fb.com> > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > errors coming from the qcow2 virtual drive in the host system. The qcow2 > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > Every once in awhile, pread() or pwrite() would return EEXIST, which > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > two threads race on adding the same extent map to an inode's extent map > tree. However, if the added em is merged with an adjacent em in the > extent tree, then we'll end up with an existing extent that is not > identical to but instead encompasses the extent we tried to add. When we > call merge_extent_mapping() to find the nonoverlapping part of the new > em, the arithmetic overflows because there is no such thing. We then end > up trying to add a bogus em to the em_tree, which results in a EEXIST > that can bubble all the way up to userspace. Is this possibly the same bug that Liu Bo fixes in https://patchwork.kernel.org/patch/9413129/ ? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/10/16 00:26, Omar Sandoval wrote: > From: Omar Sandoval <osandov@fb.com> > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > errors coming from the qcow2 virtual drive in the host system. The qcow2 > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > Every once in awhile, pread() or pwrite() would return EEXIST, which > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > two threads race on adding the same extent map to an inode's extent map > tree. However, if the added em is merged with an adjacent em in the > extent tree, then we'll end up with an existing extent that is not > identical to but instead encompasses the extent we tried to add. When we > call merge_extent_mapping() to find the nonoverlapping part of the new > em, the arithmetic overflows because there is no such thing. We then end > up trying to add a bogus em to the em_tree, which results in a EEXIST > that can bubble all the way up to userspace. > > Fix it by extending the identical extent map special case. > > Signed-off-by: Omar Sandoval <osandov@fb.com> > --- > Applies to 4.9-rc4. > > Here [1] is a reproducer for this bug that doesn't involve firing up a > QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible > to debug this on my laptop without compiling a custom kernel and > rebooting just to add printks [3]. > > 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b I can't really make this reproducer fail. It builds and runs fine, but just exits with no messages (other than the one about drop_caches in dmesg). It creates the 1MB file and always returns 0. Ideas? -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote: > On 11/10/16 00:26, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@fb.com> > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > two threads race on adding the same extent map to an inode's extent map > > tree. However, if the added em is merged with an adjacent em in the > > extent tree, then we'll end up with an existing extent that is not > > identical to but instead encompasses the extent we tried to add. When we > > call merge_extent_mapping() to find the nonoverlapping part of the new > > em, the arithmetic overflows because there is no such thing. We then end > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > that can bubble all the way up to userspace. > > > > Fix it by extending the identical extent map special case. > > > > Signed-off-by: Omar Sandoval <osandov@fb.com> > > --- > > Applies to 4.9-rc4. > > > > Here [1] is a reproducer for this bug that doesn't involve firing up a > > QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible > > to debug this on my laptop without compiling a custom kernel and > > rebooting just to add printks [3]. > > > > 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b > > I can't really make this reproducer fail. It builds and runs fine, but just > exits with no messages (other than the one about drop_caches in dmesg). > It creates the 1MB file and always returns 0. Ideas? > > -h It's a race condition, so it doesn't happen 100% of the time. I imagine it depends on the storage speed, as well. On my laptop, which is dm-crypt on top of an SSD, it works about 50% of the time. Could you just try running it 100 times or something and see if it fails?
On 11/10/16 16:06, David Sterba wrote: > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: >> From: Omar Sandoval <osandov@fb.com> >> [snip] >> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map >> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where >> two threads race on adding the same extent map to an inode's extent map >> tree. However, if the added em is merged with an adjacent em in the >> extent tree, then we'll end up with an existing extent that is not >> identical to but instead encompasses the extent we tried to add. When we >> call merge_extent_mapping() to find the nonoverlapping part of the new >> em, the arithmetic overflows because there is no such thing. We then end >> up trying to add a bogus em to the em_tree, which results in a EEXIST >> that can bubble all the way up to userspace. > > Is this possibly the same bug that Liu Bo fixes in > https://patchwork.kernel.org/patch/9413129/ ? Seem similar, but I can't reproduce without that patch either.. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 04:06:51PM +0100, David Sterba wrote: > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@fb.com> > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > two threads race on adding the same extent map to an inode's extent map > > tree. However, if the added em is merged with an adjacent em in the > > extent tree, then we'll end up with an existing extent that is not > > identical to but instead encompasses the extent we tried to add. When we > > call merge_extent_mapping() to find the nonoverlapping part of the new > > em, the arithmetic overflows because there is no such thing. We then end > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > that can bubble all the way up to userspace. > > Is this possibly the same bug that Liu Bo fixes in > https://patchwork.kernel.org/patch/9413129/ ? This looks different. In my case, btrfs_get_extent() fails before we'd even get that far into btrfs_get_blocks_direct().
On 11/10/16 16:37, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote: >> On 11/10/16 00:26, Omar Sandoval wrote: >>> From: Omar Sandoval <osandov@fb.com> >>> >>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to >>> errors coming from the qcow2 virtual drive in the host system. The qcow2 >>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. >>> Every once in awhile, pread() or pwrite() would return EEXIST, which >>> makes no sense. This turned out to be a bug in btrfs_get_extent(). >>> >>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map >>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where >>> two threads race on adding the same extent map to an inode's extent map >>> tree. However, if the added em is merged with an adjacent em in the >>> extent tree, then we'll end up with an existing extent that is not >>> identical to but instead encompasses the extent we tried to add. When we >>> call merge_extent_mapping() to find the nonoverlapping part of the new >>> em, the arithmetic overflows because there is no such thing. We then end >>> up trying to add a bogus em to the em_tree, which results in a EEXIST >>> that can bubble all the way up to userspace. >>> >>> Fix it by extending the identical extent map special case. >>> >>> Signed-off-by: Omar Sandoval <osandov@fb.com> >>> --- >>> Applies to 4.9-rc4. >>> >>> Here [1] is a reproducer for this bug that doesn't involve firing up a >>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible >>> to debug this on my laptop without compiling a custom kernel and >>> rebooting just to add printks [3]. >>> >>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b >> >> I can't really make this reproducer fail. It builds and runs fine, but just >> exits with no messages (other than the one about drop_caches in dmesg). >> It creates the 1MB file and always returns 0. Ideas? >> >> -h > > It's a race condition, so it doesn't happen 100% of the time. I imagine > it depends on the storage speed, as well. On my laptop, which is > dm-crypt on top of an SSD, it works about 50% of the time. Could you > just try running it 100 times or something and see if it fails? $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail" ..couple of thousand runs without problem, only lots of fallocating and cache dropping. Oh well, I tried. :) -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote: > On 11/10/16 16:37, Omar Sandoval wrote: > > On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote: > >> On 11/10/16 00:26, Omar Sandoval wrote: > >>> From: Omar Sandoval <osandov@fb.com> > >>> > >>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to > >>> errors coming from the qcow2 virtual drive in the host system. The qcow2 > >>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > >>> Every once in awhile, pread() or pwrite() would return EEXIST, which > >>> makes no sense. This turned out to be a bug in btrfs_get_extent(). > >>> > >>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > >>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > >>> two threads race on adding the same extent map to an inode's extent map > >>> tree. However, if the added em is merged with an adjacent em in the > >>> extent tree, then we'll end up with an existing extent that is not > >>> identical to but instead encompasses the extent we tried to add. When we > >>> call merge_extent_mapping() to find the nonoverlapping part of the new > >>> em, the arithmetic overflows because there is no such thing. We then end > >>> up trying to add a bogus em to the em_tree, which results in a EEXIST > >>> that can bubble all the way up to userspace. > >>> > >>> Fix it by extending the identical extent map special case. > >>> > >>> Signed-off-by: Omar Sandoval <osandov@fb.com> > >>> --- > >>> Applies to 4.9-rc4. > >>> > >>> Here [1] is a reproducer for this bug that doesn't involve firing up a > >>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible > >>> to debug this on my laptop without compiling a custom kernel and > >>> rebooting just to add printks [3]. > >>> > >>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b > >> > >> I can't really make this reproducer fail. It builds and runs fine, but just > >> exits with no messages (other than the one about drop_caches in dmesg). > >> It creates the 1MB file and always returns 0. Ideas? > >> > >> -h > > > > It's a race condition, so it doesn't happen 100% of the time. I imagine > > it depends on the storage speed, as well. On my laptop, which is > > dm-crypt on top of an SSD, it works about 50% of the time. Could you > > just try running it 100 times or something and see if it fails? > > $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail" > > ..couple of thousand runs without problem, only lots of fallocating and > cache dropping. > > Oh well, I tried. :) > > -h Just out of curiousity, what kind of disk were you trying this on? I've only been able to trigger it on my laptop and a VM running on my laptop.
On 11/10/16 17:20, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote: >> On 11/10/16 16:37, Omar Sandoval wrote: >>> On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote: >>>> On 11/10/16 00:26, Omar Sandoval wrote: >>>>> From: Omar Sandoval <osandov@fb.com> >>>>> >>>>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to >>>>> errors coming from the qcow2 virtual drive in the host system. The qcow2 >>>>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. >>>>> Every once in awhile, pread() or pwrite() would return EEXIST, which >>>>> makes no sense. This turned out to be a bug in btrfs_get_extent(). >>>>> >>>>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map >>>>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where >>>>> two threads race on adding the same extent map to an inode's extent map >>>>> tree. However, if the added em is merged with an adjacent em in the >>>>> extent tree, then we'll end up with an existing extent that is not >>>>> identical to but instead encompasses the extent we tried to add. When we >>>>> call merge_extent_mapping() to find the nonoverlapping part of the new >>>>> em, the arithmetic overflows because there is no such thing. We then end >>>>> up trying to add a bogus em to the em_tree, which results in a EEXIST >>>>> that can bubble all the way up to userspace. >>>>> >>>>> Fix it by extending the identical extent map special case. >>>>> >>>>> Signed-off-by: Omar Sandoval <osandov@fb.com> >>>>> --- >>>>> Applies to 4.9-rc4. >>>>> >>>>> Here [1] is a reproducer for this bug that doesn't involve firing up a >>>>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible >>>>> to debug this on my laptop without compiling a custom kernel and >>>>> rebooting just to add printks [3]. >>>>> >>>>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b >>>> >>>> I can't really make this reproducer fail. It builds and runs fine, but just >>>> exits with no messages (other than the one about drop_caches in dmesg). >>>> It creates the 1MB file and always returns 0. Ideas? >>>> >>>> -h >>> >>> It's a race condition, so it doesn't happen 100% of the time. I imagine >>> it depends on the storage speed, as well. On my laptop, which is >>> dm-crypt on top of an SSD, it works about 50% of the time. Could you >>> just try running it 100 times or something and see if it fails? >> >> $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail" >> >> ..couple of thousand runs without problem, only lots of fallocating and >> cache dropping. >> >> Oh well, I tried. :) >> >> -h > > Just out of curiousity, what kind of disk were you trying this on? I've > only been able to trigger it on my laptop and a VM running on my laptop. Tried on both an SSD and an old slowpoke 2.5" rotational disk on USB2. But I also have a ton of other patches and a custom CPU scheduler, so everything it likely my fault anyway. Don't sweat it. :) From what I can tell the explanation of the problem and the change itself make sense. Would have been nice to be able to repro. cheers, -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > From: Omar Sandoval <osandov@fb.com> > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > errors coming from the qcow2 virtual drive in the host system. The qcow2 > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > Every once in awhile, pread() or pwrite() would return EEXIST, which > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > two threads race on adding the same extent map to an inode's extent map > tree. However, if the added em is merged with an adjacent em in the > extent tree, then we'll end up with an existing extent that is not > identical to but instead encompasses the extent we tried to add. When we > call merge_extent_mapping() to find the nonoverlapping part of the new > em, the arithmetic overflows because there is no such thing. We then end > up trying to add a bogus em to the em_tree, which results in a EEXIST > that can bubble all the way up to userspace. I don't get how this could happen(even after reading Commit 8dff9c853410), btrfs_get_extent in direct_IO is protected by lock_extent_direct, the assumption is that a racy thread should be blocked by lock_extent_direct and when it gets the lock, it finds the just-inserted em when going into btrfs_get_extent if its offset is within [em->start, extent_map_end(em)]. I think we may also need to figure out why the above doesn't work as expected besides fixing another special case. Thanks, -liubo > > Fix it by extending the identical extent map special case. > > Signed-off-by: Omar Sandoval <osandov@fb.com> > --- > Applies to 4.9-rc4. > > Here [1] is a reproducer for this bug that doesn't involve firing up a > QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible > to debug this on my laptop without compiling a custom kernel and > rebooting just to add printks [3]. > > 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b > 2: https://github.com/iovisor/bcc > 3: https://gist.github.com/osandov/eb1db868ce10c3af9e00b90f3a65bf9f > > fs/btrfs/inode.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 2b790bd..e5cf589 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7049,11 +7049,11 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page, > * extent causing the -EEXIST. > */ > if (existing->start == em->start && > - extent_map_end(existing) == extent_map_end(em) && > + extent_map_end(existing) >= extent_map_end(em) && > em->block_start == existing->block_start) { > /* > - * these two extents are the same, it happens > - * with inlines especially > + * The existing extent map already encompasses the > + * entire extent map we tried to add. > */ > free_extent_map(em); > em = existing; > -- > 2.10.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote: > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@fb.com> > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > two threads race on adding the same extent map to an inode's extent map > > tree. However, if the added em is merged with an adjacent em in the > > extent tree, then we'll end up with an existing extent that is not > > identical to but instead encompasses the extent we tried to add. When we > > call merge_extent_mapping() to find the nonoverlapping part of the new > > em, the arithmetic overflows because there is no such thing. We then end > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > that can bubble all the way up to userspace. > > I don't get how this could happen(even after reading Commit > 8dff9c853410), btrfs_get_extent in direct_IO is protected by > lock_extent_direct, the assumption is that a racy thread should be > blocked by lock_extent_direct and when it gets the lock, it finds the > just-inserted em when going into btrfs_get_extent if its offset is > within [em->start, extent_map_end(em)]. > > I think we may also need to figure out why the above doesn't work as > expected besides fixing another special case. > > Thanks, > > -liubo lock_extent_direct() only protects the range you're doing I/O into, not the entire extent. If two threads are doing two non-overlapping reads in the same extent, then you can get this race.
On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote: > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > > From: Omar Sandoval <osandov@fb.com> > > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > > two threads race on adding the same extent map to an inode's extent map > > > tree. However, if the added em is merged with an adjacent em in the > > > extent tree, then we'll end up with an existing extent that is not > > > identical to but instead encompasses the extent we tried to add. When we > > > call merge_extent_mapping() to find the nonoverlapping part of the new > > > em, the arithmetic overflows because there is no such thing. We then end > > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > > that can bubble all the way up to userspace. > > > > I don't get how this could happen(even after reading Commit > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by > > lock_extent_direct, the assumption is that a racy thread should be > > blocked by lock_extent_direct and when it gets the lock, it finds the > > just-inserted em when going into btrfs_get_extent if its offset is > > within [em->start, extent_map_end(em)]. > > > > I think we may also need to figure out why the above doesn't work as > > expected besides fixing another special case. > > > > Thanks, > > > > -liubo > > lock_extent_direct() only protects the range you're doing I/O into, not > the entire extent. If two threads are doing two non-overlapping reads in > the same extent, then you can get this race. More concretely, assume the extent tree on disk has: +-------------------------+-------------------------------+ |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192| +-------------------------+-------------------------------+ And the extent map tree in memory has a single em cached for the second extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do direct I/O reads: Thread 1 | Thread 2 ---------------------------------------+------------------------------- pread(offset=0, nbyte=4096) | pread(offset=4096, nbyte=4096) lock_extent_direct(start=0, end=4095) | lock_extent_direct(start=4096, end=8191) btrfs_get_extent(start=0, len=4096) | btrfs_get_extent(start=4096, len4096) lookup_extent_mapping() = NULL | lookup_extent_mapping() = NULL reads extent from B-tree | reads extent from B-tree | write_lock(&em_tree->lock) | add_extent_mapping(start=0, len=8192, bytenr=0) | try_merge_map() | em_tree now has {start=0, len=16384, bytenr=0} | write_unlock(&em_tree->lock) write_lock(&em_tree->lock) | add_extent_mapping(start=0, len=8192, | bytenr=0) = -EEXIST | search_extent_mapping() = {start=0, | len=16384, | bytenr=0} | merge_extent_mapping() does bogus math | and overflows, returns EEXIST |
On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote: > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote: > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > > > From: Omar Sandoval <osandov@fb.com> > > > > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > > > two threads race on adding the same extent map to an inode's extent map > > > > tree. However, if the added em is merged with an adjacent em in the > > > > extent tree, then we'll end up with an existing extent that is not > > > > identical to but instead encompasses the extent we tried to add. When we > > > > call merge_extent_mapping() to find the nonoverlapping part of the new > > > > em, the arithmetic overflows because there is no such thing. We then end > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > > > that can bubble all the way up to userspace. > > > > > > I don't get how this could happen(even after reading Commit > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by > > > lock_extent_direct, the assumption is that a racy thread should be > > > blocked by lock_extent_direct and when it gets the lock, it finds the > > > just-inserted em when going into btrfs_get_extent if its offset is > > > within [em->start, extent_map_end(em)]. > > > > > > I think we may also need to figure out why the above doesn't work as > > > expected besides fixing another special case. > > > > > > Thanks, > > > > > > -liubo > > > > lock_extent_direct() only protects the range you're doing I/O into, not > > the entire extent. If two threads are doing two non-overlapping reads in > > the same extent, then you can get this race. > > More concretely, assume the extent tree on disk has: > > +-------------------------+-------------------------------+ > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192| > +-------------------------+-------------------------------+ > > And the extent map tree in memory has a single em cached for the second > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do > direct I/O reads: > > Thread 1 | Thread 2 > ---------------------------------------+------------------------------- > pread(offset=0, nbyte=4096) | pread(offset=4096, nbyte=4096) > lock_extent_direct(start=0, end=4095) | lock_extent_direct(start=4096, end=8191) > btrfs_get_extent(start=0, len=4096) | btrfs_get_extent(start=4096, len4096) > lookup_extent_mapping() = NULL | lookup_extent_mapping() = NULL > reads extent from B-tree | reads extent from B-tree > | write_lock(&em_tree->lock) > | add_extent_mapping(start=0, len=8192, bytenr=0) > | try_merge_map() > | em_tree now has {start=0, len=16384, bytenr=0} > | write_unlock(&em_tree->lock) > write_lock(&em_tree->lock) | > add_extent_mapping(start=0, len=8192, | > bytenr=0) = -EEXIST | > search_extent_mapping() = {start=0, | > len=16384, | > bytenr=0} | > merge_extent_mapping() does bogus math | > and overflows, returns EEXIST | Yeah, so much fun. The problem is that we lock and request [0, 4096], but we insert a em of [0, 8192] instead. So if we insert a [0, 4096] em, then we can make sure that the em returned by btrfs_get_extent is protected from race by the range of lock_extent_direct. I'll give it a shot and do some testing. For this patch, Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Would you please make a reproducer for fstests? Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote: > On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote: > > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote: > > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote: > > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > > > > From: Omar Sandoval <osandov@fb.com> > > > > > > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > > > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > > > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > > > > two threads race on adding the same extent map to an inode's extent map > > > > > tree. However, if the added em is merged with an adjacent em in the > > > > > extent tree, then we'll end up with an existing extent that is not > > > > > identical to but instead encompasses the extent we tried to add. When we > > > > > call merge_extent_mapping() to find the nonoverlapping part of the new > > > > > em, the arithmetic overflows because there is no such thing. We then end > > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > > > > that can bubble all the way up to userspace. > > > > > > > > I don't get how this could happen(even after reading Commit > > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by > > > > lock_extent_direct, the assumption is that a racy thread should be > > > > blocked by lock_extent_direct and when it gets the lock, it finds the > > > > just-inserted em when going into btrfs_get_extent if its offset is > > > > within [em->start, extent_map_end(em)]. > > > > > > > > I think we may also need to figure out why the above doesn't work as > > > > expected besides fixing another special case. > > > > > > > > Thanks, > > > > > > > > -liubo > > > > > > lock_extent_direct() only protects the range you're doing I/O into, not > > > the entire extent. If two threads are doing two non-overlapping reads in > > > the same extent, then you can get this race. > > > > More concretely, assume the extent tree on disk has: > > > > +-------------------------+-------------------------------+ > > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192| > > +-------------------------+-------------------------------+ > > > > And the extent map tree in memory has a single em cached for the second > > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do > > direct I/O reads: > > > > Thread 1 | Thread 2 > > ---------------------------------------+------------------------------- > > pread(offset=0, nbyte=4096) | pread(offset=4096, nbyte=4096) > > lock_extent_direct(start=0, end=4095) | lock_extent_direct(start=4096, end=8191) > > btrfs_get_extent(start=0, len=4096) | btrfs_get_extent(start=4096, len4096) > > lookup_extent_mapping() = NULL | lookup_extent_mapping() = NULL > > reads extent from B-tree | reads extent from B-tree > > | write_lock(&em_tree->lock) > > | add_extent_mapping(start=0, len=8192, bytenr=0) > > | try_merge_map() > > | em_tree now has {start=0, len=16384, bytenr=0} > > | write_unlock(&em_tree->lock) > > write_lock(&em_tree->lock) | > > add_extent_mapping(start=0, len=8192, | > > bytenr=0) = -EEXIST | > > search_extent_mapping() = {start=0, | > > len=16384, | > > bytenr=0} | > > merge_extent_mapping() does bogus math | > > and overflows, returns EEXIST | > > Yeah, so much fun. > > The problem is that we lock and request [0, 4096], but we insert a em of > [0, 8192] instead. So if we insert a [0, 4096] em, then we can make > sure that the em returned by btrfs_get_extent is protected from race by > the range of lock_extent_direct. > > I'll give it a shot and do some testing. > > For this patch, > > Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Thank you! > Would you please make a reproducer for fstests? Sure. Trying to trigger this with xfs_io never works because it's such a narrow race window, but I'll send in my reproducer and see what the xfstests guys think about adding new binaries. > Thanks, > > -liubo
On Thu, Nov 10, 2016 at 04:06:51PM +0100, David Sterba wrote: > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@fb.com> > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > two threads race on adding the same extent map to an inode's extent map > > tree. However, if the added em is merged with an adjacent em in the > > extent tree, then we'll end up with an existing extent that is not > > identical to but instead encompasses the extent we tried to add. When we > > call merge_extent_mapping() to find the nonoverlapping part of the new > > em, the arithmetic overflows because there is no such thing. We then end > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > that can bubble all the way up to userspace. > > Is this possibly the same bug that Liu Bo fixes in > https://patchwork.kernel.org/patch/9413129/ ? It's not, Omar is fixing a bug for read, my patch was to save space for during DIO write. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 02:45:36PM -0800, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote: > > On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote: > > > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote: > > > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote: > > > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > > > > > From: Omar Sandoval <osandov@fb.com> > > > > > > > > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > > > > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > > > > > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > > > > > two threads race on adding the same extent map to an inode's extent map > > > > > > tree. However, if the added em is merged with an adjacent em in the > > > > > > extent tree, then we'll end up with an existing extent that is not > > > > > > identical to but instead encompasses the extent we tried to add. When we > > > > > > call merge_extent_mapping() to find the nonoverlapping part of the new > > > > > > em, the arithmetic overflows because there is no such thing. We then end > > > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > > > > > that can bubble all the way up to userspace. > > > > > > > > > > I don't get how this could happen(even after reading Commit > > > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by > > > > > lock_extent_direct, the assumption is that a racy thread should be > > > > > blocked by lock_extent_direct and when it gets the lock, it finds the > > > > > just-inserted em when going into btrfs_get_extent if its offset is > > > > > within [em->start, extent_map_end(em)]. > > > > > > > > > > I think we may also need to figure out why the above doesn't work as > > > > > expected besides fixing another special case. > > > > > > > > > > Thanks, > > > > > > > > > > -liubo > > > > > > > > lock_extent_direct() only protects the range you're doing I/O into, not > > > > the entire extent. If two threads are doing two non-overlapping reads in > > > > the same extent, then you can get this race. > > > > > > More concretely, assume the extent tree on disk has: > > > > > > +-------------------------+-------------------------------+ > > > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192| > > > +-------------------------+-------------------------------+ > > > > > > And the extent map tree in memory has a single em cached for the second > > > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do > > > direct I/O reads: > > > > > > Thread 1 | Thread 2 > > > ---------------------------------------+------------------------------- > > > pread(offset=0, nbyte=4096) | pread(offset=4096, nbyte=4096) > > > lock_extent_direct(start=0, end=4095) | lock_extent_direct(start=4096, end=8191) > > > btrfs_get_extent(start=0, len=4096) | btrfs_get_extent(start=4096, len4096) > > > lookup_extent_mapping() = NULL | lookup_extent_mapping() = NULL > > > reads extent from B-tree | reads extent from B-tree > > > | write_lock(&em_tree->lock) > > > | add_extent_mapping(start=0, len=8192, bytenr=0) > > > | try_merge_map() > > > | em_tree now has {start=0, len=16384, bytenr=0} > > > | write_unlock(&em_tree->lock) > > > write_lock(&em_tree->lock) | > > > add_extent_mapping(start=0, len=8192, | > > > bytenr=0) = -EEXIST | > > > search_extent_mapping() = {start=0, | > > > len=16384, | > > > bytenr=0} | > > > merge_extent_mapping() does bogus math | > > > and overflows, returns EEXIST | > > > > Yeah, so much fun. > > > > The problem is that we lock and request [0, 4096], but we insert a em of > > [0, 8192] instead. So if we insert a [0, 4096] em, then we can make > > sure that the em returned by btrfs_get_extent is protected from race by > > the range of lock_extent_direct. > > > > I'll give it a shot and do some testing. > > > > For this patch, > > > > Reviewed-by: Liu Bo <bo.li.liu@oracle.com> > > Thank you! > > > Would you please make a reproducer for fstests? > > Sure. Trying to trigger this with xfs_io never works because it's such a > narrow race window, but I'll send in my reproducer and see what the > xfstests guys think about adding new binaries. > xfstests patch is here: http://marc.info/?l=fstests&m=147934258732500&w=2
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2b790bd..e5cf589 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7049,11 +7049,11 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page, * extent causing the -EEXIST. */ if (existing->start == em->start && - extent_map_end(existing) == extent_map_end(em) && + extent_map_end(existing) >= extent_map_end(em) && em->block_start == existing->block_start) { /* - * these two extents are the same, it happens - * with inlines especially + * The existing extent map already encompasses the + * entire extent map we tried to add. */ free_extent_map(em); em = existing;