diff mbox

Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

Message ID

262a1e171d091626edbd23c637cb138ba9d84ed8.1478733376.git.osandov@fb.com (mailing list archive)

State

Accepted

Headers

show

From: Omar Sandoval <osandov@osandov.com>
To: linux-btrfs@vger.kernel.org
Cc: kernel-team@fb.com
Subject: [PATCH] Btrfs: deal with existing encompassing extent map in
	btrfs_get_extent()
Date: Wed,  9 Nov 2016 15:26:50 -0800
Message-Id: <262a1e171d091626edbd23c637cb138ba9d84ed8.1478733376.git.osandov@fb.com>
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk

Commit Message

Omar Sandoval Nov. 9, 2016, 11:26 p.m. UTC

From: Omar Sandoval <osandov@fb.com>

My QEMU VM was seeing inexplicable I/O errors that I tracked down to
errors coming from the qcow2 virtual drive in the host system. The qcow2
file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
Every once in awhile, pread() or pwrite() would return EEXIST, which
makes no sense. This turned out to be a bug in btrfs_get_extent().

Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
two threads race on adding the same extent map to an inode's extent map
tree. However, if the added em is merged with an adjacent em in the
extent tree, then we'll end up with an existing extent that is not
identical to but instead encompasses the extent we tried to add. When we
call merge_extent_mapping() to find the nonoverlapping part of the new
em, the arithmetic overflows because there is no such thing. We then end
up trying to add a bogus em to the em_tree, which results in a EEXIST
that can bubble all the way up to userspace.

Fix it by extending the identical extent map special case.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Applies to 4.9-rc4.

Here [1] is a reproducer for this bug that doesn't involve firing up a
QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
to debug this on my laptop without compiling a custom kernel and
rebooting just to add printks [3].

1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
2: https://github.com/iovisor/bcc
3: https://gist.github.com/osandov/eb1db868ce10c3af9e00b90f3a65bf9f

 fs/btrfs/inode.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

David Sterba Nov. 10, 2016, 3:06 p.m. UTC | #1

On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> errors coming from the qcow2 virtual drive in the host system. The qcow2
> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> Every once in awhile, pread() or pwrite() would return EEXIST, which
> makes no sense. This turned out to be a bug in btrfs_get_extent().
> 
> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> two threads race on adding the same extent map to an inode's extent map
> tree. However, if the added em is merged with an adjacent em in the
> extent tree, then we'll end up with an existing extent that is not
> identical to but instead encompasses the extent we tried to add. When we
> call merge_extent_mapping() to find the nonoverlapping part of the new
> em, the arithmetic overflows because there is no such thing. We then end
> up trying to add a bogus em to the em_tree, which results in a EEXIST
> that can bubble all the way up to userspace.

Is this possibly the same bug that Liu Bo fixes in
https://patchwork.kernel.org/patch/9413129/ ?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Holger Hoffstätte Nov. 10, 2016, 3:11 p.m. UTC | #2

On 11/10/16 00:26, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> errors coming from the qcow2 virtual drive in the host system. The qcow2
> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> Every once in awhile, pread() or pwrite() would return EEXIST, which
> makes no sense. This turned out to be a bug in btrfs_get_extent().
> 
> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> two threads race on adding the same extent map to an inode's extent map
> tree. However, if the added em is merged with an adjacent em in the
> extent tree, then we'll end up with an existing extent that is not
> identical to but instead encompasses the extent we tried to add. When we
> call merge_extent_mapping() to find the nonoverlapping part of the new
> em, the arithmetic overflows because there is no such thing. We then end
> up trying to add a bogus em to the em_tree, which results in a EEXIST
> that can bubble all the way up to userspace.
> 
> Fix it by extending the identical extent map special case.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Applies to 4.9-rc4.
> 
> Here [1] is a reproducer for this bug that doesn't involve firing up a
> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
> to debug this on my laptop without compiling a custom kernel and
> rebooting just to add printks [3].
> 
> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b

I can't really make this reproducer fail. It builds and runs fine, but just
exits with no messages (other than the one about drop_caches in dmesg).
It creates the 1MB file and always returns 0. Ideas?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Omar Sandoval Nov. 10, 2016, 3:37 p.m. UTC | #3

On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
> On 11/10/16 00:26, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > 
> > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > two threads race on adding the same extent map to an inode's extent map
> > tree. However, if the added em is merged with an adjacent em in the
> > extent tree, then we'll end up with an existing extent that is not
> > identical to but instead encompasses the extent we tried to add. When we
> > call merge_extent_mapping() to find the nonoverlapping part of the new
> > em, the arithmetic overflows because there is no such thing. We then end
> > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > that can bubble all the way up to userspace.
> > 
> > Fix it by extending the identical extent map special case.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Applies to 4.9-rc4.
> > 
> > Here [1] is a reproducer for this bug that doesn't involve firing up a
> > QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
> > to debug this on my laptop without compiling a custom kernel and
> > rebooting just to add printks [3].
> > 
> > 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
> 
> I can't really make this reproducer fail. It builds and runs fine, but just
> exits with no messages (other than the one about drop_caches in dmesg).
> It creates the 1MB file and always returns 0. Ideas?
> 
> -h

It's a race condition, so it doesn't happen 100% of the time. I imagine
it depends on the storage speed, as well. On my laptop, which is
dm-crypt on top of an SSD, it works about 50% of the time. Could you
just try running it 100 times or something and see if it fails?

Holger Hoffstätte Nov. 10, 2016, 3:37 p.m. UTC | #4

On 11/10/16 16:06, David Sterba wrote:
> On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>> [snip]
>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
>> two threads race on adding the same extent map to an inode's extent map
>> tree. However, if the added em is merged with an adjacent em in the
>> extent tree, then we'll end up with an existing extent that is not
>> identical to but instead encompasses the extent we tried to add. When we
>> call merge_extent_mapping() to find the nonoverlapping part of the new
>> em, the arithmetic overflows because there is no such thing. We then end
>> up trying to add a bogus em to the em_tree, which results in a EEXIST
>> that can bubble all the way up to userspace.
> 
> Is this possibly the same bug that Liu Bo fixes in
> https://patchwork.kernel.org/patch/9413129/ ?

Seem similar, but I can't reproduce without that patch either..

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Omar Sandoval Nov. 10, 2016, 3:42 p.m. UTC | #5

On Thu, Nov 10, 2016 at 04:06:51PM +0100, David Sterba wrote:
> On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > 
> > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > two threads race on adding the same extent map to an inode's extent map
> > tree. However, if the added em is merged with an adjacent em in the
> > extent tree, then we'll end up with an existing extent that is not
> > identical to but instead encompasses the extent we tried to add. When we
> > call merge_extent_mapping() to find the nonoverlapping part of the new
> > em, the arithmetic overflows because there is no such thing. We then end
> > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > that can bubble all the way up to userspace.
> 
> Is this possibly the same bug that Liu Bo fixes in
> https://patchwork.kernel.org/patch/9413129/ ?

This looks different. In my case, btrfs_get_extent() fails before we'd
even get that far into btrfs_get_blocks_direct().

Holger Hoffstätte Nov. 10, 2016, 4:01 p.m. UTC | #6

On 11/10/16 16:37, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
>> On 11/10/16 00:26, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
>>> errors coming from the qcow2 virtual drive in the host system. The qcow2
>>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
>>> Every once in awhile, pread() or pwrite() would return EEXIST, which
>>> makes no sense. This turned out to be a bug in btrfs_get_extent().
>>>
>>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
>>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
>>> two threads race on adding the same extent map to an inode's extent map
>>> tree. However, if the added em is merged with an adjacent em in the
>>> extent tree, then we'll end up with an existing extent that is not
>>> identical to but instead encompasses the extent we tried to add. When we
>>> call merge_extent_mapping() to find the nonoverlapping part of the new
>>> em, the arithmetic overflows because there is no such thing. We then end
>>> up trying to add a bogus em to the em_tree, which results in a EEXIST
>>> that can bubble all the way up to userspace.
>>>
>>> Fix it by extending the identical extent map special case.
>>>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> Applies to 4.9-rc4.
>>>
>>> Here [1] is a reproducer for this bug that doesn't involve firing up a
>>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
>>> to debug this on my laptop without compiling a custom kernel and
>>> rebooting just to add printks [3].
>>>
>>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
>>
>> I can't really make this reproducer fail. It builds and runs fine, but just
>> exits with no messages (other than the one about drop_caches in dmesg).
>> It creates the 1MB file and always returns 0. Ideas?
>>
>> -h
> 
> It's a race condition, so it doesn't happen 100% of the time. I imagine
> it depends on the storage speed, as well. On my laptop, which is
> dm-crypt on top of an SSD, it works about 50% of the time. Could you
> just try running it 100 times or something and see if it fails?

$for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail"

..couple of thousand runs without problem, only lots of fallocating and
cache dropping.

Oh well, I tried. :)

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Omar Sandoval Nov. 10, 2016, 4:20 p.m. UTC | #7

On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote:
> On 11/10/16 16:37, Omar Sandoval wrote:
> > On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
> >> On 11/10/16 00:26, Omar Sandoval wrote:
> >>> From: Omar Sandoval <osandov@fb.com>
> >>>
> >>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> >>> errors coming from the qcow2 virtual drive in the host system. The qcow2
> >>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> >>> Every once in awhile, pread() or pwrite() would return EEXIST, which
> >>> makes no sense. This turned out to be a bug in btrfs_get_extent().
> >>>
> >>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> >>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> >>> two threads race on adding the same extent map to an inode's extent map
> >>> tree. However, if the added em is merged with an adjacent em in the
> >>> extent tree, then we'll end up with an existing extent that is not
> >>> identical to but instead encompasses the extent we tried to add. When we
> >>> call merge_extent_mapping() to find the nonoverlapping part of the new
> >>> em, the arithmetic overflows because there is no such thing. We then end
> >>> up trying to add a bogus em to the em_tree, which results in a EEXIST
> >>> that can bubble all the way up to userspace.
> >>>
> >>> Fix it by extending the identical extent map special case.
> >>>
> >>> Signed-off-by: Omar Sandoval <osandov@fb.com>
> >>> ---
> >>> Applies to 4.9-rc4.
> >>>
> >>> Here [1] is a reproducer for this bug that doesn't involve firing up a
> >>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
> >>> to debug this on my laptop without compiling a custom kernel and
> >>> rebooting just to add printks [3].
> >>>
> >>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
> >>
> >> I can't really make this reproducer fail. It builds and runs fine, but just
> >> exits with no messages (other than the one about drop_caches in dmesg).
> >> It creates the 1MB file and always returns 0. Ideas?
> >>
> >> -h
> > 
> > It's a race condition, so it doesn't happen 100% of the time. I imagine
> > it depends on the storage speed, as well. On my laptop, which is
> > dm-crypt on top of an SSD, it works about 50% of the time. Could you
> > just try running it 100 times or something and see if it fails?
> 
> $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail"
> 
> ..couple of thousand runs without problem, only lots of fallocating and
> cache dropping.
> 
> Oh well, I tried. :)
> 
> -h

Just out of curiousity, what kind of disk were you trying this on? I've
only been able to trigger it on my laptop and a VM running on my laptop.

Holger Hoffstätte Nov. 10, 2016, 4:31 p.m. UTC | #8

On 11/10/16 17:20, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote:
>> On 11/10/16 16:37, Omar Sandoval wrote:
>>> On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote:
>>>> On 11/10/16 00:26, Omar Sandoval wrote:
>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>
>>>>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
>>>>> errors coming from the qcow2 virtual drive in the host system. The qcow2
>>>>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
>>>>> Every once in awhile, pread() or pwrite() would return EEXIST, which
>>>>> makes no sense. This turned out to be a bug in btrfs_get_extent().
>>>>>
>>>>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
>>>>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
>>>>> two threads race on adding the same extent map to an inode's extent map
>>>>> tree. However, if the added em is merged with an adjacent em in the
>>>>> extent tree, then we'll end up with an existing extent that is not
>>>>> identical to but instead encompasses the extent we tried to add. When we
>>>>> call merge_extent_mapping() to find the nonoverlapping part of the new
>>>>> em, the arithmetic overflows because there is no such thing. We then end
>>>>> up trying to add a bogus em to the em_tree, which results in a EEXIST
>>>>> that can bubble all the way up to userspace.
>>>>>
>>>>> Fix it by extending the identical extent map special case.
>>>>>
>>>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>>>> ---
>>>>> Applies to 4.9-rc4.
>>>>>
>>>>> Here [1] is a reproducer for this bug that doesn't involve firing up a
>>>>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
>>>>> to debug this on my laptop without compiling a custom kernel and
>>>>> rebooting just to add printks [3].
>>>>>
>>>>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
>>>>
>>>> I can't really make this reproducer fail. It builds and runs fine, but just
>>>> exits with no messages (other than the one about drop_caches in dmesg).
>>>> It creates the 1MB file and always returns 0. Ideas?
>>>>
>>>> -h
>>>
>>> It's a race condition, so it doesn't happen 100% of the time. I imagine
>>> it depends on the storage speed, as well. On my laptop, which is
>>> dm-crypt on top of an SSD, it works about 50% of the time. Could you
>>> just try running it 100 times or something and see if it fails?
>>
>> $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail"
>>
>> ..couple of thousand runs without problem, only lots of fallocating and
>> cache dropping.
>>
>> Oh well, I tried. :)
>>
>> -h
> 
> Just out of curiousity, what kind of disk were you trying this on? I've
> only been able to trigger it on my laptop and a VM running on my laptop.

Tried on both an SSD and an old slowpoke 2.5" rotational disk on USB2.
But I also have a ton of other patches and a custom CPU scheduler, so
everything it likely my fault anyway. Don't sweat it. :)

From what I can tell the explanation of the problem and the change itself
make sense. Would have been nice to be able to repro.

cheers,

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo Nov. 10, 2016, 8:01 p.m. UTC | #9

On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> errors coming from the qcow2 virtual drive in the host system. The qcow2
> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> Every once in awhile, pread() or pwrite() would return EEXIST, which
> makes no sense. This turned out to be a bug in btrfs_get_extent().
> 
> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> two threads race on adding the same extent map to an inode's extent map
> tree. However, if the added em is merged with an adjacent em in the
> extent tree, then we'll end up with an existing extent that is not
> identical to but instead encompasses the extent we tried to add. When we
> call merge_extent_mapping() to find the nonoverlapping part of the new
> em, the arithmetic overflows because there is no such thing. We then end
> up trying to add a bogus em to the em_tree, which results in a EEXIST
> that can bubble all the way up to userspace.

I don't get how this could happen(even after reading Commit
8dff9c853410), btrfs_get_extent in direct_IO is protected by
lock_extent_direct, the assumption is that a racy thread should be
blocked by lock_extent_direct and when it gets the lock, it finds the
just-inserted em when going into btrfs_get_extent if its offset is
within [em->start, extent_map_end(em)].

I think we may also need to figure out why the above doesn't work as
expected besides fixing another special case.

Thanks,

-liubo

> 
> Fix it by extending the identical extent map special case.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Applies to 4.9-rc4.
> 
> Here [1] is a reproducer for this bug that doesn't involve firing up a
> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible
> to debug this on my laptop without compiling a custom kernel and
> rebooting just to add printks [3].
> 
> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b
> 2: https://github.com/iovisor/bcc
> 3: https://gist.github.com/osandov/eb1db868ce10c3af9e00b90f3a65bf9f
> 
>  fs/btrfs/inode.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2b790bd..e5cf589 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7049,11 +7049,11 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
>  		 * extent causing the -EEXIST.
>  		 */
>  		if (existing->start == em->start &&
> -		    extent_map_end(existing) == extent_map_end(em) &&
> +		    extent_map_end(existing) >= extent_map_end(em) &&
>  		    em->block_start == existing->block_start) {
>  			/*
> -			 * these two extents are the same, it happens
> -			 * with inlines especially
> +			 * The existing extent map already encompasses the
> +			 * entire extent map we tried to add.
>  			 */
>  			free_extent_map(em);
>  			em = existing;
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Omar Sandoval Nov. 10, 2016, 8:09 p.m. UTC | #10

On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > 
> > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > two threads race on adding the same extent map to an inode's extent map
> > tree. However, if the added em is merged with an adjacent em in the
> > extent tree, then we'll end up with an existing extent that is not
> > identical to but instead encompasses the extent we tried to add. When we
> > call merge_extent_mapping() to find the nonoverlapping part of the new
> > em, the arithmetic overflows because there is no such thing. We then end
> > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > that can bubble all the way up to userspace.
> 
> I don't get how this could happen(even after reading Commit
> 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> lock_extent_direct, the assumption is that a racy thread should be
> blocked by lock_extent_direct and when it gets the lock, it finds the
> just-inserted em when going into btrfs_get_extent if its offset is
> within [em->start, extent_map_end(em)].
> 
> I think we may also need to figure out why the above doesn't work as
> expected besides fixing another special case.
> 
> Thanks,
> 
> -liubo

lock_extent_direct() only protects the range you're doing I/O into, not
the entire extent. If two threads are doing two non-overlapping reads in
the same extent, then you can get this race.

Omar Sandoval Nov. 10, 2016, 8:24 p.m. UTC | #11

On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > 
> > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > two threads race on adding the same extent map to an inode's extent map
> > > tree. However, if the added em is merged with an adjacent em in the
> > > extent tree, then we'll end up with an existing extent that is not
> > > identical to but instead encompasses the extent we tried to add. When we
> > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > em, the arithmetic overflows because there is no such thing. We then end
> > > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > > that can bubble all the way up to userspace.
> > 
> > I don't get how this could happen(even after reading Commit
> > 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> > lock_extent_direct, the assumption is that a racy thread should be
> > blocked by lock_extent_direct and when it gets the lock, it finds the
> > just-inserted em when going into btrfs_get_extent if its offset is
> > within [em->start, extent_map_end(em)].
> > 
> > I think we may also need to figure out why the above doesn't work as
> > expected besides fixing another special case.
> > 
> > Thanks,
> > 
> > -liubo
> 
> lock_extent_direct() only protects the range you're doing I/O into, not
> the entire extent. If two threads are doing two non-overlapping reads in
> the same extent, then you can get this race.

More concretely, assume the extent tree on disk has:

+-------------------------+-------------------------------+
|start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
+-------------------------+-------------------------------+

And the extent map tree in memory has a single em cached for the second
extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do
direct I/O reads:

Thread 1                               | Thread 2
---------------------------------------+-------------------------------
pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len4096)
  lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
  reads extent from B-tree             |   reads extent from B-tree
                                       |   write_lock(&em_tree->lock)
				       |   add_extent_mapping(start=0, len=8192, bytenr=0)
				       |     try_merge_map()
				       |     em_tree now has {start=0, len=16384, bytenr=0}
				       |   write_unlock(&em_tree->lock)
write_lock(&em_tree->lock)             |
add_extent_mapping(start=0, len=8192,  |
                   bytenr=0) = -EEXIST |
search_extent_mapping() = {start=0,    |
                           len=16384,  |
			   bytenr=0}   |
merge_extent_mapping() does bogus math |
and overflows, returns EEXIST          |

Liu Bo Nov. 10, 2016, 10:38 p.m. UTC | #12

On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > > 
> > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > > two threads race on adding the same extent map to an inode's extent map
> > > > tree. However, if the added em is merged with an adjacent em in the
> > > > extent tree, then we'll end up with an existing extent that is not
> > > > identical to but instead encompasses the extent we tried to add. When we
> > > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > > em, the arithmetic overflows because there is no such thing. We then end
> > > > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > > > that can bubble all the way up to userspace.
> > > 
> > > I don't get how this could happen(even after reading Commit
> > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> > > lock_extent_direct, the assumption is that a racy thread should be
> > > blocked by lock_extent_direct and when it gets the lock, it finds the
> > > just-inserted em when going into btrfs_get_extent if its offset is
> > > within [em->start, extent_map_end(em)].
> > > 
> > > I think we may also need to figure out why the above doesn't work as
> > > expected besides fixing another special case.
> > > 
> > > Thanks,
> > > 
> > > -liubo
> > 
> > lock_extent_direct() only protects the range you're doing I/O into, not
> > the entire extent. If two threads are doing two non-overlapping reads in
> > the same extent, then you can get this race.
> 
> More concretely, assume the extent tree on disk has:
> 
> +-------------------------+-------------------------------+
> |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
> +-------------------------+-------------------------------+
> 
> And the extent map tree in memory has a single em cached for the second
> extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do
> direct I/O reads:
> 
> Thread 1                               | Thread 2
> ---------------------------------------+-------------------------------
> pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
> lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
> btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len4096)
>   lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
>   reads extent from B-tree             |   reads extent from B-tree
>                                        |   write_lock(&em_tree->lock)
> 				       |   add_extent_mapping(start=0, len=8192, bytenr=0)
> 				       |     try_merge_map()
> 				       |     em_tree now has {start=0, len=16384, bytenr=0}
> 				       |   write_unlock(&em_tree->lock)
> write_lock(&em_tree->lock)             |
> add_extent_mapping(start=0, len=8192,  |
>                    bytenr=0) = -EEXIST |
> search_extent_mapping() = {start=0,    |
>                            len=16384,  |
> 			   bytenr=0}   |
> merge_extent_mapping() does bogus math |
> and overflows, returns EEXIST          |

Yeah, so much fun.

The problem is that we lock and request [0, 4096], but we insert a em of
[0, 8192] instead.  So if we insert a [0, 4096] em, then we can make
sure that the em returned by btrfs_get_extent is protected from race by
the range of lock_extent_direct.

I'll give it a shot and do some testing.

For this patch,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Would you please make a reproducer for fstests?

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Omar Sandoval Nov. 10, 2016, 10:45 p.m. UTC | #13

On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote:
> On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote:
> > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > 
> > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > > > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > > > 
> > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > > > two threads race on adding the same extent map to an inode's extent map
> > > > > tree. However, if the added em is merged with an adjacent em in the
> > > > > extent tree, then we'll end up with an existing extent that is not
> > > > > identical to but instead encompasses the extent we tried to add. When we
> > > > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > > > em, the arithmetic overflows because there is no such thing. We then end
> > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > > > > that can bubble all the way up to userspace.
> > > > 
> > > > I don't get how this could happen(even after reading Commit
> > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> > > > lock_extent_direct, the assumption is that a racy thread should be
> > > > blocked by lock_extent_direct and when it gets the lock, it finds the
> > > > just-inserted em when going into btrfs_get_extent if its offset is
> > > > within [em->start, extent_map_end(em)].
> > > > 
> > > > I think we may also need to figure out why the above doesn't work as
> > > > expected besides fixing another special case.
> > > > 
> > > > Thanks,
> > > > 
> > > > -liubo
> > > 
> > > lock_extent_direct() only protects the range you're doing I/O into, not
> > > the entire extent. If two threads are doing two non-overlapping reads in
> > > the same extent, then you can get this race.
> > 
> > More concretely, assume the extent tree on disk has:
> > 
> > +-------------------------+-------------------------------+
> > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
> > +-------------------------+-------------------------------+
> > 
> > And the extent map tree in memory has a single em cached for the second
> > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do
> > direct I/O reads:
> > 
> > Thread 1                               | Thread 2
> > ---------------------------------------+-------------------------------
> > pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
> > lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
> > btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len4096)
> >   lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
> >   reads extent from B-tree             |   reads extent from B-tree
> >                                        |   write_lock(&em_tree->lock)
> > 				       |   add_extent_mapping(start=0, len=8192, bytenr=0)
> > 				       |     try_merge_map()
> > 				       |     em_tree now has {start=0, len=16384, bytenr=0}
> > 				       |   write_unlock(&em_tree->lock)
> > write_lock(&em_tree->lock)             |
> > add_extent_mapping(start=0, len=8192,  |
> >                    bytenr=0) = -EEXIST |
> > search_extent_mapping() = {start=0,    |
> >                            len=16384,  |
> > 			   bytenr=0}   |
> > merge_extent_mapping() does bogus math |
> > and overflows, returns EEXIST          |
> 
> Yeah, so much fun.
> 
> The problem is that we lock and request [0, 4096], but we insert a em of
> [0, 8192] instead.  So if we insert a [0, 4096] em, then we can make
> sure that the em returned by btrfs_get_extent is protected from race by
> the range of lock_extent_direct.
> 
> I'll give it a shot and do some testing.
> 
> For this patch,
> 
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thank you!

> Would you please make a reproducer for fstests?

Sure. Trying to trigger this with xfs_io never works because it's such a
narrow race window, but I'll send in my reproducer and see what the
xfstests guys think about adding new binaries.

> Thanks,
> 
> -liubo

Liu Bo Nov. 11, 2016, 12:36 a.m. UTC | #14

On Thu, Nov 10, 2016 at 04:06:51PM +0100, David Sterba wrote:
> On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > 
> > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > two threads race on adding the same extent map to an inode's extent map
> > tree. However, if the added em is merged with an adjacent em in the
> > extent tree, then we'll end up with an existing extent that is not
> > identical to but instead encompasses the extent we tried to add. When we
> > call merge_extent_mapping() to find the nonoverlapping part of the new
> > em, the arithmetic overflows because there is no such thing. We then end
> > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > that can bubble all the way up to userspace.
> 
> Is this possibly the same bug that Liu Bo fixes in
> https://patchwork.kernel.org/patch/9413129/ ?

It's not, Omar is fixing a bug for read, my patch was to save space for
during DIO write.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Omar Sandoval Nov. 17, 2016, 12:32 a.m. UTC | #15

On Thu, Nov 10, 2016 at 02:45:36PM -0800, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote:
> > On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote:
> > > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> > > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > 
> > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > > > > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > > > > 
> > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > > > > two threads race on adding the same extent map to an inode's extent map
> > > > > > tree. However, if the added em is merged with an adjacent em in the
> > > > > > extent tree, then we'll end up with an existing extent that is not
> > > > > > identical to but instead encompasses the extent we tried to add. When we
> > > > > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > > > > em, the arithmetic overflows because there is no such thing. We then end
> > > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > > > > > that can bubble all the way up to userspace.
> > > > > 
> > > > > I don't get how this could happen(even after reading Commit
> > > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> > > > > lock_extent_direct, the assumption is that a racy thread should be
> > > > > blocked by lock_extent_direct and when it gets the lock, it finds the
> > > > > just-inserted em when going into btrfs_get_extent if its offset is
> > > > > within [em->start, extent_map_end(em)].
> > > > > 
> > > > > I think we may also need to figure out why the above doesn't work as
> > > > > expected besides fixing another special case.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > -liubo
> > > > 
> > > > lock_extent_direct() only protects the range you're doing I/O into, not
> > > > the entire extent. If two threads are doing two non-overlapping reads in
> > > > the same extent, then you can get this race.
> > > 
> > > More concretely, assume the extent tree on disk has:
> > > 
> > > +-------------------------+-------------------------------+
> > > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
> > > +-------------------------+-------------------------------+
> > > 
> > > And the extent map tree in memory has a single em cached for the second
> > > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do
> > > direct I/O reads:
> > > 
> > > Thread 1                               | Thread 2
> > > ---------------------------------------+-------------------------------
> > > pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
> > > lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
> > > btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len4096)
> > >   lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
> > >   reads extent from B-tree             |   reads extent from B-tree
> > >                                        |   write_lock(&em_tree->lock)
> > > 				       |   add_extent_mapping(start=0, len=8192, bytenr=0)
> > > 				       |     try_merge_map()
> > > 				       |     em_tree now has {start=0, len=16384, bytenr=0}
> > > 				       |   write_unlock(&em_tree->lock)
> > > write_lock(&em_tree->lock)             |
> > > add_extent_mapping(start=0, len=8192,  |
> > >                    bytenr=0) = -EEXIST |
> > > search_extent_mapping() = {start=0,    |
> > >                            len=16384,  |
> > > 			   bytenr=0}   |
> > > merge_extent_mapping() does bogus math |
> > > and overflows, returns EEXIST          |
> > 
> > Yeah, so much fun.
> > 
> > The problem is that we lock and request [0, 4096], but we insert a em of
> > [0, 8192] instead.  So if we insert a [0, 4096] em, then we can make
> > sure that the em returned by btrfs_get_extent is protected from race by
> > the range of lock_extent_direct.
> > 
> > I'll give it a shot and do some testing.
> > 
> > For this patch,
> > 
> > Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> 
> Thank you!
> 
> > Would you please make a reproducer for fstests?
> 
> Sure. Trying to trigger this with xfs_io never works because it's such a
> narrow race window, but I'll send in my reproducer and see what the
> xfstests guys think about adding new binaries.
> 

xfstests patch is here: http://marc.info/?l=fstests&m=147934258732500&w=2

diff mbox

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2b790bd..e5cf589 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7049,11 +7049,11 @@  struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
 		 * extent causing the -EEXIST.
 		 */
 		if (existing->start == em->start &&
-		    extent_map_end(existing) == extent_map_end(em) &&
+		    extent_map_end(existing) >= extent_map_end(em) &&
 		    em->block_start == existing->block_start) {
 			/*
-			 * these two extents are the same, it happens
-			 * with inlines especially
+			 * The existing extent map already encompasses the
+			 * entire extent map we tried to add.
 			 */
 			free_extent_map(em);
 			em = existing;