Message ID: 20200617162746.3780660-1-boris@bur.io (mailing list archive)
State: New, archived
Series: [btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
>
> Under somewhat convoluted conditions, it is possible to attempt to
> release an extent_buffer that is under io, which triggers a BUG_ON in
> btrfs_release_extent_buffer_pages.
>
> This relies on a few different factors. First, extent_buffer reads done
> as readahead for searching use WAIT_NONE, so they free the local extent
> buffer reference while the io is outstanding. However, they should still
> be protected by TREE_REF. However, if the system is doing significant
> reclaim, and simultaneously heavily accessing the extent_buffers, it is
> possible for releasepage to race with two concurrent readahead attempts
> in a way that leaves TREE_REF unset when the readahead extent buffer is
> released.
>
> Essentially, if two tasks race to allocate a new extent_buffer, but the
> winner who attempts the first io is rebuffed by a page being locked
> (likely by the reclaim itself) then the loser will still go ahead with
> issuing the readahead. The loser's call to find_extent_buffer must also
> race with the reclaim task reading the extent_buffer's refcount as 1 in
> a way that allows the reclaim to re-clear the TREE_REF checked by
> find_extent_buffer.
>
> The following represents an example execution demonstrating the race:
>
> CPU0                                  CPU1                                  CPU2
> reada_for_search                      reada_for_search
>   readahead_tree_block                  readahead_tree_block
>     find_create_tree_block                find_create_tree_block
>       alloc_extent_buffer                   alloc_extent_buffer
>                                               find_extent_buffer // not found
>                                               allocates eb
>                                               lock pages
>                                               associate pages to eb
>                                               insert eb into radix tree
>                                               set TREE_REF, refs == 2
>                                               unlock pages
>                                         read_extent_buffer_pages // WAIT_NONE
>                                           not uptodate (brand new eb)
>                                           lock_page
>                                           if !trylock_page
>                                             goto unlock_exit // not an error
>                                         free_extent_buffer
>                                           release_extent_buffer
>                                             atomic_dec_and_test refs to 1
>       find_extent_buffer // found
>                                                                             try_release_extent_buffer
>                                                                               take refs_lock
>                                                                               reads refs == 1; no io
>         atomic_inc_not_zero refs to 2
>         mark_buffer_accessed
>           check_buffer_tree_ref
>           // not STALE, won't take refs_lock
>           refs == 2; TREE_REF set // no action
>     read_extent_buffer_pages // WAIT_NONE
>                                                                               clear TREE_REF
>                                                                               release_extent_buffer
>                                                                                 atomic_dec_and_test refs to 1
>                                                                               unlock_page
>       still not uptodate (CPU1 read failed on trylock_page)
>       locks pages
>       set io_pages > 0
>       submit io
>       return
>     release_extent_buffer

Small detail, missing the call to free_extent_buffer() from
readahead_tree_block(), which is what ends up calling
release_extent_buffer().

>       dec refs to 0
>       delete from radix tree
>       btrfs_release_extent_buffer_pages
>         BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by introducing some spurious delays
> and by introducing probabilistic trylock_page failures.
>
> To fix it, we apply check_tree_ref at a point where it could not
> possibly be unset by a competing task: after io_pages has been
> incremented. There is no race in write_one_eb, that we know of, but for
> consistency, apply it there too. All the codepaths that clear TREE_REF
> check for io, so they would not be able to clear it after this point.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

Btw, if you have a stack trace, it would be good to include it in the
change log for grep-ability in case someone runs into the same problem.

> ---
>  fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
>  1 file changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c59e07360083..f6758ebbb6a2 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>  	num_pages = num_extent_pages(eb);
>  	atomic_set(&eb->io_pages, num_pages);
> +	/*
> +	 * It is possible for releasepage to clear the TREE_REF bit before we
> +	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> +	 */
> +	check_buffer_tree_ref(eb);

This is a whole different case from the one described in the changelog,
as this is in the write path. Why do we need this one?

At this point the eb pages are marked with the dirty bit, and
btree_releasepage() returns 0 if the page is dirty and we don't end up
calling try_release_extent_buffer().

So I don't understand why this hunk is needed, what variant of the
problem it's trying to solve.

>
>  	/* set btree blocks beyond nritems with 0 to avoid stale content. */
>  	nritems = btrfs_header_nritems(eb);
> @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>  static void check_buffer_tree_ref(struct extent_buffer *eb)
>  {
>  	int refs;
> -	/* the ref bit is tricky. We have to make sure it is set
> -	 * if we have the buffer dirty. Otherwise the
> -	 * code to free a buffer can end up dropping a dirty
> -	 * page
> +	/*
> +	 * The TREE_REF bit is first set when the extent_buffer is added
> +	 * to the radix tree. It is also reset, if unset, when a new reference
> +	 * is created by find_extent_buffer.
>  	 *
> -	 * Once the ref bit is set, it won't go away while the
> -	 * buffer is dirty or in writeback, and it also won't
> -	 * go away while we have the reference count on the
> -	 * eb bumped.
> +	 * It is only cleared in two cases: freeing the last non-tree
> +	 * reference to the extent_buffer when its STALE bit is set or
> +	 * calling releasepage when the tree reference is the only reference.
>  	 *
> -	 * We can't just set the ref bit without bumping the
> -	 * ref on the eb because free_extent_buffer might
> -	 * see the ref bit and try to clear it. If this happens
> -	 * free_extent_buffer might end up dropping our original
> -	 * ref by mistake and freeing the page before we are able
> -	 * to add one more ref.
> +	 * In both cases, care is taken to ensure that the extent_buffer's
> +	 * pages are not under io. However, releasepage can be concurrently
> +	 * called with creating new references, which is prone to race
> +	 * conditions between the calls to check_tree_ref in those codepaths
> +	 * and clearing TREE_REF in try_release_extent_buffer.
>  	 *
> -	 * So bump the ref count first, then set the bit. If someone
> -	 * beat us to it, drop the ref we added.
> +	 * The actual lifetime of the extent_buffer in the radix tree is
> +	 * adequately protected by the refcount, but the TREE_REF bit and
> +	 * its corresponding reference are not. To protect against this
> +	 * class of races, we call check_buffer_tree_ref from the codepaths
> +	 * which trigger io after they set eb->io_pages. Note that once io is
> +	 * initiated, TREE_REF can no longer be cleared, so that is the
> +	 * moment at which any such race is best fixed.
>  	 */
>  	refs = atomic_read(&eb->refs);
>  	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
>  	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
>  	eb->read_mirror = 0;
>  	atomic_set(&eb->io_pages, num_reads);
> +	/*
> +	 * It is possible for releasepage to clear the TREE_REF bit before we
> +	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> +	 */
> +	check_buffer_tree_ref(eb);

This is the hunk that fixes the problem described in the change log,
and it looks good to me.

Thanks.

>  	for (i = 0; i < num_pages; i++) {
>  		page = eb->pages[i];
>
> --
> 2.24.1
>
On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
> >
> > [...]
> >
> > @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >  	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >  	num_pages = num_extent_pages(eb);
> >  	atomic_set(&eb->io_pages, num_pages);
> > +	/*
> > +	 * It is possible for releasepage to clear the TREE_REF bit before we
> > +	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > +	 */
> > +	check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the changelog,
> as this is in the write path. Why do we need this one?
>
> At this point the eb pages are marked with the dirty bit, and
> btree_releasepage() returns 0 if the page is dirty and we don't end up
> calling try_release_extent_buffer().

In addition to that, when write_one_eb() is called we have incremented
its ref count before and EXTENT_BUFFER_WRITEBACK is set on the eb, so
try_release_extent_buffer() should return and never call
release_extent_buffer(), since the refcount is > 1 and
extent_buffer_under_io() returns true.

> So I don't understand why this hunk is needed, what variant of the
> problem it's trying to solve.
>
> [...]
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”
On 17 Jun 2020, at 13:20, Filipe Manana wrote:

> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@bur.io> wrote:
>
>> [...]
>>
>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>  	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>>  	num_pages = num_extent_pages(eb);
>>  	atomic_set(&eb->io_pages, num_pages);
>> +	/*
>> +	 * It is possible for releasepage to clear the TREE_REF bit before we
>> +	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
>> +	 */
>> +	check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?

This was Josef’s idea, but I really like the symmetry. You set io_pages,
you do the tree_ref dance. Everyone fiddling with the writeback bit
right now correctly clears writeback after doing the atomic_dec on
io_pages, but the race is tiny and prone to getting exposed again by
shifting code around. Tree ref checks around io_pages are the most
reliable way to prevent this bug from coming back again later.

-chris
On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <clm@fb.com> wrote:
>
> [...]
>
> This was Josef’s idea, but I really like the symmetry. You set
> io_pages, you do the tree_ref dance. Everyone fiddling with the
> writeback bit right now correctly clears writeback after doing the
> atomic_dec on io_pages, but the race is tiny and prone to getting
> exposed again by shifting code around. Tree ref checks around io_pages
> are the most reliable way to prevent this bug from coming back again
> later.

Ok, but that still doesn't answer my question. Is there an actual
race/problem this hunk solves?

Before calling write_one_eb() we increment the ref on the eb and we also
call lock_extent_buffer_for_io(), which clears the dirty bit and sets
the writeback bit on the eb while holding its refs_lock spinlock.

Even if we get to try_release_extent_buffer, it calls
extent_buffer_under_io(eb) while holding the refs_lock spinlock, so at
any time it should return true, as either the dirty or the writeback bit
is set.

Is this purely a safety guard that is being introduced?

Can we at least describe in the changelog why we are adding this hunk in
the write path? All it mentions is a race between reading and releasing
pages; there's nothing mentioned about races with writeback.

Thanks.

> -chris
On 6/17/20 2:11 PM, Filipe Manana wrote:
> On Wed, Jun 17, 2020 at 6:43 PM Chris Mason <clm@fb.com> wrote:
>>
>> [...]
>
> Ok, but that still doesn't answer my question.
> Is there an actual race/problem this hunk solves?
>
> [...]
>
> Is this purely a safety guard that is being introduced?
>
> Can we at least describe in the changelog why we are adding this hunk
> in the write path?
> All it mentions is a race between reading and releasing pages, there's
> nothing mentioned about races with writeback.
>

I think maybe we make that bit a separate patch, and leave the panic fix
on its own. I suggested this because I have lofty ideas of killing the
refs_lock, and this would at least keep us consistent in our usage of
TREE_REF to save us from freeing stuff that may be currently under IO.

I'm super not happy with our reference handling coupled with
releasepage, but these are the kind of hoops we're going to have to jump
through until we have some better infrastructure in place to handle
metadata. Thanks,

Josef
On Wed, Jun 17, 2020 at 7:19 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> [...]
>
> I think maybe we make that bit a separate patch, and leave the panic
> fix on its own. I suggested this because I have lofty ideas of killing
> the refs_lock, and this would at least keep us consistent in our usage
> of TREE_REF to save us from freeing stuff that may be currently under
> IO.
>
> I'm super not happy with our reference handling coupled with
> releasepage, but these are the kind of hoops we're going to have to
> jump through until we have some better infrastructure in place to
> handle metadata. Thanks,

Ok, so it's just a safety guard then.

Yes, either a separate patch or at the very least mention why we are
adding that in the change log.

Thanks.

> Josef
Hi Boris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.8-rc1 next-20200617]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-fix-fatal-extent_buffer-readahead-vs-releasepage-race/20200618-002941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-rhel (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce (this is a W=1 build):
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>, old ones prefixed by <<):

   In file included from fs/btrfs/extent_io.c:19:
   fs/btrfs/ctree.h:2216:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
    2216 | size_t __const btrfs_get_num_csums(void);
         |        ^~~~~~~
   fs/btrfs/extent_io.c: In function 'write_one_eb':
>> fs/btrfs/extent_io.c:3934:2: error: implicit declaration of function 'check_buffer_tree_ref' [-Werror=implicit-function-declaration]
    3934 |  check_buffer_tree_ref(eb);
         |  ^~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/extent_io.c: At top level:
>> fs/btrfs/extent_io.c:5091:13: warning: conflicting types for 'check_buffer_tree_ref'
    5091 | static void check_buffer_tree_ref(struct extent_buffer *eb)
         |             ^~~~~~~~~~~~~~~~~~~~~
>> fs/btrfs/extent_io.c:5091:13: error: static declaration of 'check_buffer_tree_ref' follows non-static declaration
   fs/btrfs/extent_io.c:3934:2: note: previous implicit declaration of 'check_buffer_tree_ref' was here
    3934 |  check_buffer_tree_ref(eb);
         |  ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/check_buffer_tree_ref +3934 fs/btrfs/extent_io.c

  3915	
  3916	static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
  3917				struct writeback_control *wbc,
  3918				struct extent_page_data *epd)
  3919	{
  3920		u64 offset = eb->start;
  3921		u32 nritems;
  3922		int i, num_pages;
  3923		unsigned long start, end;
  3924		unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
  3925		int ret = 0;
  3926	
  3927		clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
  3928		num_pages = num_extent_pages(eb);
  3929		atomic_set(&eb->io_pages, num_pages);
  3930		/*
  3931		 * It is possible for releasepage to clear the TREE_REF bit before we
  3932		 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
  3933		 */
> 3934		check_buffer_tree_ref(eb);
  3935	
  3936		/* set btree blocks beyond nritems with 0 to avoid stale content. */
  3937		nritems = btrfs_header_nritems(eb);
  3938		if (btrfs_header_level(eb) > 0) {
  3939			end = btrfs_node_key_ptr_offset(nritems);
  3940	
  3941			memzero_extent_buffer(eb, end, eb->len - end);
  3942		} else {
  3943			/*
  3944			 * leaf:
  3945			 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
  3946			 */
  3947			start = btrfs_item_nr_offset(nritems);
  3948			end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
  3949			memzero_extent_buffer(eb, start, end - start);
  3950		}
  3951	
  3952		for (i = 0; i < num_pages; i++) {
  3953			struct page *p = eb->pages[i];
  3954	
  3955			clear_page_dirty_for_io(p);
  3956			set_page_writeback(p);
  3957			ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
  3958						 p, offset, PAGE_SIZE, 0,
  3959						 &epd->bio,
  3960						 end_bio_extent_buffer_writepage,
  3961						 0, 0, 0, false);
  3962			if (ret) {
  3963				set_btree_ioerr(p);
  3964				if (PageWriteback(p))
  3965					end_page_writeback(p);
  3966				if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
  3967					end_extent_buffer_writeback(eb);
  3968				ret = -EIO;
  3969				break;
  3970			}
  3971			offset += PAGE_SIZE;
  3972			update_nr_written(wbc, 1);
  3973			unlock_page(p);
  3974		}
  3975	
  3976		if (unlikely(ret)) {
  3977			for (; i < num_pages; i++) {
  3978				struct page *p = eb->pages[i];
  3979				clear_page_dirty_for_io(p);
  3980				unlock_page(p);
  3981			}
  3982		}
  3983	
  3984		return ret;
  3985	}
  3986	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb);
 	atomic_set(&eb->io_pages, num_pages);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);
 
 	/* set btree blocks beyond nritems with 0 to avoid stale content. */
 	nritems = btrfs_header_nritems(eb);
@@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
 	int refs;
-	/* the ref bit is tricky. We have to make sure it is set
-	 * if we have the buffer dirty. Otherwise the
-	 * code to free a buffer can end up dropping a dirty
-	 * page
+	/*
+	 * The TREE_REF bit is first set when the extent_buffer is added
+	 * to the radix tree. It is also reset, if unset, when a new reference
+	 * is created by find_extent_buffer.
 	 *
-	 * Once the ref bit is set, it won't go away while the
-	 * buffer is dirty or in writeback, and it also won't
-	 * go away while we have the reference count on the
-	 * eb bumped.
+	 * It is only cleared in two cases: freeing the last non-tree
+	 * reference to the extent_buffer when its STALE bit is set or
+	 * calling releasepage when the tree reference is the only reference.
 	 *
-	 * We can't just set the ref bit without bumping the
-	 * ref on the eb because free_extent_buffer might
-	 * see the ref bit and try to clear it. If this happens
-	 * free_extent_buffer might end up dropping our original
-	 * ref by mistake and freeing the page before we are able
-	 * to add one more ref.
+	 * In both cases, care is taken to ensure that the extent_buffer's
+	 * pages are not under io. However, releasepage can be concurrently
+	 * called with creating new references, which is prone to race
+	 * conditions between the calls to check_tree_ref in those codepaths
+	 * and clearing TREE_REF in try_release_extent_buffer.
 	 *
-	 * So bump the ref count first, then set the bit. If someone
-	 * beat us to it, drop the ref we added.
+	 * The actual lifetime of the extent_buffer in the radix tree is
+	 * adequately protected by the refcount, but the TREE_REF bit and
+	 * its corresponding reference are not. To protect against this
+	 * class of races, we call check_buffer_tree_ref from the codepaths
+	 * which trigger io after they set eb->io_pages. Note that once io is
+	 * initiated, TREE_REF can no longer be cleared, so that is the
+	 * moment at which any such race is best fixed.
 	 */
 	refs = atomic_read(&eb->refs);
 	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
Under somewhat convoluted conditions, it is possible to attempt to release an extent_buffer that is under io, which triggers a BUG_ON in btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done as readahead for searching use WAIT_NONE, so they free the local extent buffer reference while the io is outstanding. They should still be protected by TREE_REF; however, if the system is doing significant reclaim and simultaneously accessing the extent_buffers heavily, it is possible for releasepage to race with two concurrent readahead attempts in a way that leaves TREE_REF unset when the readahead extent buffer is released.

Essentially, if two tasks race to allocate a new extent_buffer, but the winner, which attempts the first io, is rebuffed by a page being locked (likely by the reclaim itself), then the loser will still go ahead with issuing the readahead. The loser's call to find_extent_buffer must also race with the reclaim task reading the extent_buffer's refcount as 1, in a way that allows the reclaim to re-clear the TREE_REF bit checked by find_extent_buffer.
The following represents an example execution demonstrating the race:

CPU0                                    CPU1                                    CPU2
reada_for_search                        reada_for_search
  readahead_tree_block                    readahead_tree_block
    find_create_tree_block                  find_create_tree_block
      alloc_extent_buffer                     alloc_extent_buffer
                                                find_extent_buffer // not found
                                                allocates eb
                                                lock pages
                                                associate pages to eb
                                                insert eb into radix tree
                                                set TREE_REF, refs == 2
                                                unlock pages
                                              read_extent_buffer_pages // WAIT_NONE
                                                not uptodate (brand new eb)
                                                lock_page
                                                if !trylock_page
                                                  goto unlock_exit // not an error
                                              free_extent_buffer
                                                release_extent_buffer
                                                  atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                  try_release_extent_buffer
                                                                                    take refs_lock
                                                                                    reads refs == 1; no io
          atomic_inc_not_zero refs to 2
        mark_extent_buffer_accessed
          check_buffer_tree_ref
            // not STALE, won't take refs_lock
            refs == 2; TREE_REF set // no action
      read_extent_buffer_pages // WAIT_NONE
                                                                                    clear TREE_REF
                                                                                    release_extent_buffer
                                                                                      atomic_dec_and_test refs to 1
        unlock_page
        still not uptodate (CPU1 read failed on trylock_page)
        locks pages
        set io_pages > 0
        submit io
        return
  free_extent_buffer
    release_extent_buffer
      dec refs to 0
      delete from radix tree
      btrfs_release_extent_buffer_pages
        BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to reproduce it in a test environment by introducing some spurious delays and by introducing probabilistic trylock_page failures.

To fix it, we apply check_buffer_tree_ref at a point where it could not possibly be unset by a competing task: after io_pages has been incremented. There is no race in write_one_eb that we know of, but for consistency, apply it there too. All the codepaths that clear TREE_REF check for io, so they would not be able to clear it after this point.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)