Message ID | 20181123182540.7206-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | [v4] Btrfs: fix deadlock with memory reclaim during scrub | expand |
On 23.11.18 г. 20:25 ч., fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When a transaction commit starts, it attempts to pause scrub and it blocks > until the scrub is paused. So while the transaction is blocked waiting for > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub, > otherwise we risk getting into a deadlock with reclaim. > > Checking for scrub pause requests is done early at the beginning of the > while loop of scrub_stripe() and later in the loop, scrub_extent() and > scrub_raid56_parity() are called, which in turn call scrub_pages() and > scrub_pages_for_parity() respectively. These last two functions do memory > allocations using GFP_KERNEL. Same problem could happen while scrubbing > the super blocks, since it calls scrub_pages(). > > So make sure GFP_NOFS is used for the memory allocations because at any > time a scrub pause request can happen from another task that started to > commit a transaction. > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> > --- > > V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing > requests migth happen just after we checked for them. > > V3: Use memalloc_nofs_save() just like V1 did. > > V4: Similar problem happened for raid56, which was previously missed, so > deal with it as well as the case for scrub_supers(). > > fs/btrfs/scrub.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index 3be1456b5116..e08b7502d1f0 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > struct scrub_ctx *sctx; > int ret; > struct btrfs_device *dev; > + unsigned int nofs_flag; > > if (btrfs_fs_closing(fs_info)) > return -EINVAL; > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > atomic_inc(&fs_info->scrubs_running); > mutex_unlock(&fs_info->scrub_lock); > > + /* > + * In order to avoid deadlock with reclaim when there is a transaction > + * trying to pause scrub, make sure we use GFP_NOFS for all the > + * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity() > + * invoked by our callees. The pausing request is done when the > + * transaction commit starts, and it blocks the transaction until scrub > + * is paused (done at specific points at scrub_stripe() or right above > + * before incrementing fs_info->scrubs_running). > + */ > + nofs_flag = memalloc_nofs_save(); > if (!is_dev_replace) { > /* > * by holding device list mutex, we can > @@ -3895,6 +3906,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > if (!ret) > ret = scrub_enumerate_chunks(sctx, dev, start, end, > is_dev_replace); > + memalloc_nofs_restore(nofs_flag); > > wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); > atomic_dec(&fs_info->scrubs_running); >
On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When a transaction commit starts, it attempts to pause scrub and it blocks > until the scrub is paused. So while the transaction is blocked waiting for > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub, > otherwise we risk getting into a deadlock with reclaim. > > Checking for scrub pause requests is done early at the beginning of the > while loop of scrub_stripe() and later in the loop, scrub_extent() and > scrub_raid56_parity() are called, which in turn call scrub_pages() and > scrub_pages_for_parity() respectively. These last two functions do memory > allocations using GFP_KERNEL. Same problem could happen while scrubbing > the super blocks, since it calls scrub_pages(). > > So make sure GFP_NOFS is used for the memory allocations because at any > time a scrub pause request can happen from another task that started to > commit a transaction. > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path") > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > > V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing > requests migth happen just after we checked for them. > > V3: Use memalloc_nofs_save() just like V1 did. > > V4: Similar problem happened for raid56, which was previously missed, so > deal with it as well as the case for scrub_supers(). Enclosing the whole scrub to 'nofs' seems like the best option and future proof. What I missed in 58c4e173847a was the "don't hold big lock under GFP_KERNEL allocation" pattern. > fs/btrfs/scrub.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index 3be1456b5116..e08b7502d1f0 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > struct scrub_ctx *sctx; > int ret; > struct btrfs_device *dev; > + unsigned int nofs_flag; > > if (btrfs_fs_closing(fs_info)) > return -EINVAL; > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > atomic_inc(&fs_info->scrubs_running); > mutex_unlock(&fs_info->scrub_lock); > > + /* > + * In order to avoid deadlock with reclaim when there is a transaction > + * trying to pause scrub, make sure we use GFP_NOFS for all the > + * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity() > + * invoked by our callees. The pausing request is done when the > + * transaction commit starts, and it blocks the transaction until scrub > + * is paused (done at specific points at scrub_stripe() or right above > + * before incrementing fs_info->scrubs_running). This hilights why there's perhaps no point in trying to make the nofs section smaller, handling all the interactions between scrub and transaction would be too complex. Reviewed-by: David Sterba <dsterba@suse.com>
On Mon, Nov 26, 2018 at 6:17 PM David Sterba <dsterba@suse.cz> wrote: > > On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > When a transaction commit starts, it attempts to pause scrub and it blocks > > until the scrub is paused. So while the transaction is blocked waiting for > > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub, > > otherwise we risk getting into a deadlock with reclaim. > > > > Checking for scrub pause requests is done early at the beginning of the > > while loop of scrub_stripe() and later in the loop, scrub_extent() and > > scrub_raid56_parity() are called, which in turn call scrub_pages() and > > scrub_pages_for_parity() respectively. These last two functions do memory > > allocations using GFP_KERNEL. Same problem could happen while scrubbing > > the super blocks, since it calls scrub_pages(). > > > > So make sure GFP_NOFS is used for the memory allocations because at any > > time a scrub pause request can happen from another task that started to > > commit a transaction. > > > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > > > V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing > > requests migth happen just after we checked for them. > > > > V3: Use memalloc_nofs_save() just like V1 did. > > > > V4: Similar problem happened for raid56, which was previously missed, so > > deal with it as well as the case for scrub_supers(). > > Enclosing the whole scrub to 'nofs' seems like the best option and > future proof. What I missed in 58c4e173847a was the "don't hold big lock > under GFP_KERNEL allocation" pattern. > > > fs/btrfs/scrub.c | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > > index 3be1456b5116..e08b7502d1f0 100644 > > --- a/fs/btrfs/scrub.c > > +++ b/fs/btrfs/scrub.c > > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > > struct scrub_ctx *sctx; > > int ret; > > struct btrfs_device *dev; > > + unsigned int nofs_flag; > > > > if (btrfs_fs_closing(fs_info)) > > return -EINVAL; > > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > > atomic_inc(&fs_info->scrubs_running); > > mutex_unlock(&fs_info->scrub_lock); > > > > + /* > > + * In order to avoid deadlock with reclaim when there is a transaction > > + * trying to pause scrub, make sure we use GFP_NOFS for all the > > + * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity() > > + * invoked by our callees. The pausing request is done when the > > + * transaction commit starts, and it blocks the transaction until scrub > > + * is paused (done at specific points at scrub_stripe() or right above > > + * before incrementing fs_info->scrubs_running). > > This hilights why there's perhaps no point in trying to make the nofs > section smaller, handling all the interactions between scrub and > transaction would be too complex. > > Reviewed-by: David Sterba <dsterba@suse.com> Well, the worker tasks can also not use gfp_kernel, since the scrub task waits for them to complete before pausing. I missed this, and 2 reviewers as well, so perhaps it wasn't that trivial and I shouldn't feel that I miserably failed to identify all cases for something rather trivial. V5 sent.
On Mon, Nov 26, 2018 at 08:10:30PM +0000, Filipe Manana wrote: > On Mon, Nov 26, 2018 at 6:17 PM David Sterba <dsterba@suse.cz> wrote: > > > > On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote: > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > When a transaction commit starts, it attempts to pause scrub and it blocks > > > until the scrub is paused. So while the transaction is blocked waiting for > > > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub, > > > otherwise we risk getting into a deadlock with reclaim. > > > > > > Checking for scrub pause requests is done early at the beginning of the > > > while loop of scrub_stripe() and later in the loop, scrub_extent() and > > > scrub_raid56_parity() are called, which in turn call scrub_pages() and > > > scrub_pages_for_parity() respectively. These last two functions do memory > > > allocations using GFP_KERNEL. Same problem could happen while scrubbing > > > the super blocks, since it calls scrub_pages(). > > > > > > So make sure GFP_NOFS is used for the memory allocations because at any > > > time a scrub pause request can happen from another task that started to > > > commit a transaction. > > > > > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path") > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > --- > > > > > > V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing > > > requests migth happen just after we checked for them. > > > > > > V3: Use memalloc_nofs_save() just like V1 did. > > > > > > V4: Similar problem happened for raid56, which was previously missed, so > > > deal with it as well as the case for scrub_supers(). > > > > Enclosing the whole scrub to 'nofs' seems like the best option and > > future proof. What I missed in 58c4e173847a was the "don't hold big lock > > under GFP_KERNEL allocation" pattern. > > > > > fs/btrfs/scrub.c | 12 ++++++++++++ > > > 1 file changed, 12 insertions(+) > > > > > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > > > index 3be1456b5116..e08b7502d1f0 100644 > > > --- a/fs/btrfs/scrub.c > > > +++ b/fs/btrfs/scrub.c > > > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > > > struct scrub_ctx *sctx; > > > int ret; > > > struct btrfs_device *dev; > > > + unsigned int nofs_flag; > > > > > > if (btrfs_fs_closing(fs_info)) > > > return -EINVAL; > > > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > > > atomic_inc(&fs_info->scrubs_running); > > > mutex_unlock(&fs_info->scrub_lock); > > > > > > + /* > > > + * In order to avoid deadlock with reclaim when there is a transaction > > > + * trying to pause scrub, make sure we use GFP_NOFS for all the > > > + * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity() > > > + * invoked by our callees. The pausing request is done when the > > > + * transaction commit starts, and it blocks the transaction until scrub > > > + * is paused (done at specific points at scrub_stripe() or right above > > > + * before incrementing fs_info->scrubs_running). > > > > This hilights why there's perhaps no point in trying to make the nofs > > section smaller, handling all the interactions between scrub and > > transaction would be too complex. > > > > Reviewed-by: David Sterba <dsterba@suse.com> > > Well, the worker tasks can also not use gfp_kernel, since the scrub > task waits for them to complete before pausing. > I missed this, and 2 reviewers as well, so perhaps it wasn't that > trivial and I shouldn't feel that I miserably failed to identify all > cases for something rather trivial. V5 sent. You can say that you left it there intentionally, such cookies are a good drill for reviewers to stay sharp. When I started the conversions of GFP_NOFS -> GFP_KERNEL, scrub looked quite safe in this respect but turns out it's not. I was wondering if we could add some lock assertions before GFP_KERNEL allocations, like: assert_lock_not_held(fs_info->device_list_mutex) kmalloc(GFP_KERNEL) and add more locks per subsystem (eg. the scrub lock) and possibly some convenience wrappers. Michal's scope GFS_NOFS patch series has a debugging warning where NOFS is used in context where it does not need to, while having the 'must not hold an important lock' would be a good debugging helper too.
On Wed, Nov 28, 2018 at 2:22 PM David Sterba <dsterba@suse.cz> wrote: > > On Mon, Nov 26, 2018 at 08:10:30PM +0000, Filipe Manana wrote: > > On Mon, Nov 26, 2018 at 6:17 PM David Sterba <dsterba@suse.cz> wrote: > > > > > > On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote: > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > When a transaction commit starts, it attempts to pause scrub and it blocks > > > > until the scrub is paused. So while the transaction is blocked waiting for > > > > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub, > > > > otherwise we risk getting into a deadlock with reclaim. > > > > > > > > Checking for scrub pause requests is done early at the beginning of the > > > > while loop of scrub_stripe() and later in the loop, scrub_extent() and > > > > scrub_raid56_parity() are called, which in turn call scrub_pages() and > > > > scrub_pages_for_parity() respectively. These last two functions do memory > > > > allocations using GFP_KERNEL. Same problem could happen while scrubbing > > > > the super blocks, since it calls scrub_pages(). > > > > > > > > So make sure GFP_NOFS is used for the memory allocations because at any > > > > time a scrub pause request can happen from another task that started to > > > > commit a transaction. > > > > > > > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path") > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > --- > > > > > > > > V2: Make using GFP_NOFS unconditionial. Previous version was racy, as pausing > > > > requests migth happen just after we checked for them. > > > > > > > > V3: Use memalloc_nofs_save() just like V1 did. > > > > > > > > V4: Similar problem happened for raid56, which was previously missed, so > > > > deal with it as well as the case for scrub_supers(). > > > > > > Enclosing the whole scrub to 'nofs' seems like the best option and > > > future proof. What I missed in 58c4e173847a was the "don't hold big lock > > > under GFP_KERNEL allocation" pattern. > > > > > > > fs/btrfs/scrub.c | 12 ++++++++++++ > > > > 1 file changed, 12 insertions(+) > > > > > > > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > > > > index 3be1456b5116..e08b7502d1f0 100644 > > > > --- a/fs/btrfs/scrub.c > > > > +++ b/fs/btrfs/scrub.c > > > > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > > > > struct scrub_ctx *sctx; > > > > int ret; > > > > struct btrfs_device *dev; > > > > + unsigned int nofs_flag; > > > > > > > > if (btrfs_fs_closing(fs_info)) > > > > return -EINVAL; > > > > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, > > > > atomic_inc(&fs_info->scrubs_running); > > > > mutex_unlock(&fs_info->scrub_lock); > > > > > > > > + /* > > > > + * In order to avoid deadlock with reclaim when there is a transaction > > > > + * trying to pause scrub, make sure we use GFP_NOFS for all the > > > > + * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity() > > > > + * invoked by our callees. The pausing request is done when the > > > > + * transaction commit starts, and it blocks the transaction until scrub > > > > + * is paused (done at specific points at scrub_stripe() or right above > > > > + * before incrementing fs_info->scrubs_running). > > > > > > This hilights why there's perhaps no point in trying to make the nofs > > > section smaller, handling all the interactions between scrub and > > > transaction would be too complex. > > > > > > Reviewed-by: David Sterba <dsterba@suse.com> > > > > Well, the worker tasks can also not use gfp_kernel, since the scrub > > task waits for them to complete before pausing. > > I missed this, and 2 reviewers as well, so perhaps it wasn't that > > trivial and I shouldn't feel that I miserably failed to identify all > > cases for something rather trivial. V5 sent. > > You can say that you left it there intentionally, such cookies are a > good drill for reviewers to stay sharp. Unfortunately for me, it was not on purpose. However there's the small drill of, for the workers only, potentially moving the memalloc_nofs_save() and restore to scrub_handle_errored_block(), since the only two possible gfp_kernel allocations for workers are during the case where a repair is needed: scrub_bio_end_io_worker() scrub_block_complete() scrub_handle_errored_block() lock_full_stripe() insert_full_stripe_lock() -> kmalloc with GFP_KERNEL scrub_bio_end_io_worker() scrub_block_complete() scrub_handle_errored_block() scrub_write_page_to_dev_replace() scrub_add_page_to_wr_bio() -> kzalloc with GFP_KERNEL Just to avoid some duplication. > > When I started the conversions of GFP_NOFS -> GFP_KERNEL, scrub looked > quite safe in this respect but turns out it's not. I was wondering if we > could add some lock assertions before GFP_KERNEL allocations, like: > > assert_lock_not_held(fs_info->device_list_mutex) > kmalloc(GFP_KERNEL) > > and add more locks per subsystem (eg. the scrub lock) and possibly some > convenience wrappers. Michal's scope GFS_NOFS patch series has a > debugging warning where NOFS is used in context where it does not need > to, while having the 'must not hold an important lock' would be a good > debugging helper too.
On Wed, Nov 28, 2018 at 02:40:07PM +0000, Filipe Manana wrote: > > > Well, the worker tasks can also not use gfp_kernel, since the scrub > > > task waits for them to complete before pausing. > > > I missed this, and 2 reviewers as well, so perhaps it wasn't that > > > trivial and I shouldn't feel that I miserably failed to identify all > > > cases for something rather trivial. V5 sent. > > > > You can say that you left it there intentionally, such cookies are a > > good drill for reviewers to stay sharp. > > Unfortunately for me, it was not on purpose. > > However there's the small drill of, for the workers only, potentially > moving the memalloc_nofs_save() and > restore to scrub_handle_errored_block(), since the only two possible > gfp_kernel allocations for workers > are during the case where a repair is needed: > > scrub_bio_end_io_worker() > scrub_block_complete() > scrub_handle_errored_block() > lock_full_stripe() > insert_full_stripe_lock() > -> kmalloc with GFP_KERNEL > > > scrub_bio_end_io_worker() > scrub_block_complete() > scrub_handle_errored_block() > scrub_write_page_to_dev_replace() > scrub_add_page_to_wr_bio() > -> kzalloc with GFP_KERNEL > > Just to avoid some duplication. Sounds like a nice cleanup and more in line with the idea of scoped GFP_NOFS, the markers should be at a higher level. Starting at the bottom of the callstack is fine as it's obvious where we want the nofs protection, and then push it up the call chain. That way it's easier to review. Please send a followup patch, thanks.
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 3be1456b5116..e08b7502d1f0 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, struct scrub_ctx *sctx; int ret; struct btrfs_device *dev; + unsigned int nofs_flag; if (btrfs_fs_closing(fs_info)) return -EINVAL; @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, atomic_inc(&fs_info->scrubs_running); mutex_unlock(&fs_info->scrub_lock); + /* + * In order to avoid deadlock with reclaim when there is a transaction + * trying to pause scrub, make sure we use GFP_NOFS for all the + * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity() + * invoked by our callees. The pausing request is done when the + * transaction commit starts, and it blocks the transaction until scrub + * is paused (done at specific points at scrub_stripe() or right above + * before incrementing fs_info->scrubs_running). + */ + nofs_flag = memalloc_nofs_save(); if (!is_dev_replace) { /* * by holding device list mutex, we can @@ -3895,6 +3906,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, if (!ret) ret = scrub_enumerate_chunks(sctx, dev, start, end, is_dev_replace); + memalloc_nofs_restore(nofs_flag); wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); atomic_dec(&fs_info->scrubs_running);