
[v4] Btrfs: fix deadlock with memory reclaim during scrub

Message ID 20181123182540.7206-1-fdmanana@kernel.org (mailing list archive)
State Superseded, archived
Series: [v4] Btrfs: fix deadlock with memory reclaim during scrub

Commit Message

Filipe Manana Nov. 23, 2018, 6:25 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.

Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().

So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.

Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Make using GFP_NOFS unconditional. Previous version was racy, as pause
requests might happen just after we checked for them.

V3: Use memalloc_nofs_save() just like V1 did.

V4: Similar problem happened for raid56, which was previously missed, so
    deal with it as well as the case for scrub_supers().

 fs/btrfs/scrub.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
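
For context, the fix relies on the kernel's scoped allocation API from include/linux/sched/mm.h. A minimal sketch of the pattern (illustration only, not part of the patch itself):

```c
/* memalloc_nofs_save() sets PF_MEMALLOC_NOFS on the current task, so
 * every allocation inside the section behaves as if GFP_NOFS had been
 * passed, regardless of the gfp flags used at the call sites. */
unsigned int nofs_flag;

nofs_flag = memalloc_nofs_save();
/* Even kmalloc(size, GFP_KERNEL) here will not recurse into
 * filesystem reclaim, so it cannot block on a transaction commit
 * that is itself waiting for scrub to pause. */
memalloc_nofs_restore(nofs_flag);
```

This is why the patch only needs the save/restore pair around scrub's main loop rather than changing each individual allocation site.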

Comments

Nikolay Borisov Nov. 26, 2018, 7:27 a.m. UTC | #1
On 23.11.18 г. 20:25 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a transaction commit starts, it attempts to pause scrub and it blocks
> until the scrub is paused. So while the transaction is blocked waiting for
> scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> otherwise we risk getting into a deadlock with reclaim.
> 
> Checking for scrub pause requests is done early at the beginning of the
> while loop of scrub_stripe() and later in the loop, scrub_extent() and
> scrub_raid56_parity() are called, which in turn call scrub_pages() and
> scrub_pages_for_parity() respectively. These last two functions do memory
> allocations using GFP_KERNEL. Same problem could happen while scrubbing
> the super blocks, since it calls scrub_pages().
> 
> So make sure GFP_NOFS is used for the memory allocations because at any
> time a scrub pause request can happen from another task that started to
> commit a transaction.
> 
> Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
> 
> V2: Make using GFP_NOFS unconditional. Previous version was racy, as pause
> requests might happen just after we checked for them.
> 
> V3: Use memalloc_nofs_save() just like V1 did.
> 
> V4: Similar problem happened for raid56, which was previously missed, so
>     deal with it as well as the case for scrub_supers().
> 
>  fs/btrfs/scrub.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3be1456b5116..e08b7502d1f0 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>  	struct scrub_ctx *sctx;
>  	int ret;
>  	struct btrfs_device *dev;
> +	unsigned int nofs_flag;
>  
>  	if (btrfs_fs_closing(fs_info))
>  		return -EINVAL;
> @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>  	atomic_inc(&fs_info->scrubs_running);
>  	mutex_unlock(&fs_info->scrub_lock);
>  
> +	/*
> +	 * In order to avoid deadlock with reclaim when there is a transaction
> +	 * trying to pause scrub, make sure we use GFP_NOFS for all the
> +	 * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> +	 * invoked by our callees. The pausing request is done when the
> +	 * transaction commit starts, and it blocks the transaction until scrub
> +	 * is paused (done at specific points at scrub_stripe() or right above
> +	 * before incrementing fs_info->scrubs_running).
> +	 */
> +	nofs_flag = memalloc_nofs_save();
>  	if (!is_dev_replace) {
>  		/*
>  		 * by holding device list mutex, we can
> @@ -3895,6 +3906,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>  	if (!ret)
>  		ret = scrub_enumerate_chunks(sctx, dev, start, end,
>  					     is_dev_replace);
> +	memalloc_nofs_restore(nofs_flag);
>  
>  	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
>  	atomic_dec(&fs_info->scrubs_running);
>
David Sterba Nov. 26, 2018, 6:17 p.m. UTC | #2
On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a transaction commit starts, it attempts to pause scrub and it blocks
> until the scrub is paused. So while the transaction is blocked waiting for
> scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> otherwise we risk getting into a deadlock with reclaim.
> 
> Checking for scrub pause requests is done early at the beginning of the
> while loop of scrub_stripe() and later in the loop, scrub_extent() and
> scrub_raid56_parity() are called, which in turn call scrub_pages() and
> scrub_pages_for_parity() respectively. These last two functions do memory
> allocations using GFP_KERNEL. Same problem could happen while scrubbing
> the super blocks, since it calls scrub_pages().
> 
> So make sure GFP_NOFS is used for the memory allocations because at any
> time a scrub pause request can happen from another task that started to
> commit a transaction.
> 
> Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Make using GFP_NOFS unconditional. Previous version was racy, as pause
> requests might happen just after we checked for them.
> 
> V3: Use memalloc_nofs_save() just like V1 did.
> 
> V4: Similar problem happened for raid56, which was previously missed, so
>     deal with it as well as the case for scrub_supers().

Enclosing the whole scrub in 'nofs' seems like the best option and is
future proof. What I missed in 58c4e173847a was the "don't hold a big lock
under a GFP_KERNEL allocation" pattern.

>  fs/btrfs/scrub.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3be1456b5116..e08b7502d1f0 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>  	struct scrub_ctx *sctx;
>  	int ret;
>  	struct btrfs_device *dev;
> +	unsigned int nofs_flag;
>  
>  	if (btrfs_fs_closing(fs_info))
>  		return -EINVAL;
> @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>  	atomic_inc(&fs_info->scrubs_running);
>  	mutex_unlock(&fs_info->scrub_lock);
>  
> +	/*
> +	 * In order to avoid deadlock with reclaim when there is a transaction
> +	 * trying to pause scrub, make sure we use GFP_NOFS for all the
> +	 * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> +	 * invoked by our callees. The pausing request is done when the
> +	 * transaction commit starts, and it blocks the transaction until scrub
> +	 * is paused (done at specific points at scrub_stripe() or right above
> +	 * before incrementing fs_info->scrubs_running).

This highlights why there's perhaps no point in trying to make the nofs
section smaller; handling all the interactions between scrub and
transaction would be too complex.

Reviewed-by: David Sterba <dsterba@suse.com>
Filipe Manana Nov. 26, 2018, 8:10 p.m. UTC | #3
On Mon, Nov 26, 2018 at 6:17 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When a transaction commit starts, it attempts to pause scrub and it blocks
> > until the scrub is paused. So while the transaction is blocked waiting for
> > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> > otherwise we risk getting into a deadlock with reclaim.
> >
> > Checking for scrub pause requests is done early at the beginning of the
> > while loop of scrub_stripe() and later in the loop, scrub_extent() and
> > scrub_raid56_parity() are called, which in turn call scrub_pages() and
> > scrub_pages_for_parity() respectively. These last two functions do memory
> > allocations using GFP_KERNEL. Same problem could happen while scrubbing
> > the super blocks, since it calls scrub_pages().
> >
> > So make sure GFP_NOFS is used for the memory allocations because at any
> > time a scrub pause request can happen from another task that started to
> > commit a transaction.
> >
> > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >
> > V2: Make using GFP_NOFS unconditional. Previous version was racy, as pause
> > requests might happen just after we checked for them.
> >
> > V3: Use memalloc_nofs_save() just like V1 did.
> >
> > V4: Similar problem happened for raid56, which was previously missed, so
> >     deal with it as well as the case for scrub_supers().
>
> Enclosing the whole scrub in 'nofs' seems like the best option and is
> future proof. What I missed in 58c4e173847a was the "don't hold big lock
> under GFP_KERNEL allocation" pattern.
>
> >  fs/btrfs/scrub.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 3be1456b5116..e08b7502d1f0 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> >       struct scrub_ctx *sctx;
> >       int ret;
> >       struct btrfs_device *dev;
> > +     unsigned int nofs_flag;
> >
> >       if (btrfs_fs_closing(fs_info))
> >               return -EINVAL;
> > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> >       atomic_inc(&fs_info->scrubs_running);
> >       mutex_unlock(&fs_info->scrub_lock);
> >
> > +     /*
> > +      * In order to avoid deadlock with reclaim when there is a transaction
> > +      * trying to pause scrub, make sure we use GFP_NOFS for all the
> > +      * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> > +      * invoked by our callees. The pausing request is done when the
> > +      * transaction commit starts, and it blocks the transaction until scrub
> > +      * is paused (done at specific points at scrub_stripe() or right above
> > +      * before incrementing fs_info->scrubs_running).
>
> This highlights why there's perhaps no point in trying to make the nofs
> section smaller, handling all the interactions between scrub and
> transaction would be too complex.
>
> Reviewed-by: David Sterba <dsterba@suse.com>

Well, the worker tasks can also not use GFP_KERNEL, since the scrub
task waits for them to complete before pausing.
I missed this, and two reviewers did as well, so perhaps it wasn't that
trivial after all, and I shouldn't feel that I miserably failed to
identify every case of something simple. V5 sent.
David Sterba Nov. 28, 2018, 2:22 p.m. UTC | #4
On Mon, Nov 26, 2018 at 08:10:30PM +0000, Filipe Manana wrote:
> On Mon, Nov 26, 2018 at 6:17 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When a transaction commit starts, it attempts to pause scrub and it blocks
> > > until the scrub is paused. So while the transaction is blocked waiting for
> > > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> > > otherwise we risk getting into a deadlock with reclaim.
> > >
> > > Checking for scrub pause requests is done early at the beginning of the
> > > while loop of scrub_stripe() and later in the loop, scrub_extent() and
> > > scrub_raid56_parity() are called, which in turn call scrub_pages() and
> > > scrub_pages_for_parity() respectively. These last two functions do memory
> > > allocations using GFP_KERNEL. Same problem could happen while scrubbing
> > > the super blocks, since it calls scrub_pages().
> > >
> > > So make sure GFP_NOFS is used for the memory allocations because at any
> > > time a scrub pause request can happen from another task that started to
> > > commit a transaction.
> > >
> > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >
> > > V2: Make using GFP_NOFS unconditional. Previous version was racy, as pause
> > > requests might happen just after we checked for them.
> > >
> > > V3: Use memalloc_nofs_save() just like V1 did.
> > >
> > > V4: Similar problem happened for raid56, which was previously missed, so
> > >     deal with it as well as the case for scrub_supers().
> >
> > Enclosing the whole scrub in 'nofs' seems like the best option and is
> > future proof. What I missed in 58c4e173847a was the "don't hold big lock
> > under GFP_KERNEL allocation" pattern.
> >
> > >  fs/btrfs/scrub.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > >
> > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > > index 3be1456b5116..e08b7502d1f0 100644
> > > --- a/fs/btrfs/scrub.c
> > > +++ b/fs/btrfs/scrub.c
> > > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > >       struct scrub_ctx *sctx;
> > >       int ret;
> > >       struct btrfs_device *dev;
> > > +     unsigned int nofs_flag;
> > >
> > >       if (btrfs_fs_closing(fs_info))
> > >               return -EINVAL;
> > > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > >       atomic_inc(&fs_info->scrubs_running);
> > >       mutex_unlock(&fs_info->scrub_lock);
> > >
> > > +     /*
> > > +      * In order to avoid deadlock with reclaim when there is a transaction
> > > +      * trying to pause scrub, make sure we use GFP_NOFS for all the
> > > +      * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> > > +      * invoked by our callees. The pausing request is done when the
> > > +      * transaction commit starts, and it blocks the transaction until scrub
> > > +      * is paused (done at specific points at scrub_stripe() or right above
> > > +      * before incrementing fs_info->scrubs_running).
> >
> > This highlights why there's perhaps no point in trying to make the nofs
> > section smaller, handling all the interactions between scrub and
> > transaction would be too complex.
> >
> > Reviewed-by: David Sterba <dsterba@suse.com>
> 
> > Well, the worker tasks can also not use GFP_KERNEL, since the scrub
> task waits for them to complete before pausing.
> I missed this, and 2 reviewers as well, so perhaps it wasn't that
> trivial and I shouldn't feel that I miserably failed to identify all
> cases for something rather trivial. V5 sent.

You can say that you left it there intentionally, such cookies are a
good drill for reviewers to stay sharp.

When I started the conversions of GFP_NOFS -> GFP_KERNEL, scrub looked
quite safe in this respect but turns out it's not. I was wondering if we
could add some lock assertions before GFP_KERNEL allocations, like:

	assert_lock_not_held(fs_info->device_list_mutex)
	kmalloc(GFP_KERNEL)

and add more locks per subsystem (e.g. the scrub lock) and possibly some
convenience wrappers. Michal's scoped GFP_NOFS patch series has a
debugging warning for where NOFS is used in a context that does not need
it, and having a 'must not hold an important lock' check would be a good
debugging helper too.
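
The suggested assertion could be sketched on top of lockdep, roughly as follows (hypothetical: the assert_lock_not_held() name comes from the mail above and was not an existing kernel API at the time of this thread; lockdep_is_held() and debug_locks are real):

```c
/* Hypothetical helper in the spirit of the suggestion: warn once if a
 * GFP_KERNEL allocation is attempted while a lock that reclaim might
 * need is still held by the current task. */
#define assert_lock_not_held(lock) \
	WARN_ON_ONCE(debug_locks && lockdep_is_held(lock))

/* Usage, as in the example above:
 *	assert_lock_not_held(&fs_info->device_list_mutex);
 *	ptr = kmalloc(size, GFP_KERNEL);
 */
```
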
Filipe Manana Nov. 28, 2018, 2:40 p.m. UTC | #5
On Wed, Nov 28, 2018 at 2:22 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Nov 26, 2018 at 08:10:30PM +0000, Filipe Manana wrote:
> > On Mon, Nov 26, 2018 at 6:17 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Fri, Nov 23, 2018 at 06:25:40PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > When a transaction commit starts, it attempts to pause scrub and it blocks
> > > > until the scrub is paused. So while the transaction is blocked waiting for
> > > > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> > > > otherwise we risk getting into a deadlock with reclaim.
> > > >
> > > > Checking for scrub pause requests is done early at the beginning of the
> > > > while loop of scrub_stripe() and later in the loop, scrub_extent() and
> > > > scrub_raid56_parity() are called, which in turn call scrub_pages() and
> > > > scrub_pages_for_parity() respectively. These last two functions do memory
> > > > allocations using GFP_KERNEL. Same problem could happen while scrubbing
> > > > the super blocks, since it calls scrub_pages().
> > > >
> > > > So make sure GFP_NOFS is used for the memory allocations because at any
> > > > time a scrub pause request can happen from another task that started to
> > > > commit a transaction.
> > > >
> > > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > ---
> > > >
> > > > V2: Make using GFP_NOFS unconditional. Previous version was racy, as pause
> > > > requests might happen just after we checked for them.
> > > >
> > > > V3: Use memalloc_nofs_save() just like V1 did.
> > > >
> > > > V4: Similar problem happened for raid56, which was previously missed, so
> > > >     deal with it as well as the case for scrub_supers().
> > >
> > > Enclosing the whole scrub in 'nofs' seems like the best option and is
> > > future proof. What I missed in 58c4e173847a was the "don't hold big lock
> > > under GFP_KERNEL allocation" pattern.
> > >
> > > >  fs/btrfs/scrub.c | 12 ++++++++++++
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > > > index 3be1456b5116..e08b7502d1f0 100644
> > > > --- a/fs/btrfs/scrub.c
> > > > +++ b/fs/btrfs/scrub.c
> > > > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > > >       struct scrub_ctx *sctx;
> > > >       int ret;
> > > >       struct btrfs_device *dev;
> > > > +     unsigned int nofs_flag;
> > > >
> > > >       if (btrfs_fs_closing(fs_info))
> > > >               return -EINVAL;
> > > > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > > >       atomic_inc(&fs_info->scrubs_running);
> > > >       mutex_unlock(&fs_info->scrub_lock);
> > > >
> > > > +     /*
> > > > +      * In order to avoid deadlock with reclaim when there is a transaction
> > > > +      * trying to pause scrub, make sure we use GFP_NOFS for all the
> > > > +      * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> > > > +      * invoked by our callees. The pausing request is done when the
> > > > +      * transaction commit starts, and it blocks the transaction until scrub
> > > > +      * is paused (done at specific points at scrub_stripe() or right above
> > > > +      * before incrementing fs_info->scrubs_running).
> > >
> > > This highlights why there's perhaps no point in trying to make the nofs
> > > section smaller, handling all the interactions between scrub and
> > > transaction would be too complex.
> > >
> > > Reviewed-by: David Sterba <dsterba@suse.com>
> >
> > > Well, the worker tasks can also not use GFP_KERNEL, since the scrub
> > task waits for them to complete before pausing.
> > I missed this, and 2 reviewers as well, so perhaps it wasn't that
> > trivial and I shouldn't feel that I miserably failed to identify all
> > cases for something rather trivial. V5 sent.
>
> You can say that you left it there intentionally, such cookies are a
> good drill for reviewers to stay sharp.

Unfortunately for me, it was not on purpose.

However, there's the small drill of, for the workers only, potentially
moving the memalloc_nofs_save() and restore calls into
scrub_handle_errored_block(), since the only two possible GFP_KERNEL
allocations for workers happen when a repair is needed:

scrub_bio_end_io_worker()
  scrub_block_complete()
    scrub_handle_errored_block()
      lock_full_stripe()
        insert_full_stripe_lock()
          -> kmalloc with GFP_KERNEL


scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       scrub_write_page_to_dev_replace()
         scrub_add_page_to_wr_bio()
           -> kzalloc with GFP_KERNEL

Just to avoid some duplication.

>
> When I started the conversions of GFP_NOFS -> GFP_KERNEL, scrub looked
> quite safe in this respect but turns out it's not. I was wondering if we
> could add some lock assertions before GFP_KERNEL allocations, like:
>
>         assert_lock_not_held(fs_info->device_list_mutex)
>         kmalloc(GFP_KERNEL)
>
> and add more locks per subsystem (eg. the scrub lock) and possibly some
> convenience wrappers. Michal's scoped GFP_NOFS patch series has a
> debugging warning where NOFS is used in context where it does not need
> to, while having the 'must not hold an important lock' would be a good
> debugging helper too.
David Sterba Dec. 4, 2018, 2:47 p.m. UTC | #6
On Wed, Nov 28, 2018 at 02:40:07PM +0000, Filipe Manana wrote:
> > > > Well, the worker tasks can also not use GFP_KERNEL, since the scrub
> > > task waits for them to complete before pausing.
> > > I missed this, and 2 reviewers as well, so perhaps it wasn't that
> > > trivial and I shouldn't feel that I miserably failed to identify all
> > > cases for something rather trivial. V5 sent.
> >
> > You can say that you left it there intentionally, such cookies are a
> > good drill for reviewers to stay sharp.
> 
> Unfortunately for me, it was not on purpose.
> 
> However there's the small drill of, for the workers only, potentially
> moving the memalloc_nofs_save() and
> restore to scrub_handle_errored_block(), since the only two possible
> GFP_KERNEL allocations for workers
> are during the case where a repair is needed:
> 
> scrub_bio_end_io_worker()
>   scrub_block_complete()
>     scrub_handle_errored_block()
>       lock_full_stripe()
>         insert_full_stripe_lock()
>           -> kmalloc with GFP_KERNEL
> 
> 
> scrub_bio_end_io_worker()
>    scrub_block_complete()
>      scrub_handle_errored_block()
>        scrub_write_page_to_dev_replace()
>          scrub_add_page_to_wr_bio()
>            -> kzalloc with GFP_KERNEL
> 
> Just to avoid some duplication.

Sounds like a nice cleanup, and more in line with the idea of scoped
GFP_NOFS: the markers should be at a higher level. Starting at the
bottom of the call stack is fine, as it's obvious where we want the NOFS
protection, and it can then be pushed up the call chain. That way it's
easier to review. Please send a followup patch, thanks.
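
A rough sketch of what the follow-up being discussed might look like (hypothetical shape only; the __scrub_handle_errored_block() helper name is invented for illustration):

```c
/* Scope NOFS once at the top of the repair path, covering both
 * GFP_KERNEL allocations reachable from the workers
 * (insert_full_stripe_lock() and scrub_add_page_to_wr_bio()). */
static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
{
	unsigned int nofs_flag;
	int ret;

	nofs_flag = memalloc_nofs_save();
	ret = __scrub_handle_errored_block(sblock_to_check); /* invented helper */
	memalloc_nofs_restore(nofs_flag);
	return ret;
}
```
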

Patch

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3be1456b5116..e08b7502d1f0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3779,6 +3779,7 @@  int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	struct scrub_ctx *sctx;
 	int ret;
 	struct btrfs_device *dev;
+	unsigned int nofs_flag;
 
 	if (btrfs_fs_closing(fs_info))
 		return -EINVAL;
@@ -3882,6 +3883,16 @@  int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	atomic_inc(&fs_info->scrubs_running);
 	mutex_unlock(&fs_info->scrub_lock);
 
+	/*
+	 * In order to avoid deadlock with reclaim when there is a transaction
+	 * trying to pause scrub, make sure we use GFP_NOFS for all the
+	 * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
+	 * invoked by our callees. The pausing request is done when the
+	 * transaction commit starts, and it blocks the transaction until scrub
+	 * is paused (done at specific points at scrub_stripe() or right above
+	 * before incrementing fs_info->scrubs_running).
+	 */
+	nofs_flag = memalloc_nofs_save();
 	if (!is_dev_replace) {
 		/*
 		 * by holding device list mutex, we can
@@ -3895,6 +3906,7 @@  int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	if (!ret)
 		ret = scrub_enumerate_chunks(sctx, dev, start, end,
 					     is_dev_replace);
+	memalloc_nofs_restore(nofs_flag);
 
 	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
 	atomic_dec(&fs_info->scrubs_running);