bfq: silence lockdep for bfqd/ioc lock inversion

Message ID 20210319060015.3979352-1-khazhy@google.com (mailing list archive)
State New, archived
Series bfq: silence lockdep for bfqd/ioc lock inversion

Commit Message

Khazhy Kumykov March 19, 2021, 6 a.m. UTC
lockdep warns of circular locking due to a lock inversion between
bfq_insert_requests and bfq_exit_icq. If we end up freeing a request
while merging, we *may* grab ioc->lock, if that request holds the last
reference to its ioc. bfq_bio_merge could potentially take the locks in
this order as well. bfq_exit_icq, conversely, grabs bfqd->lock but is
always called with ioc->lock held.

bfq_exit_icq may be called either from put_io_context_active with the
ioc refcount raised, from ioc_release_fn after the last refcount has
already been dropped, or from ioc_clear_queue, which is only called
while the queue is quiesced or exiting, so the inverted orderings
should never actually conflict.

Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
---
 block/bfq-iosched.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Noticed this lockdep warning while running xfstests (generic/464) on
top of a bfq block device. I was also able to tease it out with a
binary that issues requests which end up merging, while rapidly
swapping the active scheduler. As far as I could see, the deadlock
would not actually occur, so this patch opts to change the lock class
for the inverted case.
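
As a reference for what "change the lock class" means here: lockdep
keys its dependency graph on (lock class, subclass), so taking
bfqd->lock at subclass 1 in the exit path registers as a node distinct
from the ordinary subclass-0 acquisitions, and the two orderings no
longer close a cycle. A minimal sketch (illustration only, names as in
the patch below):

        unsigned long flags;

        /*
         * Called with ioc->lock already held. Subclass 1 gives this
         * acquisition a lockdep class separate from the plain
         * spin_lock_irq(&bfqd->lock) in the insert/merge paths, so the
         * chains
         *   bfqd->lock[0] -> ioc->lock  and  ioc->lock -> bfqd->lock[1]
         * do not form a cycle in lockdep's graph.
         */
        spin_lock_irqsave_nested(&bfqd->lock, flags, 1);
        ...
        spin_unlock_irqrestore(&bfqd->lock, flags);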

bfqd -> ioc:
[ 2995.524557] __lock_acquire+0x18f5/0x2660
[ 2995.524562] lock_acquire+0xb4/0x3a0
[ 2995.524565] _raw_spin_lock_irqsave+0x3f/0x60
[ 2995.524569] put_io_context+0x33/0x90   -> ioc->lock grabbed
[ 2995.524573] blk_mq_free_request+0x51/0x140
[ 2995.524577] blk_put_request+0xe/0x10
[ 2995.524580] blk_attempt_req_merge+0x1d/0x30
[ 2995.524585] elv_attempt_insert_merge+0x56/0xa0
[ 2995.524590] blk_mq_sched_try_insert_merge+0x4b/0x60
[ 2995.524595] bfq_insert_requests+0x9e/0x18c0    -> bfqd->lock grabbed
[ 2995.524598] blk_mq_sched_insert_requests+0xd6/0x2b0
[ 2995.524602] blk_mq_flush_plug_list+0x154/0x280
[ 2995.524606] blk_finish_plug+0x40/0x60
[ 2995.524609] ext4_writepages+0x696/0x1320
[ 2995.524614] do_writepages+0x1c/0x80
[ 2995.524621] __filemap_fdatawrite_range+0xd7/0x120
[ 2995.524625] sync_file_range+0xac/0xf0
[ 2995.524642] __x64_sys_sync_file_range+0x44/0x70
[ 2995.524646] do_syscall_64+0x31/0x40
[ 2995.524649] entry_SYSCALL_64_after_hwframe+0x44/0xae

ioc -> bfqd
[ 2995.524490] _raw_spin_lock_irqsave+0x3f/0x60
[ 2995.524498] bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
[ 2995.524512] put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
[ 2995.524516] exit_io_context+0x48/0x50
[ 2995.524519] do_exit+0x7e9/0xdd0
[ 2995.524526] do_group_exit+0x54/0xc0
[ 2995.524530] __x64_sys_exit_group+0x18/0x20
[ 2995.524534] do_syscall_64+0x31/0x40
[ 2995.524537] entry_SYSCALL_64_after_hwframe+0x44/0xae

Another trace where we grab ioc -> bfqd through bfq_exit_icq is when
changing the elevator:
               -> #1 (&(&bfqd->lock)->rlock){-.-.}:
[  646.890820]        lock_acquire+0x9b/0x140
[  646.894868]        _raw_spin_lock_irqsave+0x3b/0x50
[  646.899707]        bfq_exit_icq_bfqq+0x47/0x1f0
[  646.904196]        bfq_exit_icq+0x21/0x30
[  646.908160]        ioc_destroy_icq+0xf3/0x130
[  646.912466]        ioc_clear_queue+0xb8/0x140
[  646.916771]        elevator_switch_mq+0xa4/0x3c0
[  646.921333]        elevator_switch+0x5f/0x340

Comments

Jan Kara April 14, 2021, 9:54 a.m. UTC | #1
On Thu 18-03-21 23:00:15, Khazhismel Kumykov wrote:
> lockdep warns of circular locking due to a lock inversion between
> bfq_insert_requests and bfq_exit_icq. If we end up freeing a request
> while merging, we *may* grab ioc->lock, if that request holds the last
> reference to its ioc. bfq_bio_merge could potentially take the locks in
> this order as well. bfq_exit_icq, conversely, grabs bfqd->lock but is
> always called with ioc->lock held.
> 
> bfq_exit_icq may be called either from put_io_context_active with the
> ioc refcount raised, from ioc_release_fn after the last refcount has
> already been dropped, or from ioc_clear_queue, which is only called
> while the queue is quiesced or exiting, so the inverted orderings
> should never actually conflict.
> 
> Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
> 
> Signed-off-by: Khazhismel Kumykov <khazhy@google.com>

I've just hit the same lockdep complaint. When looking at this, another
option to solve it seemed to be to modify bfq_bio_merge() like:

        ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
 
+       spin_unlock_irq(&bfqd->lock);
        if (free)
                blk_mq_free_request(free);
-       spin_unlock_irq(&bfqd->lock);

        return ret;

to release the request outside of bfqd->lock. AFAICT there's no good
reason why we are actually freeing the request under bfqd->lock, and it
would seem a bit safer than annotating away the lockdep complaint (much
as I don't see a problem with your analysis). Paolo?
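
With that change applied, the tail of bfq_bio_merge() would read roughly
as below (a sketch of the result, not the exact file contents):

        ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);

        /*
         * Drop bfqd->lock before freeing: blk_mq_free_request() may take
         * ioc->lock when it puts the last io_context reference, and
         * taking ioc->lock while holding bfqd->lock is exactly the
         * ordering the lockdep report complains about.
         */
        spin_unlock_irq(&bfqd->lock);
        if (free)
                blk_mq_free_request(free);

        return ret;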

								Honza
Khazhy Kumykov April 14, 2021, 6:33 p.m. UTC | #2
On Wed, Apr 14, 2021 at 2:54 AM Jan Kara <jack@suse.cz> wrote:
> I've just hit the same lockdep complaint. When looking at this, another
> option to solve it seemed to be to modify bfq_bio_merge() like:
>
>         ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
>
> +       spin_unlock_irq(&bfqd->lock);
>         if (free)
>                 blk_mq_free_request(free);
> -       spin_unlock_irq(&bfqd->lock);
>
>         return ret;
>
> to release the request outside of bfqd->lock. AFAICT there's no good
> reason why we are actually freeing the request under bfqd->lock, and it
> would seem a bit safer than annotating away the lockdep complaint (much
> as I don't see a problem with your analysis). Paolo?

If we can re-order the locking so we don't need the annotation, that
seems better ("the inversion is OK so long as either the queue is
frozen or we hold an ioc refcount, and we only grab ioc->lock normally
when we drop the last refcount" is a tad "clever"). Though we still
need to deal with blk_mq_sched_try_insert_merge, which can potentially
free a request. (See the first stack trace.) Something simple that I
wasn't sure of: could we defer the bfq_exit_icq work and avoid the
inversion entirely? That would be simpler to analyze.

Khazhy

Jan Kara April 15, 2021, 10:47 a.m. UTC | #3
On Wed 14-04-21 11:33:14, Khazhy Kumykov wrote:
> If we can re-order the locking so we don't need the annotation, that
> seems better ("the inversion is OK so long as either the queue is
> frozen or we hold an ioc refcount, and we only grab ioc->lock normally
> when we drop the last refcount" is a tad "clever"). Though we still
> need to deal with blk_mq_sched_try_insert_merge, which can potentially
> free a request.

I see, right.

> (See the first stack trace.) Something simple that I wasn't sure of:
> could we defer the bfq_exit_icq work and avoid the inversion entirely?
> That would be simpler to analyze.

That's problematic because the ICQ (referencing BFQQs etc.) is going to
be freed after the RCU grace period expires, so we cannot really
postpone the teardown of the bfq_io_cq. What we could do is modify
blk_mq_sched_try_insert_merge() so that it returns the request to free,
similarly to blk_mq_sched_try_merge(). Then we can free the request
after dropping bfqd->lock.
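
Roughly like this (a sketch of the idea only -- the extra out-parameter
of blk_mq_sched_try_insert_merge() is hypothetical, and the change would
also have to be threaded through elv_attempt_insert_merge()):

        /* caller side, e.g. in bfq_insert_request(), sketched: */
        struct request *free = NULL;

        spin_lock_irq(&bfqd->lock);
        if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
                spin_unlock_irq(&bfqd->lock);
                /* free outside bfqd->lock, as in bfq_bio_merge() */
                if (free)
                        blk_mq_free_request(free);
                return;
        }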

								Honza

Paolo Valente May 5, 2021, 1:35 p.m. UTC | #4
Hi Jan, Khazhy,
sorry for my super delay. Thanks Khazhy for spotting this, and Jan for
proposing alternative solutions.

> On 15 Apr 2021, at 12:47, Jan Kara <jack@suse.cz> wrote:
>> 
>> If we can re-order the locking so we don't need the annotation, that
>> seems better ("the inversion is OK so long as either the queue is
>> frozen or we hold an ioc refcount, and we only grab ioc->lock normally
>> when we drop the last refcount" is a tad "clever"). Though we still
>> need to deal with blk_mq_sched_try_insert_merge, which can potentially
>> free a request.
> 
> I see, right.
> 

Trying to put the pieces together:
1) Moving the invocation of spin_unlock_irq(&bfqd->lock) earlier in
bfq_bio_merge(), as suggested by Jan, seems OK to me as well. I also
proposed this change several years ago, but received no feedback. So
I followed the conservative approach of not touching what apparently
works :)
2) If I'm not missing anything, then what Jan suggests below is OK as
well. That is, in blk_mq_sched_try_insert_merge, ioc->lock gets grabbed
only in case blk_mq_free_request is invoked. So it is OK to simply move
the invocation of blk_mq_free_request outside of
blk_mq_sched_try_insert_merge, using the same approach as with
blk_mq_sched_try_merge in bfq_bio_merge.

Thanks,
Paolo

>> (See the first stack trace.) Something simple that I wasn't sure of:
>> could we defer the bfq_exit_icq work and avoid the inversion entirely?
>> That would be simpler to analyze.
> 
> That's problematic because the ICQ (referencing BFQQs etc.) is going to
> be freed after the RCU grace period expires, so we cannot really
> postpone the teardown of the bfq_io_cq. What we could do is modify
> blk_mq_sched_try_insert_merge() so that it returns the request to free,
> similarly to blk_mq_sched_try_merge(). Then we can free the request
> after dropping bfqd->lock.
> 
> 								Honza
Khazhy Kumykov May 5, 2021, 7:51 p.m. UTC | #5
On Wed, May 5, 2021 at 6:44 AM Paolo Valente <paolo.valente@linaro.org> wrote:
> Trying to put the pieces together:
> 1) Moving the invocation of spin_unlock_irq(&bfqd->lock) earlier in
> bfq_bio_merge(), as suggested by Jan, seems OK to me as well. I also
> proposed this change several years ago, but received no feedback. So
> I followed the conservative approach of not touching what apparently
> works :)
> 2) If I'm not missing anything, then what Jan suggests below is OK as
> well. That is, in blk_mq_sched_try_insert_merge, ioc->lock gets grabbed
> only in case blk_mq_free_request is invoked. So it is OK to simply move
> the invocation of blk_mq_free_request outside of
> blk_mq_sched_try_insert_merge, using the same approach as with
> blk_mq_sched_try_merge in bfq_bio_merge.

Yes, I believe these two changes together should solve the lockdep
warning. The second change requires a bit more plumbing
(blk_mq_sched_try_insert_merge -> elv_attempt_insert_merge ->
blk_attempt_req_merge). In particular, elv_attempt_insert_merge could
result in freeing multiple requests, so plumbing that up requires some
more thought.
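
One way that plumbing could look (sketch only -- the list out-parameter
and the blk_mq_free_requests() helper are hypothetical here): have
elv_attempt_insert_merge() collect every request it would have freed on
a caller-provided list, and free them after bfqd->lock is dropped:

        LIST_HEAD(free);

        spin_lock_irq(&bfqd->lock);
        if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
                spin_unlock_irq(&bfqd->lock);
                /* hypothetical helper: puts every request queued on @free */
                blk_mq_free_requests(&free);
                return;
        }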


Patch

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 95586137194e..cb50ac0ffe80 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5027,7 +5027,14 @@ static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
 	if (bfqq && bfqd) {
 		unsigned long flags;
 
-		spin_lock_irqsave(&bfqd->lock, flags);
+		/* bfq_exit_icq is usually called with ioc->lock held, which
+		 * is the inverse order from elsewhere, where we may grab
+		 * ioc->lock under bfqd->lock if merging requests drops the
+		 * last ioc refcount. Since exit_icq is called either with a
+		 * refcount held or with the queue quiesced, use a different
+		 * lock class to silence lockdep.
+		 */
+		spin_lock_irqsave_nested(&bfqd->lock, flags, 1);
 		bfqq->bic = NULL;
 		bfq_exit_bfqq(bfqd, bfqq);
 		bic_set_bfqq(bic, NULL, is_sync);