
xfs: fix deadlock when set label online

Message ID 20230626131542.3711391-1-yangerkun@huaweicloud.com (mailing list archive)
State Deferred, archived
Series xfs: fix deadlock when set label online

Commit Message

yangerkun June 26, 2023, 1:15 p.m. UTC
From: yangerkun <yangerkun@huawei.com>

The combined use of xfs_trans_bhold and xfs_trans_set_sync in
xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
concurrently. xlog_ioend_work will first unpin the sb buffer (getting
stuck in xfs_buf_lock) and only then wake up xfs_sync_sb_buf. However,
xfs_sync_sb_buf never gets a chance to unlock the sb buffer until it is
woken by xlog_ioend_work.

xfs_sync_sb_buf
  xfs_trans_getsb // lock sb buf
  xfs_trans_bhold // sb buf keep lock until success commit
  xfs_trans_commit
  ...
    xfs_log_force_seq
      xlog_force_lsn
        xlog_wait_on_iclog
          xlog_wait(&iclog->ic_force_wait... // shutdown happened
  xfs_buf_relse // unlock sb buf

xlog_ioend_work
  xlog_force_shutdown
    xlog_state_shutdown_callbacks
      xlog_cil_process_committed
        xlog_cil_committed
        ...
        xfs_buf_item_unpin
          xfs_buf_lock // deadlock
      wake_up_all(&iclog->ic_force_wait)

xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace sees the
superblock change immediately. We can achieve the same by calling
xfs_sync_sb and then xfs_ail_push_all_sync, which also fixes the
deadlock.
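
In code form, the deadlock-prone sequence inside xfs_sync_sb_buf looks
like this (abridged from the current function; see its removal in the
diff below):

	bp = xfs_trans_getsb(tp);	/* locks the sb buffer */
	xfs_log_sb(tp);
	xfs_trans_bhold(tp, bp);	/* keep it locked across the commit */
	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);	/* sleeps in xlog_wait(); on shutdown,
					 * the waker must take bp's lock first */
	...
	xfs_buf_relse(bp);		/* never reached during the hang */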

Fixes: f7664b31975b ("xfs: implement online get/set fs label")
Signed-off-by: yangerkun <yangerkun@huawei.com>
---
 fs/xfs/libxfs/xfs_sb.c | 32 --------------------------------
 fs/xfs/libxfs/xfs_sb.h |  1 -
 fs/xfs/xfs_ioctl.c     |  5 ++++-
 3 files changed, 4 insertions(+), 34 deletions(-)

Comments

Dave Chinner June 26, 2023, 9:45 p.m. UTC | #1
On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
> From: yangerkun <yangerkun@huawei.com>
> 
> The combined use of xfs_trans_bhold and xfs_trans_set_sync in
> xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
> concurrently. xlog_ioend_work will first unpin the sb buffer (getting
> stuck in xfs_buf_lock) and only then wake up xfs_sync_sb_buf. However,
> xfs_sync_sb_buf never gets a chance to unlock the sb buffer until it is
> woken by xlog_ioend_work.
> 
> xfs_sync_sb_buf
>   xfs_trans_getsb // lock sb buf
>   xfs_trans_bhold // sb buf keep lock until success commit
>   xfs_trans_commit
>   ...
>     xfs_log_force_seq
>       xlog_force_lsn
>         xlog_wait_on_iclog
>           xlog_wait(&iclog->ic_force_wait... // shutdown happened
>   xfs_buf_relse // unlock sb buf
>
> xlog_ioend_work
>   xlog_force_shutdown
>     xlog_state_shutdown_callbacks
>       xlog_cil_process_committed
>         xlog_cil_committed
>         ...
>         xfs_buf_item_unpin
>           xfs_buf_lock // deadlock
>       wake_up_all(&iclog->ic_force_wait)
> 
> xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace sees the
> superblock change immediately. We can achieve the same by calling
> xfs_sync_sb and then xfs_ail_push_all_sync, which also fixes the
> deadlock.

Why is this deadlock specific to the superblock buffer?

Can't any buffer that is held locked over a synchronous transaction
commit deadlock during a shutdown like this?

Cheers,

Dave.
yangerkun June 27, 2023, 8:42 a.m. UTC | #2
On 2023/6/27 5:45, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
>> From: yangerkun <yangerkun@huawei.com>
>>
>> The combined use of xfs_trans_bhold and xfs_trans_set_sync in
>> xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
>> concurrently. xlog_ioend_work will first unpin the sb buffer (getting
>> stuck in xfs_buf_lock) and only then wake up xfs_sync_sb_buf. However,
>> xfs_sync_sb_buf never gets a chance to unlock the sb buffer until it is
>> woken by xlog_ioend_work.
>>
>> xfs_sync_sb_buf
>>    xfs_trans_getsb // lock sb buf
>>    xfs_trans_bhold // sb buf keep lock until success commit
>>    xfs_trans_commit
>>    ...
>>      xfs_log_force_seq
>>        xlog_force_lsn
>>          xlog_wait_on_iclog
>>            xlog_wait(&iclog->ic_force_wait... // shutdown happened
>>    xfs_buf_relse // unlock sb buf
>>
>> xlog_ioend_work
>>    xlog_force_shutdown
>>      xlog_state_shutdown_callbacks
>>        xlog_cil_process_committed
>>          xlog_cil_committed
>>          ...
>>          xfs_buf_item_unpin
>>            xfs_buf_lock // deadlock
>>        wake_up_all(&iclog->ic_force_wait)
>>
>> xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace sees the
>> superblock change immediately. We can achieve the same by calling
>> xfs_sync_sb and then xfs_ail_push_all_sync, which also fixes the
>> deadlock.
> 
> Why is this deadlock specific to the superblock buffer?

Hi Dave,

Thanks a lot for your review! We found this problem while doing some
code reading (which helped us fix another growfs bug), and then
reproduced it easily by setting the label online frequently with IO
error injection at the same time.

> 
> Can't any buffer that is held locked over a synchronous transaction
> commit deadlock during a shutdown like this?

After checking every place that uses xfs_trans_bhold, it seems
xfs_sync_sb_buf is the only culprit that combines xfs_trans_bhold and
xfs_trans_set_sync (I'm not familiar with xfs yet, so my code check may
have problems)...

Thanks,
Yang Erkun.
> 
> Cheers,
> 
> Dave.
Dave Chinner June 28, 2023, 11:10 p.m. UTC | #3
On Tue, Jun 27, 2023 at 04:42:41PM +0800, yangerkun wrote:
> 
> 
> On 2023/6/27 5:45, Dave Chinner wrote:
> > On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
> > > From: yangerkun <yangerkun@huawei.com>
> > > 
> > > The combined use of xfs_trans_bhold and xfs_trans_set_sync in
> > > xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
> > > concurrently. xlog_ioend_work will first unpin the sb buffer (getting
> > > stuck in xfs_buf_lock) and only then wake up xfs_sync_sb_buf. However,
> > > xfs_sync_sb_buf never gets a chance to unlock the sb buffer until it
> > > is woken by xlog_ioend_work.
> > > 
> > > xfs_sync_sb_buf
> > >    xfs_trans_getsb // lock sb buf
> > >    xfs_trans_bhold // sb buf keep lock until success commit
> > >    xfs_trans_commit
> > >    ...
> > >      xfs_log_force_seq
> > >        xlog_force_lsn
> > >          xlog_wait_on_iclog
> > >            xlog_wait(&iclog->ic_force_wait... // shutdown happened
> > >    xfs_buf_relse // unlock sb buf
> > > 
> > > xlog_ioend_work
> > >    xlog_force_shutdown
> > >      xlog_state_shutdown_callbacks
> > >        xlog_cil_process_committed
> > >          xlog_cil_committed
> > >          ...
> > >          xfs_buf_item_unpin
> > >            xfs_buf_lock // deadlock
> > >        wake_up_all(&iclog->ic_force_wait)
> > > 
> > > xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace sees
> > > the superblock change immediately. We can achieve the same by calling
> > > xfs_sync_sb and then xfs_ail_push_all_sync, which also fixes the
> > > deadlock.
> > 
> > Why is this deadlock specific to the superblock buffer?
> 
> Hi Dave,
> 
> Thanks a lot for your review! We found this problem while doing some
> code reading (which helped us fix another growfs bug), and then
> reproduced it easily by setting the label online frequently with IO
> error injection at the same time.

Right, I know how it can be triggered; that's not actually my
concern...

> > Can't any buffer that is held locked over a synchronous transaction
> > commit deadlock during a shutdown like this?
> 
> After checking every place that uses xfs_trans_bhold, it seems
> xfs_sync_sb_buf is the only culprit that combines xfs_trans_bhold and
> xfs_trans_set_sync (I'm not familiar with xfs yet, so my code check may
> have problems)...

Yes, I can also see that. But my concern is that this change only
addresses the symptom, but leaves the underlying deadlock unsolved.

Indeed, this isn't xfs_trans_commit() I'm worried about here; it's
the call to xfs_log_force(mp, XFS_LOG_SYNC) or
xfs_log_force_seq(XFS_LOG_SYNC) with a buffer held locked that I'm
worried about.

i.e. We have a buffer in the CIL (from a previous transaction) that
we currently hold locked while we call xfs_log_force(XFS_LOG_SYNC).
If a shutdown occurs while we are waiting for journal IO completion
to occur, then xlog_ioend_work() will attempt to lock the buffer and
deadlock, right?
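
Schematically, the pattern I'm worried about looks something like this
(an illustrative sketch, not a real call chain):

	/* bp is dirty in the CIL from an earlier transaction */
	xfs_buf_lock(bp);
	xfs_log_force(mp, XFS_LOG_SYNC);	/* sleeps on iclog completion */
	/*
	 * Shutdown: xlog_ioend_work() -> xlog_force_shutdown() -> ... ->
	 * xfs_buf_item_unpin() needs xfs_buf_lock(bp) to abort the item
	 * before the log force waiters are woken: deadlock.
	 */
	xfs_buf_unlock(bp);			/* never reached */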

e.g. I'm thinking of things like busy extent flushing (hold AGF +
AGFL + AG btree blocks locked when we call xfs_log_force()) could
also be vulnerable to the same deadlock...

If that's true, how do we avoid the shutdown from causing a deadlock
in these situations?

Cheers,

Dave.
yangerkun June 29, 2023, 11:55 a.m. UTC | #4
On 2023/6/29 7:10, Dave Chinner wrote:
> On Tue, Jun 27, 2023 at 04:42:41PM +0800, yangerkun wrote:
>>
>>
>> On 2023/6/27 5:45, Dave Chinner wrote:
>>> On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
>>>> From: yangerkun <yangerkun@huawei.com>
>>>>
>>>> The combined use of xfs_trans_bhold and xfs_trans_set_sync in
>>>> xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
>>>> concurrently. xlog_ioend_work will first unpin the sb buffer (getting
>>>> stuck in xfs_buf_lock) and only then wake up xfs_sync_sb_buf. However,
>>>> xfs_sync_sb_buf never gets a chance to unlock the sb buffer until it
>>>> is woken by xlog_ioend_work.
>>>>
>>>> xfs_sync_sb_buf
>>>>     xfs_trans_getsb // lock sb buf
>>>>     xfs_trans_bhold // sb buf keep lock until success commit
>>>>     xfs_trans_commit
>>>>     ...
>>>>       xfs_log_force_seq
>>>>         xlog_force_lsn
>>>>           xlog_wait_on_iclog
>>>>             xlog_wait(&iclog->ic_force_wait... // shutdown happened
>>>>     xfs_buf_relse // unlock sb buf
>>>>
>>>> xlog_ioend_work
>>>>     xlog_force_shutdown
>>>>       xlog_state_shutdown_callbacks
>>>>         xlog_cil_process_committed
>>>>           xlog_cil_committed
>>>>           ...
>>>>           xfs_buf_item_unpin
>>>>             xfs_buf_lock // deadlock
>>>>         wake_up_all(&iclog->ic_force_wait)
>>>>
>>>> xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace sees
>>>> the superblock change immediately. We can achieve the same by calling
>>>> xfs_sync_sb and then xfs_ail_push_all_sync, which also fixes the
>>>> deadlock.
>>>
>>> Why is this deadlock specific to the superblock buffer?
>>
>> Hi Dave,
>>
>> Thanks a lot for your review! We found this problem while doing some
>> code reading (which helped us fix another growfs bug), and then
>> reproduced it easily by setting the label online frequently with IO
>> error injection at the same time.
> 
> Right, I know how it can be triggered; that's not actually my
> concern...
> 
>>> Can't any buffer that is held locked over a synchronous transaction
>>> commit deadlock during a shutdown like this?
>>
>> After checking every place that uses xfs_trans_bhold, it seems
>> xfs_sync_sb_buf is the only culprit that combines xfs_trans_bhold and
>> xfs_trans_set_sync (I'm not familiar with xfs yet, so my code check may
>> have problems)...
> 
> Yes, I can also see that. But my concern is that this change only
> addresses the symptom, but leaves the underlying deadlock unsolved.
> 
> Indeed, this isn't xfs_trans_commit() I'm worried about here; it's
> the call to xfs_log_force(mp, XFS_LOG_SYNC) or
> xfs_log_force_seq(XFS_LOG_SYNC) with a buffer held locked that I'm
> worried about.
> 
> i.e. We have a buffer in the CIL (from a previous transaction) that
> we currently hold locked while we call xfs_log_force(XFS_LOG_SYNC).
> If a shutdown occurs while we are waiting for journal IO completion
> to occur, then xlog_ioend_work() will attempt to lock the buffer and
> deadlock, right?
> 
> e.g. I'm thinking of things like busy extent flushing (hold AGF +
> AGFL + AG btree blocks locked when we call xfs_log_force()) could
> also be vulnerable to the same deadlock...

You mean something like xfs_allocbt_alloc_block (which calls
xfs_log_force to flush busy extents while keeping the AGF locked)?

We call xfs_log_force(mp, XFS_LOG_SYNC) after locking the AGF and
before xfs_trans_commit. That seems OK, since xfs_buf_item_unpin will
not call xfs_buf_lock while bli_refcount is still elevated (once we
hold the locked AGF, bli_refcount is incremented in _xfs_trans_bjoin
and stays held until xfs_trans_commit succeeds (clean AGF item) or
.iop_unpin runs (dirty AGF item, called from xlog_ioend_work), which
can also happen after xfs_trans_commit)...
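
A much-simplified sketch of the unpin logic as I read it (paraphrasing
fs/xfs/xfs_buf_item.c; most details elided):

	freed = atomic_dec_and_test(&bip->bli_refcount);
	...
	if (freed && remove) {		/* last reference + shutdown abort */
		xfs_buf_lock(bp);	/* the only path where unpin can
					 * block on the buffer lock */
		...
	}

So as long as the transaction holding the locked AGF still owns its own
bli_refcount reference, freed stays false and xfs_buf_item_unpin never
takes the buffer lock.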


> 
> If that's true, how do we avoid the shutdown from causing a deadlock
> in these situations?
> 
> Cheers,
> 
> Dave.
Dave Chinner June 29, 2023, 10:24 p.m. UTC | #5
On Thu, Jun 29, 2023 at 07:55:10PM +0800, yangerkun wrote:
> On 2023/6/29 7:10, Dave Chinner wrote:
> > On Tue, Jun 27, 2023 at 04:42:41PM +0800, yangerkun wrote:
> > > On 2023/6/27 5:45, Dave Chinner wrote:
> > > > On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
> > > > > From: yangerkun <yangerkun@huawei.com>
> > > > > 
> > > > > The combined use of xfs_trans_bhold and xfs_trans_set_sync in
> > > > > xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
> > > > > concurrently. xlog_ioend_work will first unpin the sb buffer
> > > > > (getting stuck in xfs_buf_lock) and only then wake up
> > > > > xfs_sync_sb_buf. However, xfs_sync_sb_buf never gets a chance to
> > > > > unlock the sb buffer until it is woken by xlog_ioend_work.
> > > > > 
> > > > > xfs_sync_sb_buf
> > > > >     xfs_trans_getsb // lock sb buf
> > > > >     xfs_trans_bhold // sb buf keep lock until success commit
> > > > >     xfs_trans_commit
> > > > >     ...
> > > > >       xfs_log_force_seq
> > > > >         xlog_force_lsn
> > > > >           xlog_wait_on_iclog
> > > > >             xlog_wait(&iclog->ic_force_wait... // shutdown happened
> > > > >     xfs_buf_relse // unlock sb buf
> > > > > 
> > > > > xlog_ioend_work
> > > > >     xlog_force_shutdown
> > > > >       xlog_state_shutdown_callbacks
> > > > >         xlog_cil_process_committed
> > > > >           xlog_cil_committed
> > > > >           ...
> > > > >           xfs_buf_item_unpin
> > > > >             xfs_buf_lock // deadlock
> > > > >         wake_up_all(&iclog->ic_force_wait)
> > > > > 
> > > > > xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace
> > > > > sees the superblock change immediately. We can achieve the same
> > > > > by calling xfs_sync_sb and then xfs_ail_push_all_sync, which also
> > > > > fixes the deadlock.
> > > > 
> > > > Why is this deadlock specific to the superblock buffer?
> > > 
> > > Hi Dave,
> > > 
> > > Thanks a lot for your review! We found this problem while doing some
> > > code reading (which helped us fix another growfs bug), and then
> > > reproduced it easily by setting the label online frequently with IO
> > > error injection at the same time.
> > 
> > Right, I know how it can be triggered; that's not actually my
> > concern...
> > 
> > > > Can't any buffer that is held locked over a synchronous transaction
> > > > commit deadlock during a shutdown like this?
> > > 
> > > After checking every place that uses xfs_trans_bhold, it seems
> > > xfs_sync_sb_buf is the only culprit that combines xfs_trans_bhold and
> > > xfs_trans_set_sync (I'm not familiar with xfs yet, so my code check
> > > may have problems)...
> > 
> > Yes, I can also see that. But my concern is that this change only
> > addresses the symptom, but leaves the underlying deadlock unsolved.
> > 
> > Indeed, this isn't xfs_trans_commit() I'm worried about here; it's
> > the call to xfs_log_force(mp, XFS_LOG_SYNC) or
> > xfs_log_force_seq(XFS_LOG_SYNC) with a buffer held locked that I'm
> > worried about.
> > 
> > i.e. We have a buffer in the CIL (from a previous transaction) that
> > we currently hold locked while we call xfs_log_force(XFS_LOG_SYNC).
> > If a shutdown occurs while we are waiting for journal IO completion
> > to occur, then xlog_ioend_work() will attempt to lock the buffer and
> > deadlock, right?
> > 
> > e.g. I'm thinking of things like busy extent flushing (hold AGF +
> > AGFL + AG btree blocks locked when we call xfs_log_force()) could
> > also be vulnerable to the same deadlock...
> 
> You mean something like xfs_allocbt_alloc_block (which calls
> xfs_log_force to flush busy extents while keeping the AGF locked)?
> 
> We call xfs_log_force(mp, XFS_LOG_SYNC) after locking the AGF and
> before xfs_trans_commit. That seems OK, since xfs_buf_item_unpin will
> not call xfs_buf_lock while bli_refcount is still elevated (once we
> hold the locked AGF, bli_refcount is incremented in _xfs_trans_bjoin
> and stays held until xfs_trans_commit succeeds (clean AGF item) or
> .iop_unpin runs (dirty AGF item, called from xlog_ioend_work), which
> can also happen after xfs_trans_commit)...

Again, I gave an example of the class of issue I'm worried about.
Again, you chased the one example given through, but haven't
mentioned a thing about all the other code paths that lead to
xfs_log_force(SYNC) that might hold buffers locked that I didn't
mention.

I don't want to have to ask every person who proposes a fix about
every possible code path the bug may manifest in -one at a time-.  I
use examples to point you in the right direction for further
analysis of the rest of the code base, not because that's the only
thing I want checked. Please use your initiative to look at all the
callers of xfs_log_force(SYNC) and determine if they are all safe or
whether there are landmines lurking or even more bugs of a similar
sort.

When we learn about a new issue, this is the sort of audit work that
is necessary to determine the scope of the issue. We need to perform
such audits because they direct the scope of the fix necessary. We
are not interested in slapping a band-aid fix over the symptom that
was reported - that only leads to more band-aid fixes as the same
issue appears in other places.

Now we know there is a lock ordering problem in this code, so before
we attempt to fix it we need to know how widespread it is, what the
impact is, how different code paths avoid it, etc. That requires a
code audit to determine, and that requires looking at all the paths
into xfs_log_force(XFS_LOG_SYNC) to determine if they are safe or
not and documenting that.

Yes, it's more work *right now* than slapping a quick band-aid fix
over it, but it's much less work in the long run for us and we don't
have to keep playing whack-a-mole because we fixed it the right way
the first time.

-Dave.
yangerkun June 30, 2023, 2:19 a.m. UTC | #6
On 2023/6/30 6:24, Dave Chinner wrote:
> On Thu, Jun 29, 2023 at 07:55:10PM +0800, yangerkun wrote:
>> On 2023/6/29 7:10, Dave Chinner wrote:
>>> On Tue, Jun 27, 2023 at 04:42:41PM +0800, yangerkun wrote:
>>>> On 2023/6/27 5:45, Dave Chinner wrote:
>>>>> On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
>>>>>> From: yangerkun <yangerkun@huawei.com>
>>>>>>
>>>>>> The combined use of xfs_trans_bhold and xfs_trans_set_sync in
>>>>>> xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
>>>>>> concurrently. xlog_ioend_work will first unpin the sb buffer
>>>>>> (getting stuck in xfs_buf_lock) and only then wake up
>>>>>> xfs_sync_sb_buf. However, xfs_sync_sb_buf never gets a chance to
>>>>>> unlock the sb buffer until it is woken by xlog_ioend_work.
>>>>>>
>>>>>> xfs_sync_sb_buf
>>>>>>      xfs_trans_getsb // lock sb buf
>>>>>>      xfs_trans_bhold // sb buf keep lock until success commit
>>>>>>      xfs_trans_commit
>>>>>>      ...
>>>>>>        xfs_log_force_seq
>>>>>>          xlog_force_lsn
>>>>>>            xlog_wait_on_iclog
>>>>>>              xlog_wait(&iclog->ic_force_wait... // shutdown happened
>>>>>>      xfs_buf_relse // unlock sb buf
>>>>>>
>>>>>> xlog_ioend_work
>>>>>>      xlog_force_shutdown
>>>>>>        xlog_state_shutdown_callbacks
>>>>>>          xlog_cil_process_committed
>>>>>>            xlog_cil_committed
>>>>>>            ...
>>>>>>            xfs_buf_item_unpin
>>>>>>              xfs_buf_lock // deadlock
>>>>>>          wake_up_all(&iclog->ic_force_wait)
>>>>>>
>>>>>> xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace
>>>>>> sees the superblock change immediately. We can achieve the same
>>>>>> by calling xfs_sync_sb and then xfs_ail_push_all_sync, which also
>>>>>> fixes the deadlock.
>>>>>
>>>>> Why is this deadlock specific to the superblock buffer?
>>>>
>>>> Hi Dave,
>>>>
>>>> Thanks a lot for your review! We found this problem while doing some
>>>> code reading (which helped us fix another growfs bug), and then
>>>> reproduced it easily by setting the label online frequently with IO
>>>> error injection at the same time.
>>>
>>> Right, I know how it can be triggered; that's not actually my
>>> concern...
>>>
>>>>> Can't any buffer that is held locked over a synchronous transaction
>>>>> commit deadlock during a shutdown like this?
>>>>
>>>> After checking every place that uses xfs_trans_bhold, it seems
>>>> xfs_sync_sb_buf is the only culprit that combines xfs_trans_bhold and
>>>> xfs_trans_set_sync (I'm not familiar with xfs yet, so my code check
>>>> may have problems)...
>>>
>>> Yes, I can also see that. But my concern is that this change only
>>> addresses the symptom, but leaves the underlying deadlock unsolved.
>>>
>>> Indeed, this isn't xfs_trans_commit() I'm worried about here; it's
>>> the call to xfs_log_force(mp, XFS_LOG_SYNC) or
>>> xfs_log_force_seq(XFS_LOG_SYNC) with a buffer held locked that I'm
>>> worried about.
>>>
>>> i.e. We have a buffer in the CIL (from a previous transaction) that
>>> we currently hold locked while we call xfs_log_force(XFS_LOG_SYNC).
>>> If a shutdown occurs while we are waiting for journal IO completion
>>> to occur, then xlog_ioend_work() will attempt to lock the buffer and
>>> deadlock, right?
>>>
>>> e.g. I'm thinking of things like busy extent flushing (hold AGF +
>>> AGFL + AG btree blocks locked when we call xfs_log_force()) could
>>> also be vulnerable to the same deadlock...
>>
>> You mean something like xfs_allocbt_alloc_block (which calls
>> xfs_log_force to flush busy extents while keeping the AGF locked)?
>>
>> We call xfs_log_force(mp, XFS_LOG_SYNC) after locking the AGF and
>> before xfs_trans_commit. That seems OK, since xfs_buf_item_unpin will
>> not call xfs_buf_lock while bli_refcount is still elevated (once we
>> hold the locked AGF, bli_refcount is incremented in _xfs_trans_bjoin
>> and stays held until xfs_trans_commit succeeds (clean AGF item) or
>> .iop_unpin runs (dirty AGF item, called from xlog_ioend_work), which
>> can also happen after xfs_trans_commit)...
> 
> Again, I gave an example of the class of issue I'm worried about.
> Again, you chased the one example given through, but haven't
> mentioned a thing about all the other code paths that lead to
> xfs_log_force(SYNC) that might hold buffers locked that I didn't
> mention.
> 
> I don't want to have to ask every person who proposes a fix about
> every possible code path the bug may manifest in -one at a time-.  I
> use examples to point you in the right direction for further
> analysis of the rest of the code base, not because that's the only
> thing I want checked. Please use your initiative to look at all the
> callers of xfs_log_force(SYNC) and determine if they are all safe or
> whether there are landmines lurking or even more bugs of a similar
> sort.

Hi Dave,

Thank you very much for pointing this out! I'm sorry I wasn't aware
that a comprehensive investigation was needed into whether any other
place can trigger the bug too...

> 
> When we learn about a new issue, this is the sort of audit work that
> is necessary to determine the scope of the issue. We need to perform
> such audits because they direct the scope of the fix necessary. We
> are not interested in slapping a band-aid fix over the symptom that
> was reported - that only leads to more band-aid fixes as the same
> issue appears in other places.

Yes, I agree with you, and thanks for your advice; it really helps me
avoid a band-aid fix that would only lead to more band-aid fixes, so I
can contribute better!

> 
> Now we know there is a lock ordering problem in this code, so before
> we attempt to fix it we need to know how widespread it is, what the
> impact is, how different code paths avoid it, etc. That requires a
> code audit to determine, and that requires looking at all the paths
> into xfs_log_force(XFS_LOG_SYNC) to determine if they are safe or
> not and documenting that.
> 
> Yes, it's more work *right now* than slapping a quick band-aid fix
> over it, but it's much less work in the long run for us and we don't
> have to keep playing whack-a-mole because we fixed it the right way
> the first time.
> 

I will try to look at all paths into xfs_log_force(XFS_LOG_SYNC) or
xfs_log_force_seq(XFS_LOG_SYNC) and check whether each is safe. Thanks
again for your advice!

Thanks,
Yang Erkun.

> -Dave.
yangerkun Aug. 11, 2023, 6:21 a.m. UTC | #7
On 2023/6/30 10:19, yangerkun wrote:
> 
> 
> On 2023/6/30 6:24, Dave Chinner wrote:
>> On Thu, Jun 29, 2023 at 07:55:10PM +0800, yangerkun wrote:
>>> On 2023/6/29 7:10, Dave Chinner wrote:
>>>> On Tue, Jun 27, 2023 at 04:42:41PM +0800, yangerkun wrote:
>>>>> On 2023/6/27 5:45, Dave Chinner wrote:
>>>>>> On Mon, Jun 26, 2023 at 09:15:42PM +0800, yangerkun wrote:
>>>>>>> From: yangerkun <yangerkun@huawei.com>
>>>>>>>
>>>>>>> The combined use of xfs_trans_bhold and xfs_trans_set_sync in
>>>>>>> xfs_sync_sb_buf can trigger a deadlock once a shutdown happens
>>>>>>> concurrently. xlog_ioend_work will first unpin the sb buffer
>>>>>>> (getting stuck in xfs_buf_lock) and only then wake up
>>>>>>> xfs_sync_sb_buf. However, xfs_sync_sb_buf never gets a chance to
>>>>>>> unlock the sb buffer until it is woken by xlog_ioend_work.
>>>>>>>
>>>>>>> xfs_sync_sb_buf
>>>>>>>      xfs_trans_getsb // lock sb buf
>>>>>>>      xfs_trans_bhold // sb buf keep lock until success commit
>>>>>>>      xfs_trans_commit
>>>>>>>      ...
>>>>>>>        xfs_log_force_seq
>>>>>>>          xlog_force_lsn
>>>>>>>            xlog_wait_on_iclog
>>>>>>>              xlog_wait(&iclog->ic_force_wait... // shutdown happened
>>>>>>>      xfs_buf_relse // unlock sb buf
>>>>>>>
>>>>>>> xlog_ioend_work
>>>>>>>      xlog_force_shutdown
>>>>>>>        xlog_state_shutdown_callbacks
>>>>>>>          xlog_cil_process_committed
>>>>>>>            xlog_cil_committed
>>>>>>>            ...
>>>>>>>            xfs_buf_item_unpin
>>>>>>>              xfs_buf_lock // deadlock
>>>>>>>          wake_up_all(&iclog->ic_force_wait)
>>>>>>>
>>>>>>> xfs_ioc_setlabel uses xfs_sync_sb_buf to make sure userspace
>>>>>>> sees the superblock change immediately. We can achieve the same
>>>>>>> by calling xfs_sync_sb and then xfs_ail_push_all_sync, which also
>>>>>>> fixes the deadlock.
>>>>>>
>>>>>> Why is this deadlock specific to the superblock buffer?
>>>>>
>>>>> Hi Dave,
>>>>>
>>>>> Thanks a lot for your review! We found this problem while doing
>>>>> some code reading (which helped us fix another growfs bug), and then
>>>>> reproduced it easily by setting the label online frequently with IO
>>>>> error injection at the same time.
>>>>
>>>> Right, I know how it can be triggered; that's not actually my
>>>> concern...
>>>>
>>>>>> Can't any buffer that is held locked over a synchronous transaction
>>>>>> commit deadlock during a shutdown like this?
>>>>>
>>>>> After checking every place that uses xfs_trans_bhold, it seems
>>>>> xfs_sync_sb_buf is the only culprit that combines xfs_trans_bhold
>>>>> and xfs_trans_set_sync (I'm not familiar with xfs yet, so my code
>>>>> check may have problems)...
>>>>
>>>> Yes, I can also see that. But my concern is that this change only
>>>> addresses the symptom, but leaves the underlying deadlock unsolved.
>>>>
>>>> Indeed, this isn't xfs_trans_commit() I'm worried about here; it's
>>>> the call to xfs_log_force(mp, XFS_LOG_SYNC) or
>>>> xfs_log_force_seq(XFS_LOG_SYNC) with a buffer held locked that I'm
>>>> worried about.
>>>>
>>>> i.e. We have a buffer in the CIL (from a previous transaction) that
>>>> we currently hold locked while we call xfs_log_force(XFS_LOG_SYNC).
>>>> If a shutdown occurs while we are waiting for journal IO completion
>>>> to occur, then xlog_ioend_work() will attempt to lock the buffer and
>>>> deadlock, right?
>>>>
>>>> e.g. I'm thinking of things like busy extent flushing (hold AGF +
>>>> AGFL + AG btree blocks locked when we call xfs_log_force()) could
>>>> also be vulnerable to the same deadlock...
>>>
>>> You mean something like xfs_allocbt_alloc_block (which calls
>>> xfs_log_force to flush busy extents while keeping the AGF locked)?
>>>
>>> We call xfs_log_force(mp, XFS_LOG_SYNC) after locking the AGF and
>>> before xfs_trans_commit. That seems OK, since xfs_buf_item_unpin will
>>> not call xfs_buf_lock while bli_refcount is still elevated (once we
>>> hold the locked AGF, bli_refcount is incremented in _xfs_trans_bjoin
>>> and stays held until xfs_trans_commit succeeds (clean AGF item) or
>>> .iop_unpin runs (dirty AGF item, called from xlog_ioend_work), which
>>> can also happen after xfs_trans_commit)...
>>
>> Again, I gave an example of the class of issue I'm worried about.
>> Again, you chased the one example given through, but haven't
>> mentioned a thing about all the other code paths that lead to
>> xfs_log_force(SYNC) that might hold buffers locked that I didn't
>> mention.
>>
>> I don't want to have to ask every person who proposes a fix about
>> every possible code path the bug may manifest in -one at a time-.  I
>> use examples to point you in the right direction for further
>> analysis of the rest of the code base, not because that's the only
>> thing I want checked. Please use your initiative to look at all the
>> callers of xfs_log_force(SYNC) and determine if they are all safe or
>> whether there are landmines lurking or even more bugs of a similar
>> sort.
> 
> Hi Dave,
> 
> Thank you very much for pointing this out! I'm sorry I wasn't aware
> that a comprehensive investigation was needed into whether any other
> place can trigger the bug too...
> 
>>
>> When we learn about a new issue, this is the sort of audit work that
>> is necessary to determine the scope of the issue. We need to perform
>> such audits because they direct the scope of the fix necessary. We
>> are not interested in slapping a band-aid fix over the symptom that
>> was reported - that only leads to more band-aid fixes as the same
>> issue appears in other places.
> 
> Yes, I agree with you, and thanks for your advice; it really helps me
> avoid a band-aid fix that would only lead to more band-aid fixes, so I
> can contribute better!
> 
>>
>> Now we know there is a lock ordering problem in this code, so before
>> we attempt to fix it we need to know how widespread it is, what the
>> impact is, how different code paths avoid it, etc. That requires a
>> code audit to determine, and that requires looking at all the paths
>> into xfs_log_force(XFS_LOG_SYNC) to determine if they are safe or
>> not and documenting that.
>>
>> Yes, it's more work *right now* than slapping a quick band-aid fix
>> over it, but it's much less work in the long run for us and we don't
>> have to keep playing whack-a-mole because we fixed it the right way
>> the first time.
>>
> 
> I will try to look at all paths into xfs_log_force(XFS_LOG_SYNC) or
> xfs_log_force_seq(XFS_LOG_SYNC) and check whether each is safe. Thanks
> again for your advice!
> 
> Thanks,
> Yang Erkun.
> 
>> -Dave.
> 

Hi, Dave,

Sorry for the late reply; I was quite busy last month, and it also took
me a long time to check whether all the callers of
xfs_log_force(SYNC)/xfs_log_force_seq(SYNC) are safe. I'm not familiar
with xfs yet, so if there's anything wrong with the description below,
please point it out!

The approach I chose was to check whether we call
xfs_log_force(SYNC)/xfs_log_force_seq(SYNC) between
xfs_buf_lock/xfs_buf_trylock and xfs_buf_unlock in the same thread
context, and, wherever that happens, whether it can trigger the bug
too. (I have checked the other items' .iop_unpin implementations; it
seems only the xfs_buf item can trigger the problem, since it is the
one that tries to lock the buffer in xfs_buf_item_unpin. Besides,
calling xfs_buf_lock and xfs_log_force(SYNC)/xfs_log_force_seq(SYNC)
in one thread context and xfs_buf_unlock in another is safe, since
that unlock does not wait for the log force to return successfully.)

I divide the callers of xfs_buf_lock/xfs_buf_trylock into two
categories:

1. The xfs_buf will later join the tp (xfs_trans_bjoin will inc
.bli_refcount; the lifecycle is sketched after this list)

a. xfs_trans_bjoin will inc .bli_refcount
b. xfs_buf_item_pin will inc .bli_refcount when the item of the xfs_buf
is dirty
c. xfs_buf_item_committing will dec .bli_refcount whether the item is
dirty or not; normally it unlocks the xfs_buf, but keeps it locked when
XFS_BLI_HOLD is set
d. xfs_buf_item_unpin will dec .bli_refcount, and it won't call
xfs_buf_lock while another .bli_refcount reference exists
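
Putting a-d together (a sketch of my understanding, heavily simplified
and not verbatim kernel code):

	xfs_trans_bjoin(tp, bp);	  /* (a) +1: tp owns a reference */
	...
	xfs_trans_commit(tp);
		xfs_buf_item_pin();	  /* (b) +1 if the item is dirty */
		xfs_buf_item_committing();/* (c) -1; unlocks bp unless
					   * XFS_BLI_HOLD is set */
	...
	/* later, at journal I/O completion or on shutdown abort: */
	xfs_buf_item_unpin();		  /* (d) -1; takes the buffer lock
					   * only with the last reference */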

xfs_log_force(SYNC)/xfs_log_force_seq(SYNC) can happen before we commit
the tp (like xfs_create: it first reads and locks the AGI buf, then
calls xfs_dir_createname, which may trigger an AGFL fixup, and the
block allocation may see a busy extent and call xfs_log_force(SYNC) to
flush the busy extent journal). That won't trigger the problem, since
xfs_trans_bjoin keeps another .bli_refcount reference.

xfs_log_force_seq(SYNC) can happen when we commit the tp because we see
XFS_TRANS_SYNC (see __xfs_trans_commit); the only way to trigger the
deadlock there is to have also called xfs_trans_bhold, since otherwise
the xfs_buf is unlocked in xfs_buf_item_committing. The case this patch
tries to fix is the only place that combines xfs_trans_bhold with
XFS_TRANS_SYNC.
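
For reference, the sync tail of __xfs_trans_commit, as I read it
(simplified; quoted from memory, so the details may be off):

	if (sync) {
		/* sync commit waits here with any bhold'd buffer locked */
		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
	}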

After the tp commits, an xfs_buf only stays locked because of
xfs_trans_bhold, and once there is an XFS_TRANS_PERM_LOG_RES tp we may
trigger another commit, so xfs_log_force(SYNC)/xfs_log_force_seq(SYNC)
can happen then too. But that is safe as well, since we first rejoin
the xfs_buf to the tp, which protects us.

2. The xfs_buf won't join the tp

xfs_buf_readahead_map
xfs_buf_read
xfs_buf_get
xfs_buf_incore
xfs_buf_delwri_cancel
xfs_buf_delwri_submit_buffers
xfs_buf_delwri_pushbuf
xfs_buf_item_unpin
xfs_iflush_shutdown_abort
xfs_log_quiesce
xlog_do_recover
xfs_freesb
xfs_add_incompat_log_feature
xfs_clear_incompat_log_features

Most of the cases above are IO paths. When xfs_buf_unlock is called
from another thread context, it is safe; when xfs_buf_unlock is called
from the same thread context, it is safe too, since we don't trigger
xfs_log_force(SYNC)/xfs_log_force_seq(SYNC) between xfs_buf_lock and
xfs_buf_unlock.


From the above, it seems only xfs_sync_sb_buf can trigger this
deadlock... And I would prefer to add a comment to xfs_trans_bhold
noting that combining xfs_trans_bhold with
xfs_log_force(SYNC)/xfs_log_force_seq(SYNC) can deadlock; a rough
sketch of such a comment follows.
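
Something along these lines, say (the wording is only a first draft):

	/*
	 * xfs_trans_bhold() keeps the buffer locked across the
	 * transaction commit.  Do not combine it with a synchronous
	 * commit (xfs_trans_set_sync()) or any other XFS_LOG_SYNC log
	 * force issued while the buffer is still held locked: if a
	 * shutdown races with the log force, xlog_ioend_work() can end
	 * up in xfs_buf_item_unpin() -> xfs_buf_lock() on this buffer
	 * before it wakes the log force waiter, and both sides deadlock.
	 */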


Dave, sorry again for the late reply, and I look forward to your
response!

Thanks,
Yang Erkun.

Patch

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index ba0f17bc1dc0..8b89e65c5e1e 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1089,38 +1089,6 @@  xfs_update_secondary_sbs(
 	return saved_error ? saved_error : error;
 }
 
-/*
- * Same behavior as xfs_sync_sb, except that it is always synchronous and it
- * also writes the superblock buffer to disk sector 0 immediately.
- */
-int
-xfs_sync_sb_buf(
-	struct xfs_mount	*mp)
-{
-	struct xfs_trans	*tp;
-	struct xfs_buf		*bp;
-	int			error;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0, 0, &tp);
-	if (error)
-		return error;
-
-	bp = xfs_trans_getsb(tp);
-	xfs_log_sb(tp);
-	xfs_trans_bhold(tp, bp);
-	xfs_trans_set_sync(tp);
-	error = xfs_trans_commit(tp);
-	if (error)
-		goto out;
-	/*
-	 * write out the sb buffer to get the changes to disk
-	 */
-	error = xfs_bwrite(bp);
-out:
-	xfs_buf_relse(bp);
-	return error;
-}
-
 void
 xfs_fs_geometry(
 	struct xfs_mount	*mp,
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index a5e14740ec9a..fb420312c476 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -15,7 +15,6 @@  struct xfs_perag;
 
 extern void	xfs_log_sb(struct xfs_trans *tp);
 extern int	xfs_sync_sb(struct xfs_mount *mp, bool wait);
-extern int	xfs_sync_sb_buf(struct xfs_mount *mp);
 extern void	xfs_sb_mount_common(struct xfs_mount *mp, struct xfs_sb *sbp);
 extern void	xfs_sb_from_disk(struct xfs_sb *to, struct xfs_dsb *from);
 extern void	xfs_sb_to_disk(struct xfs_dsb *to, struct xfs_sb *from);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 55bb01173cde..9bacc71d2c9e 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -27,6 +27,7 @@ 
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_acl.h"
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
@@ -1798,9 +1799,11 @@  xfs_ioc_setlabel(
 	 * buffered reads from userspace (i.e. from blkid) are invalidated,
 	 * and userspace will see the newly-written label.
 	 */
-	error = xfs_sync_sb_buf(mp);
+	error = xfs_sync_sb(mp, true);
 	if (error)
 		goto out;
+
+	xfs_ail_push_all_sync(mp->m_ail);
 	/*
 	 * growfs also updates backup supers so lock against that.
 	 */