[RFC,2/2] btrfs: space-info: Don't allow signal to interrupt ticket waiting
diff mbox series

Message ID 20200706074435.52356-3-wqu@suse.com
State New
Headers show
Series
  • btrfs: make ticket wait uninterruptible to address unexpected RO during balance
Related show

Commit Message

Qu Wenruo July 6, 2020, 7:44 a.m. UTC
[BUG]
When balance receive a fatal signal, it can make the fs to read-only
mode if the timing is unlucky enough:

  BTRFS info (device xvdb): balance: start -d -m -s
  BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
  BTRFS info (device xvdb): found 12236 extents, stage: move data extents
  BTRFS info (device xvdb): relocating block group 71928119296 flags data
  BTRFS info (device xvdb): found 3 extents, stage: move data extents
  BTRFS info (device xvdb): found 3 extents, stage: update data pointers
  BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
  BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
  BTRFS info (device xvdb): forced readonly
  BTRFS info (device xvdb): balance: ended with status: -4

[CAUSE]
This is caused by the fact that btrfs ticketing space system can be
interrupted, and cause all kind of -EINTR returned to various critical
section, where we never thought of -EINTR at all.

Even for things like btrfs_start_transaction() can be affected by
signal:
 btrfs_start_transaction()
 |- start_transaction(flush = FLUSH_ALL)
    |- btrfs_block_rsv_add()
       |- btrfs_reserve_metadata_bytes()
          |- __reserve_metadata_bytes()
             |- handle_reserve_ticket()
                |- wait_reserve_ticket()
                   |- prepare_to_wait_event(TASK_KILLABLE)
                   |- ticket->error = -EINTR;

And all related callers get -EINTR error.

In fact, there are really very limited call sites can really handle that
-EINTR properly, above btrfs_drop_snapshot() is one case.

[FIX]
Things like metadata allocation is really a critical section for btrfs,
we don't really want it to be that killable by some impatient users.

In fact, for really long duration calls, it should have their own checks
on signal, like balance, reflink, generic fiemap calls.

So this patch will make ticket waiting uninterruptible, relying on each
long duration calls to handle their signals more properly.

Reported-by: Hans van Kranenburg <hans@knorrie.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/space-info.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Josef Bacik July 6, 2020, 1:45 p.m. UTC | #1
On 7/6/20 3:44 AM, Qu Wenruo wrote:
> [BUG]
> When balance receive a fatal signal, it can make the fs to read-only
> mode if the timing is unlucky enough:
> 
>    BTRFS info (device xvdb): balance: start -d -m -s
>    BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
>    BTRFS info (device xvdb): found 12236 extents, stage: move data extents
>    BTRFS info (device xvdb): relocating block group 71928119296 flags data
>    BTRFS info (device xvdb): found 3 extents, stage: move data extents
>    BTRFS info (device xvdb): found 3 extents, stage: update data pointers
>    BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
>    BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
>    BTRFS info (device xvdb): forced readonly
>    BTRFS info (device xvdb): balance: ended with status: -4
> 
> [CAUSE]
> This is caused by the fact that btrfs ticketing space system can be
> interrupted, and cause all kind of -EINTR returned to various critical
> section, where we never thought of -EINTR at all.
> 
> Even for things like btrfs_start_transaction() can be affected by
> signal:
>   btrfs_start_transaction()
>   |- start_transaction(flush = FLUSH_ALL)
>      |- btrfs_block_rsv_add()
>         |- btrfs_reserve_metadata_bytes()
>            |- __reserve_metadata_bytes()
>               |- handle_reserve_ticket()
>                  |- wait_reserve_ticket()
>                     |- prepare_to_wait_event(TASK_KILLABLE)
>                     |- ticket->error = -EINTR;
> 
> And all related callers get -EINTR error.
> 
> In fact, there are really very limited call sites can really handle that
> -EINTR properly, above btrfs_drop_snapshot() is one case.
> 
> [FIX]
> Things like metadata allocation is really a critical section for btrfs,
> we don't really want it to be that killable by some impatient users.
> 
> In fact, for really long duration calls, it should have their own checks
> on signal, like balance, reflink, generic fiemap calls.
> 
> So this patch will make ticket waiting uninterruptible, relying on each
> long duration calls to handle their signals more properly.
> 

Nope, everybody that calls start_transaction() should be able to handle -EINTR, 
so we need to find those callsites and fix them, not make it so we hang the box 
because we don't like fixing error paths.  Thanks,

Josef
Qu Wenruo July 6, 2020, 1:50 p.m. UTC | #2
On 2020/7/6 下午9:45, Josef Bacik wrote:
> On 7/6/20 3:44 AM, Qu Wenruo wrote:
>> [BUG]
>> When balance receive a fatal signal, it can make the fs to read-only
>> mode if the timing is unlucky enough:
>>
>>    BTRFS info (device xvdb): balance: start -d -m -s
>>    BTRFS info (device xvdb): relocating block group 73001861120 flags
>> metadata
>>    BTRFS info (device xvdb): found 12236 extents, stage: move data
>> extents
>>    BTRFS info (device xvdb): relocating block group 71928119296 flags
>> data
>>    BTRFS info (device xvdb): found 3 extents, stage: move data extents
>>    BTRFS info (device xvdb): found 3 extents, stage: update data pointers
>>    BTRFS info (device xvdb): relocating block group 60922265600 flags
>> metadata
>>    BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4
>> unknown
>>    BTRFS info (device xvdb): forced readonly
>>    BTRFS info (device xvdb): balance: ended with status: -4
>>
>> [CAUSE]
>> This is caused by the fact that btrfs ticketing space system can be
>> interrupted, and cause all kind of -EINTR returned to various critical
>> section, where we never thought of -EINTR at all.
>>
>> Even for things like btrfs_start_transaction() can be affected by
>> signal:
>>   btrfs_start_transaction()
>>   |- start_transaction(flush = FLUSH_ALL)
>>      |- btrfs_block_rsv_add()
>>         |- btrfs_reserve_metadata_bytes()
>>            |- __reserve_metadata_bytes()
>>               |- handle_reserve_ticket()
>>                  |- wait_reserve_ticket()
>>                     |- prepare_to_wait_event(TASK_KILLABLE)
>>                     |- ticket->error = -EINTR;
>>
>> And all related callers get -EINTR error.
>>
>> In fact, there are really very limited call sites can really handle that
>> -EINTR properly, above btrfs_drop_snapshot() is one case.
>>
>> [FIX]
>> Things like metadata allocation is really a critical section for btrfs,
>> we don't really want it to be that killable by some impatient users.
>>
>> In fact, for really long duration calls, it should have their own checks
>> on signal, like balance, reflink, generic fiemap calls.
>>
>> So this patch will make ticket waiting uninterruptible, relying on each
>> long duration calls to handle their signals more properly.
>>
> 
> Nope, everybody that calls start_transaction() should be able to handle
> -EINTR, so we need to find those callsites and fix them, not make it so
> we hang the box because we don't like fixing error paths.  Thanks,

Then we also need to handle btrfs_delalloc_reserve_metadata(),
btrfs_block_rsv_refill(), btrfs_use_block_rsv(), btrfs_block_rsv_add().

Are you really willing to go to that rabbit hole?

To me, there are only limited call sites would benefit from signal
checking, while we need to handle tons of unnecessary possible -EINTR
errors just for no obvious benefits for other calls sites?

That doesn't sound sane to me.

Thanks,
Qu

> 
> Josef
>
Josef Bacik July 6, 2020, 1:53 p.m. UTC | #3
On 7/6/20 9:50 AM, Qu Wenruo wrote:
> 
> 
> On 2020/7/6 下午9:45, Josef Bacik wrote:
>> On 7/6/20 3:44 AM, Qu Wenruo wrote:
>>> [BUG]
>>> When balance receive a fatal signal, it can make the fs to read-only
>>> mode if the timing is unlucky enough:
>>>
>>>     BTRFS info (device xvdb): balance: start -d -m -s
>>>     BTRFS info (device xvdb): relocating block group 73001861120 flags
>>> metadata
>>>     BTRFS info (device xvdb): found 12236 extents, stage: move data
>>> extents
>>>     BTRFS info (device xvdb): relocating block group 71928119296 flags
>>> data
>>>     BTRFS info (device xvdb): found 3 extents, stage: move data extents
>>>     BTRFS info (device xvdb): found 3 extents, stage: update data pointers
>>>     BTRFS info (device xvdb): relocating block group 60922265600 flags
>>> metadata
>>>     BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4
>>> unknown
>>>     BTRFS info (device xvdb): forced readonly
>>>     BTRFS info (device xvdb): balance: ended with status: -4
>>>
>>> [CAUSE]
>>> This is caused by the fact that btrfs ticketing space system can be
>>> interrupted, and cause all kind of -EINTR returned to various critical
>>> section, where we never thought of -EINTR at all.
>>>
>>> Even for things like btrfs_start_transaction() can be affected by
>>> signal:
>>>    btrfs_start_transaction()
>>>    |- start_transaction(flush = FLUSH_ALL)
>>>       |- btrfs_block_rsv_add()
>>>          |- btrfs_reserve_metadata_bytes()
>>>             |- __reserve_metadata_bytes()
>>>                |- handle_reserve_ticket()
>>>                   |- wait_reserve_ticket()
>>>                      |- prepare_to_wait_event(TASK_KILLABLE)
>>>                      |- ticket->error = -EINTR;
>>>
>>> And all related callers get -EINTR error.
>>>
>>> In fact, there are really very limited call sites can really handle that
>>> -EINTR properly, above btrfs_drop_snapshot() is one case.
>>>
>>> [FIX]
>>> Things like metadata allocation is really a critical section for btrfs,
>>> we don't really want it to be that killable by some impatient users.
>>>
>>> In fact, for really long duration calls, it should have their own checks
>>> on signal, like balance, reflink, generic fiemap calls.
>>>
>>> So this patch will make ticket waiting uninterruptible, relying on each
>>> long duration calls to handle their signals more properly.
>>>
>>
>> Nope, everybody that calls start_transaction() should be able to handle
>> -EINTR, so we need to find those callsites and fix them, not make it so
>> we hang the box because we don't like fixing error paths.  Thanks,
> 
> Then we also need to handle btrfs_delalloc_reserve_metadata(),
> btrfs_block_rsv_refill(), btrfs_use_block_rsv(), btrfs_block_rsv_add().
> 

This only needs to be handled for FLUSH_ALL and FLUSH_STEAL.  Anybody doing 
btrfs_start_transaction() should be able to fail with -EINTR, because they 
should be close to the syscall path.  Balance needs to be fixed to deal with it, 
and I assume there might be a few other places, but by-in-large none of these 
places should flip read only.  Thanks,

Josef
Qu Wenruo July 6, 2020, 2:05 p.m. UTC | #4
On 2020/7/6 下午9:53, Josef Bacik wrote:
> On 7/6/20 9:50 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/6 下午9:45, Josef Bacik wrote:
>>> On 7/6/20 3:44 AM, Qu Wenruo wrote:
>>>> [BUG]
>>>> When balance receive a fatal signal, it can make the fs to read-only
>>>> mode if the timing is unlucky enough:
>>>>
>>>>     BTRFS info (device xvdb): balance: start -d -m -s
>>>>     BTRFS info (device xvdb): relocating block group 73001861120 flags
>>>> metadata
>>>>     BTRFS info (device xvdb): found 12236 extents, stage: move data
>>>> extents
>>>>     BTRFS info (device xvdb): relocating block group 71928119296 flags
>>>> data
>>>>     BTRFS info (device xvdb): found 3 extents, stage: move data extents
>>>>     BTRFS info (device xvdb): found 3 extents, stage: update data
>>>> pointers
>>>>     BTRFS info (device xvdb): relocating block group 60922265600 flags
>>>> metadata
>>>>     BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4
>>>> unknown
>>>>     BTRFS info (device xvdb): forced readonly
>>>>     BTRFS info (device xvdb): balance: ended with status: -4
>>>>
>>>> [CAUSE]
>>>> This is caused by the fact that btrfs ticketing space system can be
>>>> interrupted, and cause all kind of -EINTR returned to various critical
>>>> section, where we never thought of -EINTR at all.
>>>>
>>>> Even for things like btrfs_start_transaction() can be affected by
>>>> signal:
>>>>    btrfs_start_transaction()
>>>>    |- start_transaction(flush = FLUSH_ALL)
>>>>       |- btrfs_block_rsv_add()
>>>>          |- btrfs_reserve_metadata_bytes()
>>>>             |- __reserve_metadata_bytes()
>>>>                |- handle_reserve_ticket()
>>>>                   |- wait_reserve_ticket()
>>>>                      |- prepare_to_wait_event(TASK_KILLABLE)
>>>>                      |- ticket->error = -EINTR;
>>>>
>>>> And all related callers get -EINTR error.
>>>>
>>>> In fact, there are really very limited call sites can really handle
>>>> that
>>>> -EINTR properly, above btrfs_drop_snapshot() is one case.
>>>>
>>>> [FIX]
>>>> Things like metadata allocation is really a critical section for btrfs,
>>>> we don't really want it to be that killable by some impatient users.
>>>>
>>>> In fact, for really long duration calls, it should have their own
>>>> checks
>>>> on signal, like balance, reflink, generic fiemap calls.
>>>>
>>>> So this patch will make ticket waiting uninterruptible, relying on each
>>>> long duration calls to handle their signals more properly.
>>>>
>>>
>>> Nope, everybody that calls start_transaction() should be able to handle
>>> -EINTR, so we need to find those callsites and fix them, not make it so
>>> we hang the box because we don't like fixing error paths.  Thanks,
>>
>> Then we also need to handle btrfs_delalloc_reserve_metadata(),
>> btrfs_block_rsv_refill(), btrfs_use_block_rsv(), btrfs_block_rsv_add().
>>
> 
> This only needs to be handled for FLUSH_ALL and FLUSH_STEAL.  Anybody
> doing btrfs_start_transaction() should be able to fail with -EINTR,
> because they should be close to the syscall path.  Balance needs to be
> fixed to deal with it, and I assume there might be a few other places,
> but by-in-large none of these places should flip read only.  Thanks,

There are already ones existing, for btrfs_drop_snapshot().

This is mostly caused by btrfs_delalloc_reserve_metadata(), which always
use FLUSH_ALL unless there is a running trans or its space cache inode.

But still, for a lot of relocation code, we don't really want to bother
the EINTR and just want uninterruptible at least for now.

Any idea for that?
Or just rework how we handle errors in a lot of places?

It doesn't look correct for such a low level mechanism to return -EINTR
while most of callers doesn't really want to bother.

Thanks,
Qu

> 
> Josef
Josef Bacik July 6, 2020, 2:33 p.m. UTC | #5
On 7/6/20 10:05 AM, Qu Wenruo wrote:
> 
> 
> On 2020/7/6 下午9:53, Josef Bacik wrote:
>> On 7/6/20 9:50 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/6 下午9:45, Josef Bacik wrote:
>>>> On 7/6/20 3:44 AM, Qu Wenruo wrote:
>>>>> [BUG]
>>>>> When balance receive a fatal signal, it can make the fs to read-only
>>>>> mode if the timing is unlucky enough:
>>>>>
>>>>>      BTRFS info (device xvdb): balance: start -d -m -s
>>>>>      BTRFS info (device xvdb): relocating block group 73001861120 flags
>>>>> metadata
>>>>>      BTRFS info (device xvdb): found 12236 extents, stage: move data
>>>>> extents
>>>>>      BTRFS info (device xvdb): relocating block group 71928119296 flags
>>>>> data
>>>>>      BTRFS info (device xvdb): found 3 extents, stage: move data extents
>>>>>      BTRFS info (device xvdb): found 3 extents, stage: update data
>>>>> pointers
>>>>>      BTRFS info (device xvdb): relocating block group 60922265600 flags
>>>>> metadata
>>>>>      BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4
>>>>> unknown
>>>>>      BTRFS info (device xvdb): forced readonly
>>>>>      BTRFS info (device xvdb): balance: ended with status: -4
>>>>>
>>>>> [CAUSE]
>>>>> This is caused by the fact that btrfs ticketing space system can be
>>>>> interrupted, and cause all kind of -EINTR returned to various critical
>>>>> section, where we never thought of -EINTR at all.
>>>>>
>>>>> Even for things like btrfs_start_transaction() can be affected by
>>>>> signal:
>>>>>     btrfs_start_transaction()
>>>>>     |- start_transaction(flush = FLUSH_ALL)
>>>>>        |- btrfs_block_rsv_add()
>>>>>           |- btrfs_reserve_metadata_bytes()
>>>>>              |- __reserve_metadata_bytes()
>>>>>                 |- handle_reserve_ticket()
>>>>>                    |- wait_reserve_ticket()
>>>>>                       |- prepare_to_wait_event(TASK_KILLABLE)
>>>>>                       |- ticket->error = -EINTR;
>>>>>
>>>>> And all related callers get -EINTR error.
>>>>>
>>>>> In fact, there are really very limited call sites can really handle
>>>>> that
>>>>> -EINTR properly, above btrfs_drop_snapshot() is one case.
>>>>>
>>>>> [FIX]
>>>>> Things like metadata allocation is really a critical section for btrfs,
>>>>> we don't really want it to be that killable by some impatient users.
>>>>>
>>>>> In fact, for really long duration calls, it should have their own
>>>>> checks
>>>>> on signal, like balance, reflink, generic fiemap calls.
>>>>>
>>>>> So this patch will make ticket waiting uninterruptible, relying on each
>>>>> long duration calls to handle their signals more properly.
>>>>>
>>>>
>>>> Nope, everybody that calls start_transaction() should be able to handle
>>>> -EINTR, so we need to find those callsites and fix them, not make it so
>>>> we hang the box because we don't like fixing error paths.  Thanks,
>>>
>>> Then we also need to handle btrfs_delalloc_reserve_metadata(),
>>> btrfs_block_rsv_refill(), btrfs_use_block_rsv(), btrfs_block_rsv_add().
>>>
>>
>> This only needs to be handled for FLUSH_ALL and FLUSH_STEAL.  Anybody
>> doing btrfs_start_transaction() should be able to fail with -EINTR,
>> because they should be close to the syscall path.  Balance needs to be
>> fixed to deal with it, and I assume there might be a few other places,
>> but by-in-large none of these places should flip read only.  Thanks,
> 
> There are already ones existing, for btrfs_drop_snapshot().
> 
> This is mostly caused by btrfs_delalloc_reserve_metadata(), which always
> use FLUSH_ALL unless there is a running trans or its space cache inode.
> 
> But still, for a lot of relocation code, we don't really want to bother
> the EINTR and just want uninterruptible at least for now.
> 
> Any idea for that?
> Or just rework how we handle errors in a lot of places?
> 
> It doesn't look correct for such a low level mechanism to return -EINTR
> while most of callers doesn't really want to bother.
> 

That's the point, most callers shouldn't have to, because most callers shouldn't 
be far enough into their operations that -EINTR causes problems.

We should probably just change btrfs_drop_snapshot to use join, and then have it 
do any space reservation it needs outside of the trans handle.  The other option 
is a FLUSH_ALL_UNKILLABLE.  Thanks,

Josef
David Sterba July 7, 2020, 4:16 p.m. UTC | #6
On Mon, Jul 06, 2020 at 10:33:56AM -0400, Josef Bacik wrote:
> On 7/6/20 10:05 AM, Qu Wenruo wrote:
> >> This only needs to be handled for FLUSH_ALL and FLUSH_STEAL.  Anybody
> >> doing btrfs_start_transaction() should be able to fail with -EINTR,
> >> because they should be close to the syscall path.  Balance needs to be
> >> fixed to deal with it, and I assume there might be a few other places,
> >> but by-in-large none of these places should flip read only.  Thanks,
> > 
> > There are already ones existing, for btrfs_drop_snapshot().
> > 
> > This is mostly caused by btrfs_delalloc_reserve_metadata(), which always
> > use FLUSH_ALL unless there is a running trans or its space cache inode.
> > 
> > But still, for a lot of relocation code, we don't really want to bother
> > the EINTR and just want uninterruptible at least for now.
> > 
> > Any idea for that?
> > Or just rework how we handle errors in a lot of places?
> > 
> > It doesn't look correct for such a low level mechanism to return -EINTR
> > while most of callers doesn't really want to bother.
> > 
> 
> That's the point, most callers shouldn't have to, because most callers shouldn't 
> be far enough into their operations that -EINTR causes problems.

Agreed, that's a sane pattern to follow so we should convert the
remaining cases, perhaps also auditing all existing
btrfs_start_transaction calls but after a quick look I haven't spotted
anything else than the reloc and drop snapshot.

> We should probably just change btrfs_drop_snapshot to use join, and then have it 
> do any space reservation it needs outside of the trans handle.  The other option 
> is a FLUSH_ALL_UNKILLABLE.  Thanks,

I'd rather not push the killable semantics to the flushing state machine
and let it up to the caller context to decide.

Patch
diff mbox series

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index c7bd3fdd7792..c5cfc759b804 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1099,7 +1099,8 @@  static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 
 	spin_lock(&space_info->lock);
 	while (ticket->bytes > 0 && ticket->error == 0) {
-		ret = prepare_to_wait_event(&ticket->wait, &wait, TASK_KILLABLE);
+		ret = prepare_to_wait_event(&ticket->wait, &wait,
+					    TASK_UNINTERRUPTIBLE);
 		if (ret) {
 			/*
 			 * Delete us from the list. After we unlock the space