[2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT

Message ID 20200702001434.7745-3-wqu@suse.com
State New
Series
  • btrfs: qgroup: Fix the long existing regression of btrfs/153

Commit Message

Qu Wenruo July 2, 2020, 12:14 a.m. UTC
[PROBLEM]
There is a known problem with how btrfs handles qgroup reserved space.
One of the most obvious cases is the test case btrfs/153, which does a
fallocate, then writes into the preallocated range.

  btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
      --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
      +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
      @@ -1,2 +1,5 @@
       QA output created by 153
      +pwrite: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
       Silence is golden
      ...
      (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)

[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve data space, no matter whether the write is COW or not.

This behavior change was made mostly for performance, and reverting it is
not a good idea anyway.

For a preallocated extent, we have already reserved qgroup data space for
it, and since we also reserve qgroup data space at buffered write time, it
takes twice the space to write into the preallocated range.

This leads to the -EDQUOT in the buffered write routine.

And we can't follow the same solution as the data/meta space check: unlike
that check, qgroup reserved space is shared between data and metadata.
The -EDQUOT can happen at the metadata reservation, so doing a NODATACOW
check after a qgroup reservation failure is not a solution.

[FIX]
To solve the problem, we don't return -EDQUOT directly. Instead, every
time we get an -EDQUOT, we try to flush qgroup space by:
- Flushing all inodes of the root
  NODATACOW writes will free their qgroup reserved space at
  run_delalloc_range(). However we don't have the infrastructure to
  flush only NODATACOW inodes, so here we flush all inodes anyway.

- Waiting for ordered extents
  This converts the preallocated metadata space into per-trans
  metadata, which can be freed in a later transaction commit.

- Committing the transaction
  This frees all per-trans metadata space.

Also we don't want to trigger the flush too racily, so here we introduce
a per-root mutex to ensure that if there is a running qgroup flush, we
wait for it to end instead of starting a re-flush.

Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h   |   1 +
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/qgroup.c  | 112 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 103 insertions(+), 11 deletions(-)

Comments

Josef Bacik July 2, 2020, 1:43 p.m. UTC | #1
On 7/1/20 8:14 PM, Qu Wenruo wrote:
> [...]
> 
> And we can't follow the same solution, unlike data/meta space check,
> qgroup reserved space is shared between data/meta.
> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
> check after qgroup reservation failure is not a solution.

Why not?  I get that we don't know for sure how we failed, but in the case of a 
write we're way more likely to have failed for data reasons, right?  So why not 
just fall back to the NODATACOW check and then do the metadata reservation? 
Then if it fails again you know it's a real EDQUOT and you're done.

Or if you want to get super fancy you could even break up the metadata and data 
reservations here so that we only fall through to the NODATACOW check if we fail 
the data reservation.  Thanks,

Josef
Qu Wenruo July 2, 2020, 1:54 p.m. UTC | #2
On 2020/7/2 下午9:43, Josef Bacik wrote:
> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>> [...]
>>
>> And we can't follow the same solution, unlike data/meta space check,
>> qgroup reserved space is shared between data/meta.
>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>> check after qgroup reservation failure is not a solution.
> 
> Why not?  I get that we don't know for sure how we failed, but in the
> case of a write we're way more likely to have failed for data reasons
> right?

Nope, mostly we fail at the metadata reservation, as that is what returns
EDQUOT to user space.

We may have some cases which get EDQUOT at the data reservation part, but
that's what we expected (and already what we're doing).

The problem is when the metadata reservation fails with EDQUOT.

>  So why not just fall back to the NODATACOW check and then do the
> metadata reservation. Then if it fails again you know its a real EDQUOT
> and your done.
> 
> Or if you want to get super fancy you could even break up the metadata
> and data reservations here so that we only fall through to the NODATACOW
> check if we fail the data reservation.  Thanks,

The problem is, qgroup doesn't split metadata and data (yet).
Currently data and metadata share the same limit.

So when we hit EDQUOT, there is no guarantee it happened only in the
qgroup data reservation.

Thanks,
Qu
> 
> Josef
Josef Bacik July 2, 2020, 1:57 p.m. UTC | #3
On 7/2/20 9:54 AM, Qu Wenruo wrote:
> 
> 
> On 2020/7/2 下午9:43, Josef Bacik wrote:
>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>> [...]
>>>
>>> And we can't follow the same solution, unlike data/meta space check,
>>> qgroup reserved space is shared between data/meta.
>>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>>> check after qgroup reservation failure is not a solution.
>>
>> Why not?  I get that we don't know for sure how we failed, but in the
>> case of a write we're way more likely to have failed for data reasons
>> right?
> 
> Nope, mostly we failed at metadata reservation, as that would return
> EDQUOT to user space.
> 
> We may have some cases which get EDQUOT at data reservation part, but
> that's what we excepted.
> (And already what we're doing)
> 
> The problem is when the metadata reservation failed with EDQUOT.
> 
>>    So why not just fall back to the NODATACOW check and then do the
>> metadata reservation. Then if it fails again you know its a real EDQUOT
>> and your done.
>>
>> Or if you want to get super fancy you could even break up the metadata
>> and data reservations here so that we only fall through to the NODATACOW
>> check if we fail the data reservation.  Thanks,
> 
> The problem is, qgroup doesn't split metadata and data (yet).
> Currently data and meta shares the same limit.
> 
> So when we hit EDQUOT, you have no guarantee it would happen only in
> qgroup data reservation.
> 

Sure, but if you are able to do the nocow thing, then presumably your quota 
reservation is less now?  So on failure you go do the complicated nocow check, 
and if it succeeds you retry your quota reservation with just the metadata 
portion, right?  Thanks,

Josef
Qu Wenruo July 2, 2020, 2:19 p.m. UTC | #4
On 2020/7/2 下午9:57, Josef Bacik wrote:
> On 7/2/20 9:54 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/2 下午9:43, Josef Bacik wrote:
>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>> [...]
>>>>
>>>> And we can't follow the same solution, unlike data/meta space check,
>>>> qgroup reserved space is shared between data/meta.
>>>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>>>> check after qgroup reservation failure is not a solution.
>>>
>>> Why not?  I get that we don't know for sure how we failed, but in the
>>> case of a write we're way more likely to have failed for data reasons
>>> right?
>>
>> Nope, mostly we failed at metadata reservation, as that would return
>> EDQUOT to user space.
>>
>> We may have some cases which get EDQUOT at data reservation part, but
>> that's what we excepted.
>> (And already what we're doing)
>>
>> The problem is when the metadata reservation failed with EDQUOT.
>>
>>>    So why not just fall back to the NODATACOW check and then do the
>>> metadata reservation. Then if it fails again you know its a real EDQUOT
>>> and your done.
>>>
>>> Or if you want to get super fancy you could even break up the metadata
>>> and data reservations here so that we only fall through to the NODATACOW
>>> check if we fail the data reservation.  Thanks,
>>
>> The problem is, qgroup doesn't split metadata and data (yet).
>> Currently data and meta shares the same limit.
>>
>> So when we hit EDQUOT, you have no guarantee it would happen only in
>> qgroup data reservation.
>>
> 
> Sure, but if you are able to do the nocow thing, then presumably your
> quota reservation is less now?  So on failure you go do the complicated
> nocow check, and if it succeeds you retry your quota reservation with
> just the metadata portion, right?  Thanks,

Then the metadata portion can still fail, even if we skipped the data
reservation.

The metadata portion still needs some space, while the data reservation
skip only happens after we're already near the qgroup limit, which means
there is already not much space left.

Consider this case: we have a 128M limit, we have fallocated 120M, and we
have dirtied 7M of data, plus several KiB of reserved metadata.

Then within the next 1M, we run out of qgroup limit, at whatever position.
Even if we skip the current 4K data reservation, the next metadata
reservation may still not be met, and we still get EDQUOT at the metadata
reserve.

Or some other open() call to create a new file would just get EDQUOT,
without any way to free any extra space.


Instead of trying to skip just several 4K qgroup data reservations, we
should flush the existing 7M, to free at least 7M of data plus several
KiB of metadata space.

Thanks,
Qu

> 
> Josef
Josef Bacik July 2, 2020, 2:58 p.m. UTC | #5
On 7/2/20 10:19 AM, Qu Wenruo wrote:
> 
> 
> On 2020/7/2 下午9:57, Josef Bacik wrote:
>> On 7/2/20 9:54 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/2 下午9:43, Josef Bacik wrote:
>>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>>> [...]
>>>>>
>>>>> And we can't follow the same solution, unlike data/meta space check,
>>>>> qgroup reserved space is shared between data/meta.
>>>>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>>>>> check after qgroup reservation failure is not a solution.
>>>>
>>>> Why not?  I get that we don't know for sure how we failed, but in the
>>>> case of a write we're way more likely to have failed for data reasons
>>>> right?
>>>
>>> Nope, mostly we failed at metadata reservation, as that would return
>>> EDQUOT to user space.
>>>
>>> We may have some cases which get EDQUOT at data reservation part, but
>>> that's what we excepted.
>>> (And already what we're doing)
>>>
>>> The problem is when the metadata reservation failed with EDQUOT.
>>>
>>>>     So why not just fall back to the NODATACOW check and then do the
>>>> metadata reservation. Then if it fails again you know its a real EDQUOT
>>>> and your done.
>>>>
>>>> Or if you want to get super fancy you could even break up the metadata
>>>> and data reservations here so that we only fall through to the NODATACOW
>>>> check if we fail the data reservation.  Thanks,
>>>
>>> The problem is, qgroup doesn't split metadata and data (yet).
>>> Currently data and meta shares the same limit.
>>>
>>> So when we hit EDQUOT, you have no guarantee it would happen only in
>>> qgroup data reservation.
>>>
>>
>> Sure, but if you are able to do the nocow thing, then presumably your
>> quota reservation is less now?  So on failure you go do the complicated
>> nocow check, and if it succeeds you retry your quota reservation with
>> just the metadata portion, right?  Thanks,
> 
> Then metadata portion can still fail, even we skipped the data reserv.
> 
> The metadata portion still needs some space, while the data rsv skip
> only happens after we're already near the qgroup limit, which means
> there are ready not much space left.
> 
> Consider this case, we have 128M limit, we fallocated 120M, then we have
> dirtied 7M data, plus several kilo for metadata reserved.
> 
> Then at the next 1M, we run out of qgroup limit, at whatever position.
> Even we skip current 4K for data, the next metadata reserve may still
> not be met, and still got EDQUOT at metadata reserve.
> 
> Or some other open() calls to create a new file would just get EDQUOT,
> without any way to free any extra space.
> 
> 
> Instead of try to skip just several 4K for qgroup data rsv, we should
> flush the existing 7M, to free at least 7M data + several kilo meta space.
> 

Right, so I'm not against flushing in general; I just think that we can greatly 
improve on this particular problem without flushing.  Changing how we do the 
NOCOW check with quota could be faster than doing the flushing.

Now as for the flushing part itself, I'd rather hook into the existing flushing 
infrastructure we have.  Obviously the ticketing is going to be different, but 
the flushing part is still the same, and with data reservations now moved over 
to that infrastructure we finally have it all in the same place.  Thanks,

Josef
Qu Wenruo July 2, 2020, 11:36 p.m. UTC | #6
On 2020/7/2 下午10:58, Josef Bacik wrote:
> On 7/2/20 10:19 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/2 下午9:57, Josef Bacik wrote:
>>> On 7/2/20 9:54 AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2020/7/2 下午9:43, Josef Bacik wrote:
>>>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>>>> [...]
>>>>>>
>>>>>> And we can't follow the same solution, unlike data/meta space check,
>>>>>> qgroup reserved space is shared between data/meta.
>>>>>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>>>>>> check after qgroup reservation failure is not a solution.
>>>>>
>>>>> Why not?  I get that we don't know for sure how we failed, but in the
>>>>> case of a write we're way more likely to have failed for data reasons
>>>>> right?
>>>>
>>>> Nope, mostly we failed at metadata reservation, as that would return
>>>> EDQUOT to user space.
>>>>
>>>> We may have some cases which get EDQUOT at data reservation part, but
>>>> that's what we excepted.
>>>> (And already what we're doing)
>>>>
>>>> The problem is when the metadata reservation failed with EDQUOT.
>>>>
>>>>>     So why not just fall back to the NODATACOW check and then do the
>>>>> metadata reservation. Then if it fails again you know its a real
>>>>> EDQUOT
>>>>> and your done.
>>>>>
>>>>> Or if you want to get super fancy you could even break up the metadata
>>>>> and data reservations here so that we only fall through to the
>>>>> NODATACOW
>>>>> check if we fail the data reservation.  Thanks,
>>>>
>>>> The problem is, qgroup doesn't split metadata and data (yet).
>>>> Currently data and meta shares the same limit.
>>>>
>>>> So when we hit EDQUOT, you have no guarantee it would happen only in
>>>> qgroup data reservation.
>>>>
>>>
>>> Sure, but if you are able to do the nocow thing, then presumably your
>>> quota reservation is less now?  So on failure you go do the complicated
>>> nocow check, and if it succeeds you retry your quota reservation with
>>> just the metadata portion, right?  Thanks,
>>
>> Then metadata portion can still fail, even we skipped the data reserv.
>>
>> The metadata portion still needs some space, while the data rsv skip
>> only happens after we're already near the qgroup limit, which means
>> there are ready not much space left.
>>
>> Consider this case, we have 128M limit, we fallocated 120M, then we have
>> dirtied 7M data, plus several kilo for metadata reserved.
>>
>> Then at the next 1M, we run out of qgroup limit, at whatever position.
>> Even we skip current 4K for data, the next metadata reserve may still
>> not be met, and still got EDQUOT at metadata reserve.
>>
>> Or some other open() calls to create a new file would just get EDQUOT,
>> without any way to free any extra space.
>>
>>
>> Instead of try to skip just several 4K for qgroup data rsv, we should
>> flush the existing 7M, to free at least 7M data + several kilo meta
>> space.
>>
> 
> Right so I'm not against flushing in general, I just think that we can
> greatly improve on this particular problem without flushing.  Changing
> how we do the NOCOW check with quota could be faster than doing the
> flushing.

Yep, but as mentioned, the uncertain timing of when we get the EDQUOT is
really annoying and tricky to solve, thus we have to go with the flushing
method.

The performance is definitely slower, but that's acceptable: since we're
near the limit, slowing down is pretty common.

> 
> Now as for the flushing part itself, I'd rather hook into the existing
> flushing infrastructure we have.

That's the ultimate objective.

>  Obviously the ticketing is going to be
> different, but the flushing part is still the same, and with data
> reservations now moved over to that infrastructure we finally have it
> all in the same place.  Thanks,

Before the needed infrastructure gets merged, I'll keep the current small
retry code and look into what's needed to integrate the qgroup
reservations into the ticketing system.

Thanks,
Qu

> 
> Josef

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4dd478b4fe3a..891f47c7891f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1162,6 +1162,7 @@  struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
+	struct mutex qgroup_flushing_mutex;
 
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7c07578866f3..ca2c7d39e9f7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1116,6 +1116,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
+	mutex_init(&root->qgroup_flushing_mutex);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
 	init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5cfd2bcc55ef..24b4aed57249 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3508,20 +3508,62 @@  static void qgroup_revert(struct btrfs_inode *inode,
 }
 
 /*
- * Reserve qgroup space for range [start, start + len).
+ * Try to free some space for qgroup.
  *
- * This function will either reserve space from related qgroups or doing
- * nothing if the range is already reserved.
+ * For qgroup, there are only 3 ways to free qgroup space:
+ * - Flush nodatacow write
+ *   Any nodatacow write will free its reserved data space at
+ *   run_delalloc_range().
+ *   In theory, we should only flush nodatacow inodes, but it's
+ *   not yet possible, so we need to flush the whole root.
  *
- * Return 0 for successful reserve
- * Return <0 for error (including -EQUOT)
+ * - Wait for ordered extents
+ *   When ordered extents are finished, their reserved metadata
+ *   is finally converted to per_trans status, which can be freed
+ *   by a later transaction commit.
  *
- * NOTE: this function may sleep for memory allocation.
- *       if btrfs_qgroup_reserve_data() is called multiple times with
- *       same @reserved, caller must ensure when error happens it's OK
- *       to free *ALL* reserved space.
+ * - Commit transaction
+ *   This would free the meta_per_trans space.
+ *   In theory this shouldn't free much space, but when we are already
+ *   near the qgroup limit, any extra freed space helps.
  */
-int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+static int try_flush_qgroup(struct btrfs_root *root)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	if (!is_fstree(root->root_key.objectid))
+		return 0;
+
+	/*
+	 * We don't want to run flush again and again, so if there is a running
+	 * one, we won't try to start a new flush, but exit directly.
+	 */
+	ret = mutex_trylock(&root->qgroup_flushing_mutex);
+	if (!ret) {
+		mutex_lock(&root->qgroup_flushing_mutex);
+		mutex_unlock(&root->qgroup_flushing_mutex);
+		return 0;
+	}
+
+	ret = btrfs_start_delalloc_snapshot(root);
+	if (ret < 0)
+		goto unlock;
+	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto unlock;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+unlock:
+	mutex_unlock(&root->qgroup_flushing_mutex);
+	return ret;
+}
+
+static int qgroup_reserve_data(struct btrfs_inode *inode,
 			struct extent_changeset **reserved_ret, u64 start,
 			u64 len)
 {
@@ -3573,6 +3615,37 @@  int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Reserve qgroup space for range [start, start + len).
+ *
+ * This function will either reserve space from related qgroups or doing
+ * nothing if the range is already reserved.
+ *
+ * Return 0 for successful reserve
+ * Return <0 for error (including -EDQUOT)
+ *
+ * NOTE: This function may sleep for memory allocation, dirty page flushing and
+ *	 commit transaction. So caller should not hold any dirty page locked.
+ */
+int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+			struct extent_changeset **reserved_ret, u64 start,
+			u64 len)
+{
+	int ret;
+
+	ret = qgroup_reserve_data(inode, reserved_ret, start, len);
+	if (!ret)
+		return ret;
+
+	if (ret < 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(inode->root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_data(inode, reserved_ret, start, len);
+}
+
 /* Free ranges specified by @reserved, normally in error path */
 static int qgroup_free_reserved_data(struct btrfs_inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len)
@@ -3740,7 +3813,7 @@  static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
 	return num_bytes;
 }
 
-int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 				enum btrfs_qgroup_rsv_type type, bool enforce)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -3767,6 +3840,23 @@  int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 	return ret;
 }
 
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+				enum btrfs_qgroup_rsv_type type, bool enforce)
+{
+	int ret;
+
+	ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
+	if (!ret)
+		return ret;
+	if (ret < 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_meta(root, num_bytes, type, enforce);
+}
+
 void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;