[0/2] mm: skip memcg for certain address space

Message ID cover.1720572937.git.wqu@suse.com (mailing list archive)

Message

Qu Wenruo July 10, 2024, 1:07 a.m. UTC
Recently I have been hitting a soft lockup when adding an order-2 folio
to a filemap using GFP_NOFS | __GFP_NOFAIL. The soft lockup happens in
the memcg charge code, and I guess that's exactly what __GFP_NOFAIL is
expected to do: wait indefinitely until the request can be met.

On the other hand, if we do not use __GFP_NOFAIL, we can be limited by
memcg at a lot of critical locations, leading to unnecessary transaction
aborts just due to the memcg limit.

However for that specific btrfs call site there is really no need to
charge the memcg, as that address space belongs to the btree inode,
which is not accessible to any end user, and that btree inode is a
shared pool for all the metadata of a btrfs filesystem.

So this patchset introduces a new address space flag, AS_NO_MEMCG, so
that folios added to that address space will not trigger any memcg
charge.
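
The idea is roughly as follows (the helper and flag check below are only
an illustration; the real patches may spell them differently):

	/* fs/btrfs/disk-io.c: opt the btree inode out of memcg charging. */
	mapping_set_no_memcg(fs_info->btree_inode->i_mapping);

	/*
	 * mm/filemap.c, __filemap_add_folio(): skip the charge for such
	 * mappings instead of calling mem_cgroup_charge() unconditionally.
	 */
	if (!test_bit(AS_NO_MEMCG, &mapping->flags)) {
		int error = mem_cgroup_charge(folio, NULL, gfp);

		if (error)
			return error;
	}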

This would be the basis for future btrfs changes, like removing
__GFP_NOFAIL completely and supporting larger metadata folios.

Qu Wenruo (2):
  mm: make lru_gen_eviction() to handle folios without memcg info
  mm: allow certain address space to be not accounted by memcg

 fs/btrfs/disk-io.c      |  1 +
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 12 +++++++++---
 mm/workingset.c         |  2 +-
 4 files changed, 12 insertions(+), 4 deletions(-)

Comments

Qu Wenruo July 17, 2024, 7:42 a.m. UTC | #1
Ping?

Any feedback?

I guess in this case btrfs is really the only one that can benefit from
this feature?

Thanks,
Qu

On 2024/7/10 10:37, Qu Wenruo wrote:
> Recently I'm hitting soft lockup if adding an order 2 folio to a
> filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
> charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
> do, wait indefinitely until the request can be met.
> 
> On the other hand, if we do not use __GFP_NOFAIL, we can be limited by
> memcg at a lot of critical location, and lead to unnecessary transaction
> abort just due to memcg limit.
> 
> However for that specific btrfs call site, there is really no need charge
> the memcg, as that address space belongs to btree inode, which is not
> accessible to any end user, and that btree inode is a shared pool for
> all metadata of a btrfs.
> 
> So this patchset introduces a new address space flag, AS_NO_MEMCG, so
> that folios added to that address space will not trigger any memcg
> charge.
> 
> This would be the basis for future btrfs changes, like removing
> __GFP_NOFAIL completely and larger metadata folios.
> 
> Qu Wenruo (2):
>    mm: make lru_gen_eviction() to handle folios without memcg info
>    mm: allow certain address space to be not accounted by memcg
> 
>   fs/btrfs/disk-io.c      |  1 +
>   include/linux/pagemap.h |  1 +
>   mm/filemap.c            | 12 +++++++++---
>   mm/workingset.c         |  2 +-
>   4 files changed, 12 insertions(+), 4 deletions(-)
>
Vlastimil Babka (SUSE) July 17, 2024, 3:55 p.m. UTC | #2
Hi,

you should have Ccd people according to get_maintainers script to get a
reply faster. Let me Cc the MEMCG section.

On 7/10/24 3:07 AM, Qu Wenruo wrote:
> Recently I'm hitting soft lockup if adding an order 2 folio to a
> filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
> charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
> do, wait indefinitely until the request can be met.

Seems like a bug to me, as the charging of __GFP_NOFAIL in
try_charge_memcg() should proceed to the force: part AFAICS and just go over
the limit.

I was suspecting mem_cgroup_oom() a bit earlier returning true, causing
the retry loop, due to GFP_NOFS. But it seems out_of_memory() should be
specifically proceeding for GFP_NOFS if it's a memcg OOM. But I might be
missing something else. Anyway we should know what exactly is going on
first.
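
For reference, the tail of try_charge_memcg() is roughly this
(paraphrased from memory, not the exact upstream code):

	nomem:
		if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
			return -ENOMEM;
	force:
		/*
		 * The allocation has to be enforced: charge past the limit
		 * instead of failing (or looping).
		 */
		page_counter_charge(&memcg->memory, nr_pages);
		if (do_memsw_account())
			page_counter_charge(&memcg->memsw, nr_pages);
		return 0;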

> On the other hand, if we do not use __GFP_NOFAIL, we can be limited by
> memcg at a lot of critical location, and lead to unnecessary transaction
> abort just due to memcg limit.
> 
> However for that specific btrfs call site, there is really no need charge
> the memcg, as that address space belongs to btree inode, which is not
> accessible to any end user, and that btree inode is a shared pool for
> all metadata of a btrfs.
> 
> So this patchset introduces a new address space flag, AS_NO_MEMCG, so
> that folios added to that address space will not trigger any memcg
> charge.
> 
> This would be the basis for future btrfs changes, like removing
> __GFP_NOFAIL completely and larger metadata folios.
> 
> Qu Wenruo (2):
>   mm: make lru_gen_eviction() to handle folios without memcg info
>   mm: allow certain address space to be not accounted by memcg
> 
>  fs/btrfs/disk-io.c      |  1 +
>  include/linux/pagemap.h |  1 +
>  mm/filemap.c            | 12 +++++++++---
>  mm/workingset.c         |  2 +-
>  4 files changed, 12 insertions(+), 4 deletions(-)
>
Michal Hocko July 17, 2024, 4:14 p.m. UTC | #3
On Wed 17-07-24 17:55:23, Vlastimil Babka (SUSE) wrote:
> Hi,
> 
> you should have Ccd people according to get_maintainers script to get a
> reply faster. Let me Cc the MEMCG section.
> 
> On 7/10/24 3:07 AM, Qu Wenruo wrote:
> > Recently I'm hitting soft lockup if adding an order 2 folio to a
> > filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
> > charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
> > do, wait indefinitely until the request can be met.
> 
> Seems like a bug to me, as the charging of __GFP_NOFAIL in
> try_charge_memcg() should proceed to the force: part AFAICS and just go over
> the limit.
> 
> I was suspecting mem_cgroup_oom() a bit earlier return true, causing the
> retry loop, due to GFP_NOFS. But it seems out_of_memory() should be
> specifically proceeding for GFP_NOFS if it's memcg oom. But I might be
> missing something else. Anyway we should know what exactly is going first.

Correct. memcg oom code will invoke the memcg OOM killer for NOFS
requests. See out_of_memory 

        /*
         * The OOM killer does not compensate for IO-less reclaim.
         * But mem_cgroup_oom() has to invoke the OOM killer even
         * if it is a GFP_NOFS allocation.
         */
        if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
                return true;

That means that there will be a victim killed, charges reclaimed and
forward progress made. If there is no victim then the charging path will
bail out and overcharge.

Also the reclaim should have cond_rescheds in the reclaim path. If that
is not sufficient it should be fixed rather than worked around.
Qu Wenruo July 17, 2024, 10:38 p.m. UTC | #4
On 2024/7/18 01:44, Michal Hocko wrote:
> On Wed 17-07-24 17:55:23, Vlastimil Babka (SUSE) wrote:
>> Hi,
>>
>> you should have Ccd people according to get_maintainers script to get a
>> reply faster. Let me Cc the MEMCG section.
>>
>> On 7/10/24 3:07 AM, Qu Wenruo wrote:
>>> Recently I'm hitting soft lockup if adding an order 2 folio to a
>>> filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
>>> charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
>>> do, wait indefinitely until the request can be met.
>>
>> Seems like a bug to me, as the charging of __GFP_NOFAIL in
>> try_charge_memcg() should proceed to the force: part AFAICS and just go over
>> the limit.
>>
>> I was suspecting mem_cgroup_oom() a bit earlier return true, causing the
>> retry loop, due to GFP_NOFS. But it seems out_of_memory() should be
>> specifically proceeding for GFP_NOFS if it's memcg oom. But I might be
>> missing something else. Anyway we should know what exactly is going first.
>
> Correct. memcg oom code will invoke the memcg OOM killer for NOFS
> requests. See out_of_memory
>
>          /*
>           * The OOM killer does not compensate for IO-less reclaim.
>           * But mem_cgroup_oom() has to invoke the OOM killer even
>           * if it is a GFP_NOFS allocation.
>           */
>          if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>                  return true;
>
> That means that there will be a victim killed, charges reclaimed and
> forward progress made. If there is no victim then the charging path will
> bail out and overcharge.
>
> Also the reclaim should have cond_rescheds in the reclaim path. If that
> is not sufficient it should be fixed rather than workaround.

Another question is, I only see this hang with a larger folio (order 2
vs the old order 0) when adding to the same address space.

Does the folio order have anything to do with the problem, or does a
higher order just make it more likely?


And finally, even without the hang problem, does it make any sense to
skip all the possible memcg charges completely, either to reduce latency
or just to reduce GFP_NOFAIL usage, for those user-inaccessible inodes?

Thanks,
Qu
Vlastimil Babka (SUSE) July 18, 2024, 7:17 a.m. UTC | #5
On 7/18/24 12:38 AM, Qu Wenruo wrote:
> On 2024/7/18 01:44, Michal Hocko wrote:
>> On Wed 17-07-24 17:55:23, Vlastimil Babka (SUSE) wrote:
>>> Hi,
>>>
>>> you should have Ccd people according to get_maintainers script to get a
>>> reply faster. Let me Cc the MEMCG section.
>>>
>>> On 7/10/24 3:07 AM, Qu Wenruo wrote:
>>>> Recently I'm hitting soft lockup if adding an order 2 folio to a
>>>> filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
>>>> charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
>>>> do, wait indefinitely until the request can be met.
>>>
>>> Seems like a bug to me, as the charging of __GFP_NOFAIL in
>>> try_charge_memcg() should proceed to the force: part AFAICS and just go over
>>> the limit.
>>>
>>> I was suspecting mem_cgroup_oom() a bit earlier return true, causing the
>>> retry loop, due to GFP_NOFS. But it seems out_of_memory() should be
>>> specifically proceeding for GFP_NOFS if it's memcg oom. But I might be
>>> missing something else. Anyway we should know what exactly is going first.
>>
>> Correct. memcg oom code will invoke the memcg OOM killer for NOFS
>> requests. See out_of_memory
>>
>>          /*
>>           * The OOM killer does not compensate for IO-less reclaim.
>>           * But mem_cgroup_oom() has to invoke the OOM killer even
>>           * if it is a GFP_NOFS allocation.
>>           */
>>          if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>>                  return true;
>>
>> That means that there will be a victim killed, charges reclaimed and
>> forward progress made. If there is no victim then the charging path will
>> bail out and overcharge.
>>
>> Also the reclaim should have cond_rescheds in the reclaim path. If that
>> is not sufficient it should be fixed rather than workaround.
> 
> Another question is, I only see this hang with larger folio (order 2 vs
> the old order 0) when adding to the same address space.
> 
> Does the folio order has anything related to the problem or just a
> higher order makes it more possible?

I didn't spot anything in the memcg charge path that would depend on the
order directly, hm. Also what kernel version was showing these soft lockups?

> And finally, even without the hang problem, does it make any sense to
> skip all the possible memcg charge completely, either to reduce latency
> or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?

Is it common to even use the filemap code for such metadata that can't
really be mapped to userspace? How does it even interact with reclaim: do
the folios become part of the page cache and get scanned by reclaim
together with data that is mapped? How are the lru decisions handled if
there are no references from PTE access bits? Or can they even be
reclaimed, or is reclaim impossible because e.g. other open inodes may be
pinning this metadata?

(sorry if the questions seem noob, I'm not that familiar with the page
cache side of mm)

> Thanks,
> Qu
Michal Hocko July 18, 2024, 7:25 a.m. UTC | #6
On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
> On 7/18/24 12:38 AM, Qu Wenruo wrote:
[...]
> > Does the folio order has anything related to the problem or just a
> > higher order makes it more possible?
> 
> I didn't spot anything in the memcg charge path that would depend on the
> order directly, hm. Also what kernel version was showing these soft lockups?

Correct. Order just defines the number of charges to be reclaimed.
Unlike the page allocator path we do not have any specific requirements
on the memory to be released. 

> > And finally, even without the hang problem, does it make any sense to
> > skip all the possible memcg charge completely, either to reduce latency
> > or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?

Let me just add to the pile of questions. Who does own this memory?
Qu Wenruo July 18, 2024, 7:52 a.m. UTC | #7
On 2024/7/18 16:47, Vlastimil Babka (SUSE) wrote:
> On 7/18/24 12:38 AM, Qu Wenruo wrote:
[...]
>> Another question is, I only see this hang with larger folio (order 2 vs
>> the old order 0) when adding to the same address space.
>>
>> Does the folio order has anything related to the problem or just a
>> higher order makes it more possible?
> 
> I didn't spot anything in the memcg charge path that would depend on the
> order directly, hm. Also what kernel version was showing these soft lockups?

The previous rc kernel. IIRC it's v6.10-rc6.

But that needs extra btrfs patches; otherwise btrfs is still only doing
order-0 allocations, then adding the order-0 folios into the filemap.

The extra patch just directs btrfs to allocate an order-2 folio (matching
the default 16K nodesize), then attaches the folio to the metadata
filemap, with extra code handling corner cases like different folio sizes
etc.

> 
>> And finally, even without the hang problem, does it make any sense to
>> skip all the possible memcg charge completely, either to reduce latency
>> or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
> 
> Is it common to even use the filemap code for such metadata that can't be
> really mapped to userspace?

At least XFS/EXT4 don't use the filemap to handle their metadata. One of
the reasons is that btrfs has pretty large metadata structures, not only
for the regular filesystem things but also for data checksums.

Even using the default CRC32C algo, it's 4 bytes per 4K of data.
Thus things can go crazy pretty easily, and that's the reason why btrfs
is still sticking to the filemap solution.

> How does it even interact with reclaim, do they
> become part of the page cache and are scanned by reclaim together with data
> that is mapped?

Yes, it's handled just like all other filemaps: it's also using the page
cache, and all the lru/scanning things.

The major difference is that we only implement a small subset of the
address space operations:

- write
- release
- invalidate
- migrate
- dirty (debug only, otherwise falls back to filemap_dirty_folio())

Note there are no read operations, as it's btrfs itself triggering the
metadata reads, thus there is no read/readahead.
Thus we're in full control of the page cache, e.g. we determine the
folio size to be added into the filemap.
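
For reference, the btree inode's address space operations in
fs/btrfs/disk-io.c look roughly like this (quoted from memory, so treat
the exact callback names as approximate):

	static const struct address_space_operations btree_aops = {
		.writepages	= btree_writepages,
		.release_folio	= btree_release_folio,
		.invalidate_folio = btree_invalidate_folio,
	#ifdef CONFIG_MIGRATION
		.migrate_folio	= btree_migrate_folio,
	#endif
		.dirty_folio	= btree_dirty_folio,
	};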

The filemap infrastructure provides 2 good functionalities:

- (Page) Cache
   So that we can easily determine if we really need to read from the
   disk, and this can save us a lot of random IOs.

- Reclaiming

And of course the page cache of the metadata inode won't be 
cloned/shared to any user accessible inode.

> How are the lru decisions handled if there are no references
> for PTE access bits. Or can they be even reclaimed, or because there may
> e.g. other open inodes pinning this metadata, the reclaim is impossible?

If I understand it correctly, we have implemented the release_folio()
callback, which does the btrfs metadata checks to determine whether we
can release the current folio, and avoids releasing folios that are
still under IO etc.

> 
> (sorry if the questions seem noob, I'm not that much familiar with the page
> cache side of mm)

No worry at all, I'm also a newbie on the whole mm part.

Thanks,
Qu

> 
>> Thanks,
>> Qu
>
Qu Wenruo July 18, 2024, 7:57 a.m. UTC | #8
On 2024/7/18 16:55, Michal Hocko wrote:
> On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
>> On 7/18/24 12:38 AM, Qu Wenruo wrote:
> [...]
>>> Does the folio order has anything related to the problem or just a
>>> higher order makes it more possible?
>>
>> I didn't spot anything in the memcg charge path that would depend on the
>> order directly, hm. Also what kernel version was showing these soft lockups?
> 
> Correct. Order just defines the number of charges to be reclaimed.
> Unlike the page allocator path we do not have any specific requirements
> on the memory to be released.

So I guess the higher folio order just brings more pressure to trigger 
the problem?

> 
>>> And finally, even without the hang problem, does it make any sense to
>>> skip all the possible memcg charge completely, either to reduce latency
>>> or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
> 
> Let me just add to the pile of questions. Who does own this memory?

A special inode inside btrfs, which we call the btree_inode; it is not
accessible outside of the btrfs module, and its lifespan is the same as
the mounted btrfs filesystem.

The inode is kept in memory (btrfs_fs_info::btree_inode); it's
initialized at the first mount, and released upon the last unmount.

The address_space_operations are special in that they don't implement
any read/readahead.
Only the write/release/invalidate/migrate functions are implemented.

The reads are triggered and handled entirely inside btrfs itself.

Thanks,
Qu
Michal Hocko July 18, 2024, 8:09 a.m. UTC | #9
On Thu 18-07-24 17:27:05, Qu Wenruo wrote:
> 
> 
> On 2024/7/18 16:55, Michal Hocko wrote:
> > On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
> > > On 7/18/24 12:38 AM, Qu Wenruo wrote:
> > [...]
> > > > Does the folio order has anything related to the problem or just a
> > > > higher order makes it more possible?
> > > 
> > > I didn't spot anything in the memcg charge path that would depend on the
> > > order directly, hm. Also what kernel version was showing these soft lockups?
> > 
> > Correct. Order just defines the number of charges to be reclaimed.
> > Unlike the page allocator path we do not have any specific requirements
> > on the memory to be released.
> 
> So I guess the higher folio order just brings more pressure to trigger the
> problem?

It increases the reclaim target (in number of pages to reclaim). That
might contribute but we are cond_resched-ing in shrink_node_memcgs and
also down the path in shrink_lruvec etc. So higher target shouldn't
cause soft lockups unless we have a bug there - e.g. not triggering any
of those paths with empty LRUs and looping somewhere. Not sure about
MGLRU state of things TBH.
 
> > > > And finally, even without the hang problem, does it make any sense to
> > > > skip all the possible memcg charge completely, either to reduce latency
> > > > or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
> > 
> > Let me just add to the pile of questions. Who does own this memory?
> 
> A special inode inside btrfs, we call it btree_inode, which is not
> accessible out of the btrfs module, and its lifespan is the same as the
> mounted btrfs filesystem.

But the memory charge is attributed to the caller unless you tell
otherwise. So if this is really an internal use and you use a shared
infrastructure which expects the current task to be owner of the charged
memory then you need to wrap the initialization into set_active_memcg
scope.
Michal Hocko July 18, 2024, 8:10 a.m. UTC | #10
On Thu 18-07-24 10:09:31, Michal Hocko wrote:
> On Thu 18-07-24 17:27:05, Qu Wenruo wrote:
> > 
> > 
> > On 2024/7/18 16:55, Michal Hocko wrote:
> > > On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
> > > > On 7/18/24 12:38 AM, Qu Wenruo wrote:
> > > [...]
> > > > > Does the folio order has anything related to the problem or just a
> > > > > higher order makes it more possible?
> > > > 
> > > > I didn't spot anything in the memcg charge path that would depend on the
> > > > order directly, hm. Also what kernel version was showing these soft lockups?
> > > 
> > > Correct. Order just defines the number of charges to be reclaimed.
> > > Unlike the page allocator path we do not have any specific requirements
> > > on the memory to be released.
> > 
> > So I guess the higher folio order just brings more pressure to trigger the
> > problem?
> 
> It increases the reclaim target (in number of pages to reclaim). That
> might contribute but we are cond_resched-ing in shrink_node_memcgs and
> also down the path in shrink_lruvec etc. So higher target shouldn't
> cause soft lockups unless we have a bug there - e.g. not triggering any
> of those paths with empty LRUs and looping somewhere. Not sure about
> MGLRU state of things TBH.
>  
> > > > > And finally, even without the hang problem, does it make any sense to
> > > > > skip all the possible memcg charge completely, either to reduce latency
> > > > > or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
> > > 
> > > Let me just add to the pile of questions. Who does own this memory?
> > 
> > A special inode inside btrfs, we call it btree_inode, which is not
> > accessible out of the btrfs module, and its lifespan is the same as the
> > mounted btrfs filesystem.
> 
> But the memory charge is attributed to the caller unless you tell
> otherwise. So if this is really an internal use and you use a shared
> infrastructure which expects the current task to be owner of the charged
> memory then you need to wrap the initialization into set_active_memcg
> scope.

hit send too quickly, meant to finish with
... and use root cgroup.
Vlastimil Babka (SUSE) July 18, 2024, 8:28 a.m. UTC | #11
On 7/18/24 9:52 AM, Qu Wenruo wrote:
> 
> 
> On 2024/7/18 16:47, Vlastimil Babka (SUSE) wrote:
>> On 7/18/24 12:38 AM, Qu Wenruo wrote:
> [...]
>>> Another question is, I only see this hang with larger folio (order 2 vs
>>> the old order 0) when adding to the same address space.
>>>
>>> Does the folio order has anything related to the problem or just a
>>> higher order makes it more possible?
>> 
>> I didn't spot anything in the memcg charge path that would depend on the
>> order directly, hm. Also what kernel version was showing these soft lockups?
> 
> The previous rc kernel. IIRC it's v6.10-rc6.
> 
> But that needs extra btrfs patches, or btrfs are still only doing the 
> order-0 allocation, then add the order-0 folio into the filemap.
> 
> The extra patch just direct btrfs to allocate an order 2 folio (matching 
> the default 16K nodesize), then attach the folio to the metadata filemap.
> 
> With extra coding handling corner cases like different folio sizes etc.

Hm right, but the same code is triggered for high-order folios (at least
for user mappable page cache) today by some filesystems AFAIK, so we
should be seeing such lockups already? The btrfs case might be special in
that it's for the internal node as you explain, but that makes no
difference for filemap_add_folio(), right? Or is it the only user with
GFP_NOFS? Also is that passed as gfp directly or are there some extra
scoped gfp restrictions involved? (memalloc_..._save()).

>> 
>>> And finally, even without the hang problem, does it make any sense to
>>> skip all the possible memcg charge completely, either to reduce latency
>>> or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
>> 
>> Is it common to even use the filemap code for such metadata that can't be
>> really mapped to userspace?
> 
> At least XFS/EXT4 doesn't use filemap to handle their metadata. One of 
> the reason is, btrfs has pretty large metadata structure.
> Not only for the regular filesystem things, but also data checksum.
> 
> Even using the default CRC32C algo, it's 4 bytes per 4K data.
> Thus things can go crazy pretty easily, and that's the reason why btrfs 
> is still sticking to the filemap solution.
> 
>> How does it even interact with reclaim, do they
>> become part of the page cache and are scanned by reclaim together with data
>> that is mapped?
> 
> Yes, it's handled just like all other filemaps, it's also using page 
> cache, and all the lru/scanning things.
> 
> The major difference is, we only implement a small subset of the address 
> operations:
> 
> - write
> - release
> - invalidate
> - migrate
> - dirty (debug only, otherwise falls back to filemap_dirty_folio())
> 
> Note there is no read operations, as it's btrfs itself triggering the 
> metadata read, thus there is no read/readahead.
> Thus we're in the full control of the page cache, e.g. determine the 
> folio size to be added into the filemap.
> 
> The filemap infrastructure provides 2 good functionalities:
> 
> - (Page) Cache
>    So that we can easily determine if we really need to read from the
>    disk, and this can save us a lot of random IOs.
> 
> - Reclaiming
> 
> And of course the page cache of the metadata inode won't be 
> cloned/shared to any user accessible inode.
> 
>> How are the lru decisions handled if there are no references
>> for PTE access bits. Or can they be even reclaimed, or because there may
>> e.g. other open inodes pinning this metadata, the reclaim is impossible?
> 
> If I understand it correctly, we have implemented release_folio() 
> callback, which does the btrfs metadata checks to determine if we can 
> release the current folio, and avoid releasing folios that's still under 
> IO etc.

I see, thanks. Sounds like there might potentially be some suboptimal
handling in that the folio will appear inactive because there are no
references that folio_check_references() can detect, unless there are
some folio_mark_accessed() calls involved (I see some FGP_ACCESSED in
btrfs, so maybe that's fine enough), so reclaim could consider it often,
only to be stopped by release_folio failing.

>> 
>> (sorry if the questions seem noob, I'm not that much familiar with the page
>> cache side of mm)
> 
> No worry at all, I'm also a newbie on the whole mm part.
> 
> Thanks,
> Qu
> 
>> 
>>> Thanks,
>>> Qu
>>
Qu Wenruo July 18, 2024, 8:50 a.m. UTC | #12
On 2024/7/18 17:58, Vlastimil Babka (SUSE) wrote:
> On 7/18/24 9:52 AM, Qu Wenruo wrote:
>>
>>
>> On 2024/7/18 16:47, Vlastimil Babka (SUSE) wrote:
>>> On 7/18/24 12:38 AM, Qu Wenruo wrote:
>> [...]
>>>> Another question is, I only see this hang with larger folio (order 2 vs
>>>> the old order 0) when adding to the same address space.
>>>>
>>>> Does the folio order has anything related to the problem or just a
>>>> higher order makes it more possible?
>>>
>>> I didn't spot anything in the memcg charge path that would depend on the
>>> order directly, hm. Also what kernel version was showing these soft lockups?
>>
>> The previous rc kernel. IIRC it's v6.10-rc6.
>>
>> But that needs extra btrfs patches, or btrfs are still only doing the
>> order-0 allocation, then add the order-0 folio into the filemap.
>>
>> The extra patch just direct btrfs to allocate an order 2 folio (matching
>> the default 16K nodesize), then attach the folio to the metadata filemap.
>>
>> With extra coding handling corner cases like different folio sizes etc.
> 
> Hm right, but the same code is triggered for high-order folios (at least for
> user mappable page cache) today by some filesystems AFAIK, so we should be
> seeing such lockups already? btrfs case might be special that it's for the
> internal node as you explain, but that makes no difference for
> filemap_add_folio(), right? Or is it the only user with GFP_NOFS? Also is
> that passed as gfp directly or are there some extra scoped gfp resctrictions
> involved? (memalloc_..._save()).

I'm not sure about other fses, but for that hang case it's very metadata
heavy, and ALL folios for that btree inode filemap are order 2, since
we're always allocating the order-2 folios using GFP_NOFAIL, and
attaching those folios into the filemap using GFP_NOFAIL too.

Not sure if other fses can have such a situation.

[...]
>> If I understand it correctly, we have implemented release_folio()
>> callback, which does the btrfs metadata checks to determine if we can
>> release the current folio, and avoid releasing folios that's still under
>> IO etc.
> 
> I see, thanks. Sounds like there might be potentially some suboptimal
> handling in that the folio will appear inactive because there's no
> references that folio_check_references() can detect, unless there's some
> folio_mark_accessed() calls involved (I see some FGP_ACCESSED in btrfs so
> maybe that's fine enough) so reclaim could consider it often, only to be
> stopped by release_folio failing.

For the page accessed part, btrfs handles it with a
mark_extent_buffer_accessed() call, which is made every time we try to
grab an extent buffer structure (the structure used to represent a
metadata block inside btrfs).

So the accessed flag part should be fine, I guess?
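
For reference, mark_extent_buffer_accessed() is essentially just the
following (roughly, from memory):

	static void mark_extent_buffer_accessed(struct extent_buffer *eb)
	{
		int num_folios = num_extent_folios(eb);

		check_buffer_tree_ref(eb);
		/* Propagate the access hint to the backing folios. */
		for (int i = 0; i < num_folios; i++)
			folio_mark_accessed(eb->folios[i]);
	}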

Thanks,
Qu
> 
>>>
>>> (sorry if the questions seem noob, I'm not that much familiar with the page
>>> cache side of mm)
>>
>> No worry at all, I'm also a newbie on the whole mm part.
>>
>> Thanks,
>> Qu
>>
>>>
>>>> Thanks,
>>>> Qu
>>>
>
Qu Wenruo July 18, 2024, 8:52 a.m. UTC | #13
On 2024/7/18 17:39, Michal Hocko wrote:
> On Thu 18-07-24 17:27:05, Qu Wenruo wrote:
>>
>>
>> On 2024/7/18 16:55, Michal Hocko wrote:
>>> On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
>>>> On 7/18/24 12:38 AM, Qu Wenruo wrote:
>>> [...]
>>>>> Does the folio order has anything related to the problem or just a
>>>>> higher order makes it more possible?
>>>>
>>>> I didn't spot anything in the memcg charge path that would depend on the
>>>> order directly, hm. Also what kernel version was showing these soft lockups?
>>>
>>> Correct. Order just defines the number of charges to be reclaimed.
>>> Unlike the page allocator path we do not have any specific requirements
>>> on the memory to be released.
>>
>> So I guess the higher folio order just brings more pressure to trigger the
>> problem?
> 
> It increases the reclaim target (in number of pages to reclaim). That
> might contribute but we are cond_resched-ing in shrink_node_memcgs and
> also down the path in shrink_lruvec etc. So higher target shouldn't
> cause soft lockups unless we have a bug there - e.g. not triggering any
> of those paths with empty LRUs and looping somewhere. Not sure about
> MGLRU state of things TBH.
>   
>>>>> And finally, even without the hang problem, does it make any sense to
>>>>> skip all the possible memcg charge completely, either to reduce latency
>>>>> or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
>>>
>>> Let me just add to the pile of questions. Who does own this memory?
>>
>> A special inode inside btrfs, we call it btree_inode, which is not
>> accessible out of the btrfs module, and its lifespan is the same as the
>> mounted btrfs filesystem.
> 
> But the memory charge is attributed to the caller unless you tell
> otherwise.

By the caller, did you mean the user space program that triggered the
filesystem operations?

Then it's too hard to determine. Almost all operations of btrfs involve
its metadata, from the basic read/write even to some endio functions
(delayed into a workqueue, e.g. verifying the data against its csum).


> So if this is really an internal use and you use a shared
> infrastructure which expects the current task to be owner of the charged
> memory then you need to wrap the initialization into set_active_memcg
> scope.
> 

And for the root cgroup I guess it means we will have no memory limits
or whatever, and filemap_add_folio() should always succeed (except real
-ENOMEM situations or the -EEXIST error btrfs would handle)?

Then it looks like a good solution, at least from the perspective of
btrfs.

Thanks,
Qu
Vlastimil Babka (SUSE) July 18, 2024, 9:19 a.m. UTC | #14
On 7/18/24 10:50 AM, Qu Wenruo wrote:
> 
> 
> On 2024/7/18 17:58, Vlastimil Babka (SUSE) wrote:
>> On 7/18/24 9:52 AM, Qu Wenruo wrote:
>>>
>>> The previous rc kernel. IIRC it's v6.10-rc6.
>>>
>>> But that needs extra btrfs patches, or btrfs are still only doing the
>>> order-0 allocation, then add the order-0 folio into the filemap.
>>>
>>> The extra patch just direct btrfs to allocate an order 2 folio (matching
>>> the default 16K nodesize), then attach the folio to the metadata filemap.
>>>
>>> With extra coding handling corner cases like different folio sizes etc.
>> 
>> Hm right, but the same code is triggered for high-order folios (at least for
>> user mappable page cache) today by some filesystems AFAIK, so we should be
>> seeing such lockups already? btrfs case might be special that it's for the
>> internal node as you explain, but that makes no difference for
>> filemap_add_folio(), right? Or is it the only user with GFP_NOFS? Also is
>> that passed as gfp directly or are there some extra scoped gfp resctrictions
>> involved? (memalloc_..._save()).
> 
> I'm not sure about other fses, but for that hang case, it's very 
> metadata heavy, and ALL folios for that btree inode filemap is in order 
> 2, since we're always allocating the order folios using GFP_NOFAIL, and 
> attaching that folio into the filemap using GFP_NOFAIL too.
> 
> Not sure if other fses can have such situation.

Doh right of course, the __GFP_NOFAIL is the special part compared to the
usual page cache usage.

> [...]
>>> If I understand it correctly, we have implemented release_folio()
>>> callback, which does the btrfs metadata checks to determine if we can
>>> release the current folio, and avoid releasing folios that's still under
>>> IO etc.
>> 
>> I see, thanks. Sounds like there might be potentially some suboptimal
>> handling in that the folio will appear inactive because there's no
>> references that folio_check_references() can detect, unless there's some
>> folio_mark_accessed() calls involved (I see some FGP_ACCESSED in btrfs so
>> maybe that's fine enough) so reclaim could consider it often, only to be
>> stopped by release_folio failing.
> 
> For the page accessed part, btrfs handles it by 
> mark_extent_buffer_accessed() call, and it's called every time we try to 
> grab an extent buffer structure (the structure used to represent a 
> metadata block inside btrfs).
> 
> So the accessed flag part should be fine I guess?

Sounds good then, thanks!

> Thanks,
> Qu
>> 
>>>>
>>>> (sorry if the questions seem noob, I'm not that much familiar with the page
>>>> cache side of mm)
>>>
>>> No worry at all, I'm also a newbie on the whole mm part.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>> Thanks,
>>>>> Qu
>>>>
>>
Michal Hocko July 18, 2024, 9:25 a.m. UTC | #15
On Thu 18-07-24 18:22:11, Qu Wenruo wrote:
> 
> 
> On 2024/7/18 17:39, Michal Hocko wrote:
> > On Thu 18-07-24 17:27:05, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2024/7/18 16:55, Michal Hocko wrote:
> > > > On Thu 18-07-24 09:17:42, Vlastimil Babka (SUSE) wrote:
> > > > > On 7/18/24 12:38 AM, Qu Wenruo wrote:
> > > > [...]
> > > > > > Does the folio order has anything related to the problem or just a
> > > > > > higher order makes it more possible?
> > > > > 
> > > > > I didn't spot anything in the memcg charge path that would depend on the
> > > > > order directly, hm. Also what kernel version was showing these soft lockups?
> > > > 
> > > > Correct. Order just defines the number of charges to be reclaimed.
> > > > Unlike the page allocator path we do not have any specific requirements
> > > > on the memory to be released.
> > > 
> > > So I guess the higher folio order just brings more pressure to trigger the
> > > problem?
> > 
> > It increases the reclaim target (in number of pages to reclaim). That
> > might contribute but we are cond_resched-ing in shrink_node_memcgs and
> > also down the path in shrink_lruvec etc. So higher target shouldn't
> > cause soft lockups unless we have a bug there - e.g. not triggering any
> > of those paths with empty LRUs and looping somewhere. Not sure about
> > MGLRU state of things TBH.
> > > > > > And finally, even without the hang problem, does it make any sense to
> > > > > > skip all the possible memcg charge completely, either to reduce latency
> > > > > > or just to reduce GFP_NOFAIL usage, for those user inaccessible inodes?
> > > > 
> > > > Let me just add to the pile of questions. Who does own this memory?
> > > 
> > > A special inode inside btrfs, we call it btree_inode, which is not
> > > accessible out of the btrfs module, and its lifespan is the same as the
> > > mounted btrfs filesystem.
> > 
> > But the memory charge is attributed to the caller unless you tell
> > otherwise.
> 
> By the caller, did you mean the user space program who triggered the
> filesystem operations?

Yes, the current task while these operations are done.

[...]
> > So if this is really an internal use and you use a shared
> > infrastructure which expects the current task to be owner of the charged
> > memory then you need to wrap the initialization into set_active_memcg
> > scope.
> > 
> 
> And for root cgroup I guess it means we will have no memory limits or
> whatever, and filemap_add_folio() should always success (except real -ENOMEM
> situations or -EEXIST error btrfs would handle)?

Yes. try_charge will bypass charging altogether for root cgroup. You
will likely need to ifdef root_mem_cgroup usage by CONFIG_MEMCG.
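
I.e. something along these lines (just a sketch; the btrfs-side helper
is made up for illustration):

	#include <linux/sched/mm.h>	/* set_active_memcg() */
	#include <linux/memcontrol.h>	/* root_mem_cgroup */

	/* Hypothetical wrapper around the btree inode folio insertion. */
	static int btree_add_folio(struct address_space *mapping,
				   struct folio *folio, pgoff_t index,
				   gfp_t gfp)
	{
		int ret;
	#ifdef CONFIG_MEMCG
		/* Charge to the root cgroup, i.e. effectively no charging. */
		struct mem_cgroup *old_memcg = set_active_memcg(root_mem_cgroup);
	#endif

		ret = filemap_add_folio(mapping, folio, index, gfp);

	#ifdef CONFIG_MEMCG
		set_active_memcg(old_memcg);
	#endif
		return ret;
	}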
Qu Wenruo July 25, 2024, 9 a.m. UTC | #16
On 2024/7/18 01:25, Vlastimil Babka (SUSE) wrote:
> Hi,
>
> you should have Ccd people according to get_maintainers script to get a
> reply faster. Let me Cc the MEMCG section.
>
> On 7/10/24 3:07 AM, Qu Wenruo wrote:
>> Recently I'm hitting soft lockup if adding an order 2 folio to a
>> filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
>> charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
>> do, wait indefinitely until the request can be met.
>
> Seems like a bug to me, as the charging of __GFP_NOFAIL in
> try_charge_memcg() should proceed to the force: part AFAICS and just go over
> the limit.

After more reproductions of the bug (thus more logs), it turns out to be
a corner case that is specific to mixed folio sizes, not the memory
cgroup.

We have something like this:

retry:
	/* Try to insert our (possibly large) folio at @index. */
	ret = filemap_add_folio(mapping, folio, index, gfp);
	if (!ret)
		goto out;
	/* Someone else beat us to it, grab the existing folio instead. */
	existing_folio = filemap_lock_folio(mapping, index);
	if (IS_ERR(existing_folio))
		goto retry;

This is causing a dead loop, if we have the following filemap layout:

	|<-  folio range  ->|
	|    |    |////|////|

Where |//| is the range where we have an existing page.

In the above case, filemap_add_folio() will return -EEXIST due to the
two conflicting pages.
Meanwhile filemap_lock_folio() will always return -ENOENT, as at the
folio index there is no page at all.
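
One way to break the loop (just a sketch, not necessarily what the final
btrfs fix will look like) is to treat the -EEXIST plus -ENOENT
combination as a conflict with a differently sized folio and back off
instead of retrying forever:

	existing_folio = filemap_lock_folio(mapping, index);
	if (IS_ERR(existing_folio)) {
		/*
		 * filemap_add_folio() said -EEXIST, yet there is no folio
		 * at @index itself: most likely the conflict is a smaller
		 * folio further inside our range (or it was just removed).
		 * Either way, back off and let the caller retry from
		 * scratch (e.g. with a smaller folio) instead of spinning.
		 */
		if (PTR_ERR(existing_folio) == -ENOENT)
			return -EAGAIN;
		goto retry;
	}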

The symptom looks cgroup related just because we're spending a lot of
time inside cgroup code, but the cause is not cgroup at all.

This is not causing problems for now because the existing code is always
using order-0 folios, thus the above case won't happen.

Once larger folio support is enabled and we're allowing mixed folio
sizes, it will lead to the above problem sooner or later.

I'll still push the opt-out of mem cgroup as an optimization, but since
the root cause is pinned down, I'll no longer include this optimization
in the larger folio enablement.

Thanks for all the help, and sorry for the extra noise.
Qu

>
> I was suspecting mem_cgroup_oom() a bit earlier return true, causing the
> retry loop, due to GFP_NOFS. But it seems out_of_memory() should be
> specifically proceeding for GFP_NOFS if it's memcg oom. But I might be
> missing something else. Anyway we should know what exactly is going first.
>
>> On the other hand, if we do not use __GFP_NOFAIL, we can be limited by
>> memcg at a lot of critical location, and lead to unnecessary transaction
>> abort just due to memcg limit.
>>
>> However for that specific btrfs call site, there is really no need charge
>> the memcg, as that address space belongs to btree inode, which is not
>> accessible to any end user, and that btree inode is a shared pool for
>> all metadata of a btrfs.
>>
>> So this patchset introduces a new address space flag, AS_NO_MEMCG, so
>> that folios added to that address space will not trigger any memcg
>> charge.
>>
>> This would be the basis for future btrfs changes, like removing
>> __GFP_NOFAIL completely and larger metadata folios.
>>
>> Qu Wenruo (2):
>>    mm: make lru_gen_eviction() to handle folios without memcg info
>>    mm: allow certain address space to be not accounted by memcg
>>
>>   fs/btrfs/disk-io.c      |  1 +
>>   include/linux/pagemap.h |  1 +
>>   mm/filemap.c            | 12 +++++++++---
>>   mm/workingset.c         |  2 +-
>>   4 files changed, 12 insertions(+), 4 deletions(-)
>>
>
>