btrfs: root memcgroup for metadata filemap_add_folio()

Message-ID: <b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, akpm@linux-foundation.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Michal Hocko <mhocko@suse.com>, "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Subject: [PATCH] btrfs: root memcgroup for metadata filemap_add_folio()
Date: Sat, 28 Sep 2024 14:15:56 +0930

Qu Wenruo Sept. 28, 2024, 4:45 a.m. UTC

[BACKGROUND]
The function filemap_add_folio() charges the memory cgroup,
as we assume all page caches are accessible by user space processes
and thus need cgroup accounting.

However, btrfs is a special case: it has very large metadata thanks to
its support of data csum (by default 4 bytes per 4K of data, and as
large as 32 bytes per 4K of data).
This means btrfs has to use the page cache for its metadata pages, to
take advantage of both the caching and the reclaim ability of filemap.

This has a small problem: all btrfs metadata pages have to go through
the memcgroup charge, even though those metadata pages are not
accessible from user space, and the charging can introduce some
latency if a memory limit is set.

Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup
charge situation so that metadata pages won't really be limited by
memcgroup.

[ENHANCEMENT]
Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
memory cgroup to attach metadata pages.

This requires exporting the symbol root_mem_cgroup for CONFIG_MEMCG,
and defining root_mem_cgroup as NULL for !CONFIG_MEMCG.

With the root memory cgroup, we skip the charging part entirely, and only
rely on __GFP_NOFAIL for the real memory allocation part.

Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c       | 9 +++++++++
 include/linux/memcontrol.h | 2 ++
 mm/memcontrol.c            | 1 +
 3 files changed, 12 insertions(+)

Shakeel Butt Sept. 30, 2024, 5:23 p.m. UTC | #1

Hi Qu,

On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote:
> [...]

I have a couple of questions:

1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with
oom-kills?

2. What is the normal overhead of this metadata in a real-world
production environment? I see 4 to 32 bytes per 4k, but which is the
most used one, and does it depend on the 4k of data or on something else?

3. Most probably, multiple metadata values are colocated on a single 4k
page of the btrfs page cache even though the corresponding page cache
might be charged to different cgroups. Is that correct?

4. What is stopping us from using a reclaimable slab cache for this
metadata?

thanks,
Shakeel

Qu Wenruo Sept. 30, 2024, 10 p.m. UTC | #2

On 2024/10/1 02:53, Shakeel Butt wrote:
> Hi Qu,
> 
> On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote:
>> [...]
> 
> I have a couple of questions:
> 
> 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with
> oom-kills?

The NOFAIL flag is inherited from the memory allocation for metadata
tree blocks.

Although btrfs already has error handling for all the possible ENOMEMs,
hitting ENOMEM for metadata may still be a big problem, so all my
previous attempts to remove the NOFAIL flag were rejected.

> 
> 2. What the normal overhead of these metadata in real world production
> environment? I see 4 to 32 bytes per 4k but what's the most used one and
> does it depend on the data of 4k or something else?

What did you mean by the "overhead" part? Did you mean the checksum?

If so, there is none, because btrfs stores the metadata checksum inside
the tree block (and thus in the page cache).
The first 32 bytes of a tree block are always reserved for the metadata
checksum.

The tree block size is set by the mkfs-time option nodesize; it is 16K
by default, and that's the most common value.
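
As a rough illustration of the layout described above, here is a small sketch (not btrfs code). The macro and function names are made up for this example; only the 32-byte checksum reservation and the 16K default nodesize come from the text, and a 4K page size is assumed.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; values taken from the text above. */
#define DEMO_TREE_BLOCK_CSUM_SIZE 32u            /* reserved at the start of a tree block */
#define DEMO_DEFAULT_NODESIZE     (16u * 1024u)  /* mkfs default nodesize */
#define DEMO_PAGE_SIZE            4096u          /* assumption: 4K pages */

/* Number of 4K pages needed to back one tree block of the given nodesize. */
size_t demo_pages_per_tree_block(size_t nodesize)
{
	return (nodesize + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}

/* Bytes of a tree block left after the reserved checksum area. */
size_t demo_bytes_after_csum(size_t nodesize)
{
	return nodesize - DEMO_TREE_BLOCK_CSUM_SIZE;
}
```

With the default nodesize this gives 4 pages per tree block, which is why a tree block is "not always a single 4K page".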

> 
> 3. Most probably multiple metadata values are colocated on a single 4k
> page of the btrfs page cache even though the corresponding page cache
> might be charged to different cgroups. Is that correct?

Not always a single 4K page; it depends on the nodesize, which is 16K by
default.

Otherwise yes, the metadata page cache can be charged to different
cgroups, depending on the caller's context.
And we do not want to charge the metadata page cache to the caller's
cgroup, since it's really a shared resource and the caller has no way to
directly access the page cache.

Not charging the metadata page cache will align btrfs more closely with
ext4/xfs, which both use regular page allocation without attaching to a
filemap.

> 
> 4. What is stopping us to use reclaimable slab cache for this metadata?

Josef has tried this before; the attempt failed on the shrinker part,
and partly due to the size.

Btrfs has very large metadata compared to all other filesystems, not only
due to its COW nature and larger tree block size (16K by default), but
also due to the extra data checksum (4 bytes per 4K by default, 32 bytes
per 4K maximum).
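
To put the per-4K checksum figures above into perspective, here is a minimal arithmetic sketch. The function and macro names are hypothetical; only the 4-byte and 32-byte checksum sizes per 4K of data come from the text, and tree overhead (keys, headers, COW copies) is ignored.

```c
#include <stdint.h>

#define DEMO_SECTOR_SIZE 4096ULL /* one checksum covers 4K of data */

/*
 * Bytes of data checksum needed to cover data_bytes of file data,
 * given the checksum size per 4K sector (4 by default, 32 at most).
 */
uint64_t demo_csum_bytes_for(uint64_t data_bytes, uint64_t csum_size)
{
	uint64_t sectors = (data_bytes + DEMO_SECTOR_SIZE - 1) / DEMO_SECTOR_SIZE;

	return sectors * csum_size;
}
```

Under these assumptions, 1 TiB of data needs 1 GiB of raw checksums at 4 bytes per 4K and 8 GiB at 32 bytes per 4K, before counting any other metadata.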

On a real-world system, the metadata itself can easily reach hundreds of
GiBs, so a shrinker is definitely needed.

That is why btrfs has so far used the page cache for its metadata cache.

Thanks,
Qu


Shakeel Butt Oct. 1, 2024, 1:37 a.m. UTC | #3

On Tue, Oct 01, 2024 at 07:30:38AM GMT, Qu Wenruo wrote:
> 
> 
> On 2024/10/1 02:53, Shakeel Butt wrote:
> > Hi Qu,
> > 
> > On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote:
> > > [...]
> > 
> > I have a couple of questions:
> > 
> > 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with
> > oom-kills?
> 
> The NOFAIL flag is inherited from the memory allocation for metadata tree
> blocks.
> 
> Although btrfs has error handling already for all the possible ENOMEMs,
> hitting ENOMEMs for metadata may still be a big problem, thus all my
> previous attempt to remove NOFAIL flag all got rejected.

__GFP_NOFAIL for memcg charging is reasonable in many scenarios. Memcg
oom-killer is enabled for __GFP_NOFAIL and going over limit and getting
oom-killed is totally reasonable. Orthogonal to the discussion though.

> 
> > 
> > 2. What the normal overhead of these metadata in real world production
> > environment? I see 4 to 32 bytes per 4k but what's the most used one and
> > does it depend on the data of 4k or something else?
> 
> What did you mean by the "overhead" part? Did you mean the checksum?
> 

To me this metadata is overhead, so yes: the checksum is overhead, not
actual data stored on the storage.

> If so, there is none, because btrfs store metadata checksum inside the tree
> block (thus the page cache).
> The first 32 bytes of a tree block are always reserved for metadata
> checksum.
> 
> The tree block size depends on the mkfs time option nodesize, is 16K by
> default, and that's the most common value.

Sorry, I am not very familiar with btrfs. What is a tree block?

> 
> > 
> > 3. Most probably multiple metadata values are colocated on a single 4k
> > page of the btrfs page cache even though the corresponding page cache
> > might be charged to different cgroups. Is that correct?
> 
> Not always a single 4K page, it depends on the nodesize, which is 16K by
> default.
> 
> Otherwise yes, the metadata page cache can be charged to different cgroup,
> depending on the caller's context.
> And we do not want to charge the metadata page cache to the caller's cgroup,
> since it's really a shared resource and the caller has no way to directly
> accessing the page cache.
> 
> Not charging the metadata page cache will align btrfs more to the ext4/xfs,
> which all uses regular page allocation without attaching to a filemap.
> 

Can you point me to ext4/xfs code where they are allocating uncharged
memory for their metadata?

> > 
> > 4. What is stopping us to use reclaimable slab cache for this metadata?
> 
> Josef has tried this before, the attempt failed on the shrinker part, and
> partly due to the size.
> 
> Btrfs has very large metadata compared to all other fses, not only due to
> the COW nature and a larger tree block size (16K by default), but also the
> extra data checksum (4 bytes per 4K by default, 32 bytes per 4K maximum).
> 
> On a real world system, the metadata itself can easily go hundreds of GiBs,
> thus a shrinker is definitely needed.

This amount of uncharged memory is concerning: it becomes part of the
system overhead and may reduce the schedulable memory in a datacenter
environment.

Overall the code seems fine, and there is no pushback from me if the
btrfs maintainers are ok with this. I think btrfs should move to a
slab+shrinker based solution for this metadata unless there is a deep
technical reason not to.

thanks,
Shakeel

Qu Wenruo Oct. 1, 2024, 2:03 a.m. UTC | #4

On 2024/10/1 11:07, Shakeel Butt wrote:
> On Tue, Oct 01, 2024 at 07:30:38AM GMT, Qu Wenruo wrote:
[...]
>>
>> Although btrfs has error handling already for all the possible ENOMEMs,
>> hitting ENOMEMs for metadata may still be a big problem, thus all my
>> previous attempt to remove NOFAIL flag all got rejected.
> 
> __GFP_NOFAIL for memcg charging is reasonable in many scenarios. Memcg
> oom-killer is enabled for __GFP_NOFAIL and going over limit and getting
> oom-killed is totally reasonable. Orthogonal to the discussion though.
> 
>>
>>>
>>> 2. What the normal overhead of these metadata in real world production
>>> environment? I see 4 to 32 bytes per 4k but what's the most used one and
>>> does it depend on the data of 4k or something else?
>>
>> What did you mean by the "overhead" part? Did you mean the checksum?
>>
> 
> To me this metadata is overhead, so yes checksum is something not the
> actual data stored on the storage.

Oh, by "metadata" I mean everything that is not data.

It includes all the info like directory layout, file layout, data
checksums and everything else needed to represent a btrfs filesystem.

> 
>> If so, there is none, because btrfs store metadata checksum inside the tree
>> block (thus the page cache).
>> The first 32 bytes of a tree block are always reserved for metadata
>> checksum.
>>
>> The tree block size depends on the mkfs time option nodesize, is 16K by
>> default, and that's the most common value.
> 
> Sorry I am not very familiar with btrfs. What is tree block?

A btrfs tree block is a fixed-size block containing metadata (i.e.
everything other than the data), organized in a B-tree structure.

A tree block can be a node, containing pointers to the next-level
nodes/leaves, or a leaf, containing keys and the extra info bound to
each key.

And btrfs uses the same tree block structure for all different kinds of
info.

E.g. an inode is stored with (<ino> INODE_ITEM 0) as the key, with a
btrfs_inode_item structure as the extra data bound to that key.

And a file extent is stored with (<ino> EXTENT_DATA <file pos>) as the
key, with a btrfs_file_extent_item structure bound to that key.
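
The key triples above can be sketched in user-space C as follows. This is an illustration, not the kernel's definition: `struct demo_key` and `demo_key_cmp()` are hypothetical names, and the ordering shown (objectid, then type, then offset) is assumed to match how btrfs sorts keys. The two type values are the ones from the btrfs on-disk format.

```c
#include <stdint.h>

/* Key type values from the btrfs on-disk format. */
#define BTRFS_INODE_ITEM_KEY  1
#define BTRFS_EXTENT_DATA_KEY 108

/* Hypothetical CPU-order view of a key: (objectid, type, offset). */
struct demo_key {
	uint64_t objectid; /* e.g. the inode number */
	uint8_t  type;     /* e.g. INODE_ITEM or EXTENT_DATA */
	uint64_t offset;   /* meaning depends on type, e.g. file position */
};

/* Compare by objectid first, then type, then offset. */
int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

With this ordering, the inode item of a file sorts before that file's extent items, and all items of one inode sit next to each other in the tree.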

> 
>>
>>>
>>> 3. Most probably multiple metadata values are colocated on a single 4k
>>> page of the btrfs page cache even though the corresponding page cache
>>> might be charged to different cgroups. Is that correct?
>>
>> Not always a single 4K page, it depends on the nodesize, which is 16K by
>> default.
>>
>> Otherwise yes, the metadata page cache can be charged to different cgroup,
>> depending on the caller's context.
>> And we do not want to charge the metadata page cache to the caller's cgroup,
>> since it's really a shared resource and the caller has no way to directly
>> accessing the page cache.
>>
>> Not charging the metadata page cache will align btrfs more to the ext4/xfs,
>> which all uses regular page allocation without attaching to a filemap.
>>
> 
> Can you point me to ext4/xfs code where they are allocating uncharged
> memory for their metadata?

For xfs, it's inside fs/xfs/xfs_buf.c.
E.g. xfs_buf_alloc_pages(), which goes with kzalloc() to allocate the
needed pages.

For ext4, it uses buffer heads, which I'm not familiar with at all.
But it looks like the bh folios come from the block device mapping, which
may still be charged to a cgroup.

Thanks,
Qu


Christoph Hellwig Oct. 1, 2024, 9:19 a.m. UTC | #5

On Sat, Sep 28, 2024 at 02:15:56PM +0930, Qu Wenruo wrote:
> [BACKGROUND]
> The function filemap_add_folio() charges the memory cgroup,
> as we assume all page caches are accessible by user space progresses
> thus needs the cgroup accounting.
> 
> However btrfs is a special case, it has a very large metadata thanks to
> its support of data csum (by default it's 4 bytes per 4K data, and can
> be as large as 32 bytes per 4K data).
> This means btrfs has to go page cache for its metadata pages, to take
> advantage of both cache and reclaim ability of filemap.

FYI, in general reclaims for metadata work much better with a shrinker
than through the pagecache, because it can be object based and
prioritized.

> [ENHANCEMENT]
> Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
> memory cgroup to attach metadata pages.
> 
> Although this needs to export the symbol mem_root_cgroup for
> CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG.
> 
> With root memory cgroup, we directly skip the charging part, and only
> rely on __GFP_NOFAIL for the real memory allocation part.

This looks pretty ugly.  What speaks against a version of
filemap_add_folio that doesn't charge the memcg?

Qu Wenruo Oct. 1, 2024, 9:40 a.m. UTC | #6

On 2024/10/1 18:49, Christoph Hellwig wrote:
> On Sat, Sep 28, 2024 at 02:15:56PM +0930, Qu Wenruo wrote:
>> [...]
>
> This looks pretty ugly.  What speaks against a version of
> filemap_add_folio that doesn't charge the memcg?
>

Because so far only one caller has such a requirement.

Furthermore, I believe the folio API prefers not to have too many
different functions doing similar things.

E.g. the new folio interfaces only provide filemap_get_folio(),
filemap_lock_folio(), and the more generic __filemap_get_folio().

Meanwhile there are tons of page-based interfaces: find_get_page(),
find_or_create_page(), find_lock_page(), their flags versions, etc.

Thus I think something like filemap_add_folio_no_memcg_charge() would be
rejected.

Finally, it's not feasible to go with a new GFP flag either.

We already have __GFP_ACCOUNT for memcg charging purposes, but
filemap_add_folio() still charges the memcg even if we do not pass
__GFP_ACCOUNT.

It would be even uglier to add a __GFP_NO_ACCOUNT, and IIRC such an
attempt has already been rejected.

Thanks,
Qu

Christoph Hellwig Oct. 2, 2024, 7:41 a.m. UTC | #7

On Tue, Oct 01, 2024 at 07:10:07PM +0930, Qu Wenruo wrote:
> > This looks pretty ugly.  What speaks against a version of
> > filemap_add_folio that doesn't charge the memcg?
> > 
> 
> Because there is so far only one caller has such requirement.

That is a good argument to review the reasons for an interface, but
not a killer argument.

> Furthermore I believe the folio API doesn't prefer too many different
> functions doing similar things.
> 
> E.g. the new folio interfaces only provides filemap_get_folio(),
> filemap_lock_folio(), and the more generic __filemap_get_folio().
> 
> Meanwhile there are tons of page based interfaces, find_get_page(),
> find_or_create_page(), find_lock_page() and flags version etc.

That's a totally different argument, though.  Those functions were
trivial wrappers around a more versatile low-level function.

This, in contrast, is about adding clearly defined functionality and,
more importantly, not exporting totally random low-level data.

What I'd propose is something like the patch below, plus proper
documentation.  Note that this now does the uncharge on the unlocked
folio in the error case.  From a quick look that should be fine, but
someone who actually knows the code needs to confirm that.

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301c6..70da62cf32f6c3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1284,6 +1284,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		pgoff_t index, gfp_t gfp);
 int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 		pgoff_t index, gfp_t gfp);
+int filemap_add_folio_nocharge(struct address_space *mapping,
+		struct folio *folio, pgoff_t index, gfp_t gfp);
 void filemap_remove_folio(struct folio *folio);
 void __filemap_remove_folio(struct folio *folio, void *shadow);
 void replace_page_cache_folio(struct folio *old, struct folio *new);
diff --git a/mm/filemap.c b/mm/filemap.c
index 36d22968be9a1e..0a1ae841e8c10f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -958,20 +958,15 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 }
 ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
 
-int filemap_add_folio(struct address_space *mapping, struct folio *folio,
-				pgoff_t index, gfp_t gfp)
+int filemap_add_folio_nocharge(struct address_space *mapping,
+		struct folio *folio, pgoff_t index, gfp_t gfp)
 {
 	void *shadow = NULL;
 	int ret;
 
-	ret = mem_cgroup_charge(folio, NULL, gfp);
-	if (ret)
-		return ret;
-
 	__folio_set_locked(folio);
 	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
 	if (unlikely(ret)) {
-		mem_cgroup_uncharge(folio);
 		__folio_clear_locked(folio);
 	} else {
 		/*
@@ -989,6 +984,22 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 	}
 	return ret;
 }
+EXPORT_SYMBOL_GPL(filemap_add_folio_nocharge);
+
+int filemap_add_folio(struct address_space *mapping, struct folio *folio,
+		pgoff_t index, gfp_t gfp)
+{
+	int ret;
+
+	ret = mem_cgroup_charge(folio, NULL, gfp);
+	if (ret)
+		return ret;
+
+	ret = filemap_add_folio_nocharge(mapping, folio, index, gfp);
+	if (ret)
+		mem_cgroup_uncharge(folio);
+	return ret;
+}
 EXPORT_SYMBOL_GPL(filemap_add_folio);
 
 #ifdef CONFIG_NUMA
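
The control flow of the split in the patch above can be modeled in user space as a rough sketch. Everything here is a toy stand-in and not kernel API: `struct demo_folio`, the `demo_*` functions and the `demo_insert_should_fail` knob are hypothetical, and the charge step is assumed to always succeed. The point is only to show that the charging wrapper undoes the charge after the inner function has already unlocked the folio on error.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a folio: just the three bits of state we track. */
struct demo_folio {
	bool locked;
	bool charged;
	bool in_cache;
};

/* Knob to force the page cache insertion to fail (error path). */
bool demo_insert_should_fail;

/* Stand-ins for the memcg charge/uncharge steps. */
static int demo_charge(struct demo_folio *folio)
{
	folio->charged = true;
	return 0;
}

static void demo_uncharge(struct demo_folio *folio)
{
	folio->charged = false;
}

/* Models the _nocharge variant: lock, insert, unlock on failure. */
int demo_add_folio_nocharge(struct demo_folio *folio)
{
	folio->locked = true;
	if (demo_insert_should_fail) {
		folio->locked = false;
		return -ENOMEM;
	}
	folio->in_cache = true;
	return 0;
}

/* Models the charging wrapper built on top of the _nocharge variant. */
int demo_add_folio(struct demo_folio *folio)
{
	int ret = demo_charge(folio);

	if (ret)
		return ret;

	ret = demo_add_folio_nocharge(folio);
	if (ret)
		demo_uncharge(folio); /* folio is already unlocked here */
	return ret;
}
```

In this model a caller such as btrfs metadata would call the `_nocharge` variant directly, while every other caller keeps the charging behavior unchanged.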

Michal Hocko Oct. 3, 2024, 8:07 a.m. UTC | #8

On Wed 02-10-24 00:41:29, Christoph Hellwig wrote:
[...]
> What I'd propose is something like the patch below, plus proper
> documentation.  Note that this now does the uncharge on the unlocked
> folio in the error case.  From a quick look that should be fine, but
> someone who actually knows the code needs to confirm that.

Yes, this is a much cleaner solution. filemap_add_folio_nocharge() would
need documentation explaining when it is supposed to be used.


Qu Wenruo Oct. 3, 2024, 8:11 a.m. UTC | #9

On 2024/10/2 17:11, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 07:10:07PM +0930, Qu Wenruo wrote:
> [...]
> What I'd propose is something like the patch below, plus proper
> documentation.  Note that this now does the uncharge on the unlocked
> folio in the error case.  From a quick look that should be fine, but
> someone who actually knows the code needs to confirm that.

The interface looks good to me, especially since we completely skip the
charging, which is even better than the current form.

And since Michal is also happy with this idea, I can definitely go down
this path.

Just a little curious: would it be better to introduce a flag for
address_space to indicate whether the folio needs to be charged or not?

Thanks,
Qu

Michal Hocko Oct. 3, 2024, 8:22 a.m. UTC | #10

On Thu 03-10-24 17:41:23, Qu Wenruo wrote:
[...]
> Just a little curious, would it be better to introduce a flag for
> address_space to indicate whether the folio needs to be charged or not?

I would say that an explicit interface seems better because it is easier
to find (grep) and reason about. If you make this an address space
property, then it is really hard to find all the callers.

Qu Wenruo Oct. 3, 2024, 8:23 a.m. UTC | #11

On 2024/10/3 17:52, Michal Hocko wrote:
> On Thu 03-10-24 17:41:23, Qu Wenruo wrote:
> [...]
>> Just a little curious, would it be better to introduce a flag for
>> address_space to indicate whether the folio needs to be charged or not?
> 
> I would say that an explicit interface seems better because it is easier
> to find (grep) and reason about. If you make this address space property
> then it is really hard to find all the callers.

Makes sense, thanks a lot for all the help!

Thanks,
Qu
>

Shakeel Butt Oct. 3, 2024, 8:39 p.m. UTC | #12

On Thu, Oct 03, 2024 at 10:07:33AM GMT, Michal Hocko wrote:
> On Wed 02-10-24 00:41:29, Christoph Hellwig wrote:
> [...]
> > What I'd propose is something like the patch below, plus proper
> > documentation.  Note that this now does the uncharge on the unlocked
> > folio in the error case.  From a quick look that should be fine, but
> > someone who actually knows the code needs to confirm that.
> 
> yes, this is a much cleaner solution. filemap_add_folio_nocharge would
> need documentation explaining when this is supposed to be used.
> 

I feel like we should not make bypassing cgroup accounting easier but
rather make it more awkward :P, so folks give much more thought before
opting to do so. Though I agree filemap_add_folio_nocharge() is easy to
grep.

Johannes Weiner Oct. 3, 2024, 8:58 p.m. UTC | #13

On Wed, Oct 02, 2024 at 12:41:29AM -0700, Christoph Hellwig wrote:
> > > This looks pretty ugly.  What speaks against a version of
> > > filemap_add_folio that doesn't charge the memcg?
> 
> What I'd propose is something like the patch below, plus proper
> documentation.

I like this much better as well.

> Note that this now does the uncharge on the unlocked folio in the
> error case.  From a quick look that should be fine, but someone who
> actually knows the code needs to confirm that.

That's fine. For the same reason the non-atomic __folio_clear_locked()
is fine in that case. The folio just has to be exclusive.
