[v7,2/3] btrfs: always uses root memcgroup for filemap_add_folio()

Message ID 6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com (mailing list archive)
State New
Series btrfs: try to allocate larger folios for metadata

Commit Message

Qu Wenruo July 19, 2024, 10:28 a.m. UTC
[BACKGROUND]
The function filemap_add_folio() charges the memory cgroup,
as we assume all page caches are accessible by user space processes
and thus need the cgroup accounting.

However btrfs is a special case: it has a very large amount of metadata,
thanks to its support of data csums (by default 4 bytes per 4K of data,
and as large as 32 bytes per 4K of data).
This means btrfs has to use the page cache for its metadata pages, to
take advantage of both the caching and the reclaim ability of the filemap.

This has a small problem: all btrfs metadata pages have to go through
the memory cgroup charge, even though those metadata pages are not
accessible by user space, and the charging can introduce some latency
if a memory limit is set.

Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this
cgroup charge situation, so that metadata pages won't really be limited
by the memory cgroup.

[ENHANCEMENT]
Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
memory cgroup to attach metadata pages.

With root memory cgroup, we directly skip the charging part, and only
rely on __GFP_NOFAIL for the real memory allocation part.

Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Johannes Weiner July 19, 2024, 5:02 p.m. UTC | #1
On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> [...]
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index aa7f8148cd0d..cfeed7673009 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>  
>  	struct btrfs_fs_info *fs_info = eb->fs_info;
>  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> +	struct mem_cgroup *old_memcg;
>  	const unsigned long index = eb->start >> PAGE_SHIFT;
>  	struct folio *existing_folio = NULL;
>  	int ret;
> @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>  	ASSERT(eb->folios[i]);
>  
>  retry:
> +	/*
> +	 * Btree inode is a btrfs internal inode, and not exposed to any
> +	 * user.
> +	 * Furthermore we do not want any cgroup limits on this inode.
> +	 * So we always use root_mem_cgroup as our active memcg when attaching
> +	 * the folios.
> +	 */
> +	old_memcg = set_active_memcg(root_mem_cgroup);
>  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
>  				GFP_NOFS | __GFP_NOFAIL);
> +	set_active_memcg(old_memcg);

It looks correct. But it's going through the whole dance of setting up
current->active_memcg, having the charge path look it up, css_get(),
calling try_charge() only to bail immediately, css_put(), then updating
current->active_memcg again. All those branches are necessary when we
want to charge to a "real" other cgroup. But in this case, we always
know we're not charging, so it seems uncalled for.

Wouldn't it be a lot simpler (and cheaper) to have a
filemap_add_folio_nocharge()?
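For illustration only, such a helper might look roughly like the
following sketch. The name filemap_add_folio_nocharge() and the exact
factoring are assumptions, not an existing API; it is modeled on
filemap_add_folio() minus the mem_cgroup_charge()/mem_cgroup_uncharge()
calls, with the workingset refault handling omitted for brevity:

```c
/*
 * Hypothetical sketch, not an existing API: add a folio to the page
 * cache without charging any memory cgroup.  Mirrors
 * filemap_add_folio() but skips mem_cgroup_charge() entirely, so no
 * memcg state is touched on this path.
 */
int filemap_add_folio_nocharge(struct address_space *mapping,
			       struct folio *folio, pgoff_t index,
			       gfp_t gfp)
{
	void *shadow = NULL;
	int ret;

	__folio_set_locked(folio);
	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
	if (unlikely(ret))
		__folio_clear_locked(folio);
	else
		folio_add_lru(folio);
	return ret;
}
```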
Roman Gushchin July 19, 2024, 5:25 p.m. UTC | #2
On Fri, Jul 19, 2024 at 01:02:06PM -0400, Johannes Weiner wrote:
> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> > [...]
> 
> It looks correct. But it's going through all dance to set up
> current->active_memcg, then have the charge path look that up,
> css_get(), call try_charge() only to bail immediately, css_put(), then
> update current->active_memcg again. All those branches are necessary
> when we want to charge to a "real" other cgroup. But in this case, we
> always know we're not charging, so it seems uncalled for.
> 
> Wouldn't it be a lot simpler (and cheaper) to have a
> filemap_add_folio_nocharge()?

Time to restore GFP_NOACCOUNT? I think it might be useful for allocating
objects which are shared across the entire system and/or unlikely to go
away under memory pressure.
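For context, a __GFP_NOACCOUNT flag did exist briefly before kmem
accounting became opt-in via __GFP_ACCOUNT. If restored, the idea would
presumably be an early bailout in the charge path, along the lines of
this sketch (the flag and its placement here are assumptions, not
current kernel code):

```c
/*
 * Hypothetical sketch: with a restored __GFP_NOACCOUNT, the page cache
 * insertion could bypass the memcg charge before any memcg state is
 * looked up.  Neither the flag nor this exact placement exists today,
 * and error/locking details of the real insertion path are elided.
 */
int filemap_add_folio(struct address_space *mapping, struct folio *folio,
		      pgoff_t index, gfp_t gfp)
{
	int ret;

	if (!(gfp & __GFP_NOACCOUNT)) {
		ret = mem_cgroup_charge(folio, NULL, gfp);
		if (ret)
			return ret;
	}
	/* ... the rest of the insertion path stays unchanged ... */
	return __filemap_add_folio(mapping, folio, index, gfp, NULL);
}
```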
Michal Hocko July 19, 2024, 6:13 p.m. UTC | #3
On Fri 19-07-24 13:02:06, Johannes Weiner wrote:
> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> > [...]
> >  retry:
> > +	/*
> > +	 * Btree inode is a btrfs internal inode, and not exposed to any
> > +	 * user.
> > +	 * Furthermore we do not want any cgroup limits on this inode.
> > +	 * So we always use root_mem_cgroup as our active memcg when attaching
> > +	 * the folios.
> > +	 */
> > +	old_memcg = set_active_memcg(root_mem_cgroup);
> >  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
> >  				GFP_NOFS | __GFP_NOFAIL);

I thought you've said that NOFAIL was added to work around memcg
charges. Can you remove it when memcg is out of the picture?

It would be great to add some background about how much memory we are
talking about, because this might require memcg configuration in some
setups.

> > +	set_active_memcg(old_memcg);
> 
> It looks correct. But it's going through all dance to set up
> current->active_memcg, then have the charge path look that up,
> css_get(), call try_charge() only to bail immediately, css_put(), then
> update current->active_memcg again. All those branches are necessary
> when we want to charge to a "real" other cgroup. But in this case, we
> always know we're not charging, so it seems uncalled for.
> 
> Wouldn't it be a lot simpler (and cheaper) to have a
> filemap_add_folio_nocharge()?

Yes, that would certainly simplify things. From the previous discussion
I understood that there would be broader scopes which would opt out of
charging. If this is really about a single filemap_add_folio call, then
having a variant that doesn't call mem_cgroup_charge sounds like a much
more viable option, and it also doesn't require any memcg specific
changes.
Qu Wenruo July 19, 2024, 10:11 p.m. UTC | #4
On 2024/7/20 03:43, Michal Hocko wrote:
> On Fri 19-07-24 13:02:06, Johannes Weiner wrote:
>> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
>>> [...]
>
> I thought you've said that NOFAIL was added to work around memcg
> charges. Can you remove it when memcg is out of the picture?

Sure, but that would be a dedicated patch, as we need to add the -ENOMEM
handling.
I already have such a patch before:
https://lore.kernel.org/linux-btrfs/d6a9c038e12f1f2dae353f1ba657ba0666f0aaaa.1720159494.git.wqu@suse.com/

But that was before the memcg change.

I'd prefer to have all the larger folio work fully tested and merged,
then clean up the NOFAIL flags.

>
> It would be great to add some background about how much memory are we
> talking about. Because this might require memcg configuration in some
> setups.
>
>>> +	set_active_memcg(old_memcg);
>>
>> It looks correct. But it's going through all dance to set up
>> current->active_memcg, then have the charge path look that up,
>> css_get(), call try_charge() only to bail immediately, css_put(), then
>> update current->active_memcg again. All those branches are necessary
>> when we want to charge to a "real" other cgroup. But in this case, we
>> always know we're not charging, so it seems uncalled for.
>>
>> Wouldn't it be a lot simpler (and cheaper) to have a
>> filemap_add_folio_nocharge()?
>
> Yes, that would certainly simplify things. From the previous discussion
> I understood that there would be broader scopes which would opt-out from
> charging. If this is really about a single filemap_add_folio call then
> having a variant without doesn't call mem_cgroup_charge sounds like a
> much more viable option and also it doesn't require to make any memcg
> specific changes.
>

I'm not 100% sure if the VFS guys are happy with that.

The current filemap folio interfaces are already much more consolidated
than all the various page based interfaces for different situations.

E.g. we have the following wrappers related to filemap page cache
search/creation:

- find_get_page() and find_get_page_flags()
- find_lock_page()
- find_or_create_page()
- grab_cache_page_nowait()
- grab_cache_page()

Meanwhile there are just two folio interfaces:

- filemap_get_folio()
- __filemap_get_folio()

So according to the trend, I'm pretty sure the VFS people will reject
such a new interface just to skip accounting.

Thus the GFP_NO_ACCOUNT solution looks more feasible.

Thanks,
Qu
Michal Hocko July 22, 2024, 7:34 a.m. UTC | #5
On Sat 20-07-24 07:41:19, Qu Wenruo wrote:
[...]
> So according to the trend, I'm pretty sure VFS people will reject such
> new interface just to skip accounting.

I would just give it a try with your use case described. If this is a
no-go then the root cgroup workaround is still available.

> Thus the GFP_NO_ACCOUNT solution looks more feasible.

So we have GFP_ACCOUNT to opt in to accounting, and now we would be
adding GFP_NO_ACCOUNT to override it? This doesn't sound like a good
use of gfp flags (which we do not have an unlimited number of), and it
is also quite confusing TBH.
Qu Wenruo July 22, 2024, 8:08 a.m. UTC | #6
On 2024/7/22 17:04, Michal Hocko wrote:
> On Sat 20-07-24 07:41:19, Qu Wenruo wrote:
> [...]
>> So according to the trend, I'm pretty sure VFS people will reject such
>> new interface just to skip accounting.
> 
> I would just give it a try with your usecase described. If this is a
> nogo then the root cgroup workaround is still available.

I have submitted a patchset doing exactly that, and thankfully that's
the series where I got all the helpful feedback:

https://lore.kernel.org/linux-btrfs/92dea37a395781ee4d5cf8b16307801ccd8a5700.1720572937.git.wqu@suse.com/

Unfortunately I haven't got any feedback from the VFS guys yet.

> 
>> Thus the GFP_NO_ACCOUNT solution looks more feasible.
> 
> So we have GFP_ACCOUNT to opt in for accounting and now we should be
> adding GFP_NO_ACCOUNT to override it? This doesn't sound like a good use
> of gfp flags (which we do not have infinitely) and it is also quite
> confusing TBH.

The problem is, for filemap_add_folio() we didn't specify GFP_ACCOUNT
(nor did any other caller), but it still does the charge, due to the
mostly-correct assumption that all filemap page caches are accessible
to user space programs.

So one can argue that the cgroup is still charged even if no
GFP_ACCOUNT is specified.

But I get your point, indeed it's not that good an idea to introduce
GFP_NO_ACCOUNT.

Thanks,
Qu
Qu Wenruo July 24, 2024, 3:46 a.m. UTC | #7
On 2024/7/20 03:43, Michal Hocko wrote:
[...]
>
>>> +	set_active_memcg(old_memcg);
>>
>> It looks correct. But it's going through all dance to set up
>> current->active_memcg, then have the charge path look that up,
>> css_get(), call try_charge() only to bail immediately, css_put(), then
>> update current->active_memcg again. All those branches are necessary
>> when we want to charge to a "real" other cgroup. But in this case, we
>> always know we're not charging, so it seems uncalled for.
>>
>> Wouldn't it be a lot simpler (and cheaper) to have a
>> filemap_add_folio_nocharge()?
>
> Yes, that would certainly simplify things. From the previous discussion
> I understood that there would be broader scopes which would opt-out from
> charging. If this is really about a single filemap_add_folio call then
> having a variant without doesn't call mem_cgroup_charge sounds like a
> much more viable option and also it doesn't require to make any memcg
> specific changes.
>

Talking about skipping mem cgroup charging, I still have a question.

[MEMCG AT FOLIO EVICTION TIME]
Even if we completely skip the mem cgroup charging, we cannot really
escape the eviction-time handling.

In fact, if we just skip mem_cgroup_charge(), the kernel would crash
when evicting the folio:
in lru_gen_eviction(), folio_memcg() would just return NULL, and
mem_cgroup_id(memcg) would trigger a NULL pointer dereference.

That's why I sent out a patch fixing that first:
https://lore.kernel.org/linux-mm/e1036b9cc8928be9a7dec150ab3a0317bd7180cf.1720572937.git.wqu@suse.com/

I'm not sure if it's going to cause any extra problems even with the
above fix.

And just for the sake of consistency, it looks more sane to use
root_mem_cgroup for the filemap_add_folio() operation, rather than
leaving it empty, especially since most filemaps still need proper
memcg handling.


[REALLY EXPENSIVE?]
Another question is, are the set_active_memcg() call and the later
handling really that expensive?

set_active_memcg() is small enough to be an inline function, and so are
active_memcg(), css_get() and the root memcg path of try_charge().

The later commit part is not that expensive either, mostly simple
member or per-cpu assignments.
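For reference, set_active_memcg() at the time was roughly the following
inline (paraphrased from include/linux/sched/mm.h; the actual tree is
authoritative), which is just a couple of plain loads and stores:

```c
/* Roughly what set_active_memcg() looked like: a pure save/restore of
 * the task's (or, in interrupt context, the CPU's) active memcg, with
 * no locking and no atomic operations involved. */
static inline struct mem_cgroup *
set_active_memcg(struct mem_cgroup *memcg)
{
	struct mem_cgroup *old;

	if (!in_task()) {
		old = this_cpu_read(int_active_memcg);
		this_cpu_write(int_active_memcg, memcg);
	} else {
		old = current->active_memcg;
		current->active_memcg = memcg;
	}
	return old;
}
```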

According to my very limited knowledge of mem cgroups, most of the
heavy lifting is in the slow path of try_charge_memcg().

Even with all the set_active_memcg() calls, the whole extra overhead
still looks very small.
And it should already be a big win for btrfs to opt out of the regular
charging routine.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa7f8148cd0d..cfeed7673009 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2971,6 +2971,7 @@  static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
+	struct mem_cgroup *old_memcg;
 	const unsigned long index = eb->start >> PAGE_SHIFT;
 	struct folio *existing_folio = NULL;
 	int ret;
@@ -2981,8 +2982,17 @@  static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 	ASSERT(eb->folios[i]);
 
 retry:
+	/*
+	 * Btree inode is a btrfs internal inode, and not exposed to any
+	 * user.
+	 * Furthermore we do not want any cgroup limits on this inode.
+	 * So we always use root_mem_cgroup as our active memcg when attaching
+	 * the folios.
+	 */
+	old_memcg = set_active_memcg(root_mem_cgroup);
 	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
 				GFP_NOFS | __GFP_NOFAIL);
+	set_active_memcg(old_memcg);
 	if (!ret)
 		goto finish;