Message ID | b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | btrfs: root memcgroup for metadata filemap_add_folio() | expand |
Hi Qu, On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote: > [BACKGROUND] > The function filemap_add_folio() charges the memory cgroup, > as we assume all page caches are accessible by user space progresses > thus needs the cgroup accounting. > > However btrfs is a special case, it has a very large metadata thanks to > its support of data csum (by default it's 4 bytes per 4K data, and can > be as large as 32 bytes per 4K data). > This means btrfs has to go page cache for its metadata pages, to take > advantage of both cache and reclaim ability of filemap. > > This has a tiny problem, that all btrfs metadata pages have to go through > the memcgroup charge, even all those metadata pages are not > accessible by the user space, and doing the charging can introduce some > latency if there is a memory limits set. > > Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup > charge situation so that metadata pages won't really be limited by > memcgroup. > > [ENHANCEMENT] > Instead of relying on __GFP_NOFAIL to avoid charge failure, use root > memory cgroup to attach metadata pages. > > Although this needs to export the symbol mem_root_cgroup for > CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG. > > With root memory cgroup, we directly skip the charging part, and only > rely on __GFP_NOFAIL for the real memory allocation part. > I have a couple of questions: 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with oom-kills? 2. What the normal overhead of these metadata in real world production environment? I see 4 to 32 bytes per 4k but what's the most used one and does it depend on the data of 4k or something else? 3. Most probably multiple metadata values are colocated on a single 4k page of the btrfs page cache even though the corresponding page cache might be charged to different cgroups. Is that correct? 4. What is stopping us to use reclaimable slab cache for this metadata? thanks, Shakeel
在 2024/10/1 02:53, Shakeel Butt 写道: > Hi Qu, > > On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote: >> [BACKGROUND] >> The function filemap_add_folio() charges the memory cgroup, >> as we assume all page caches are accessible by user space progresses >> thus needs the cgroup accounting. >> >> However btrfs is a special case, it has a very large metadata thanks to >> its support of data csum (by default it's 4 bytes per 4K data, and can >> be as large as 32 bytes per 4K data). >> This means btrfs has to go page cache for its metadata pages, to take >> advantage of both cache and reclaim ability of filemap. >> >> This has a tiny problem, that all btrfs metadata pages have to go through >> the memcgroup charge, even all those metadata pages are not >> accessible by the user space, and doing the charging can introduce some >> latency if there is a memory limits set. >> >> Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup >> charge situation so that metadata pages won't really be limited by >> memcgroup. >> >> [ENHANCEMENT] >> Instead of relying on __GFP_NOFAIL to avoid charge failure, use root >> memory cgroup to attach metadata pages. >> >> Although this needs to export the symbol mem_root_cgroup for >> CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG. >> >> With root memory cgroup, we directly skip the charging part, and only >> rely on __GFP_NOFAIL for the real memory allocation part. >> > > I have a couple of questions: > > 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with > oom-kills? The NOFAIL flag is inherited from the memory allocation for metadata tree blocks. Although btrfs has error handling already for all the possible ENOMEMs, hitting ENOMEMs for metadata may still be a big problem, thus all my previous attempt to remove NOFAIL flag all got rejected. > > 2. What the normal overhead of these metadata in real world production > environment? I see 4 to 32 bytes per 4k but what's the most used one and > does it depend on the data of 4k or something else? What did you mean by the "overhead" part? Did you mean the checksum? If so, there is none, because btrfs store metadata checksum inside the tree block (thus the page cache). The first 32 bytes of a tree block are always reserved for metadata checksum. The tree block size depends on the mkfs time option nodesize, is 16K by default, and that's the most common value. > > 3. Most probably multiple metadata values are colocated on a single 4k > page of the btrfs page cache even though the corresponding page cache > might be charged to different cgroups. Is that correct? Not always a single 4K page, it depends on the nodesize, which is 16K by default. Otherwise yes, the metadata page cache can be charged to different cgroup, depending on the caller's context. And we do not want to charge the metadata page cache to the caller's cgroup, since it's really a shared resource and the caller has no way to directly accessing the page cache. Not charging the metadata page cache will align btrfs more to the ext4/xfs, which all uses regular page allocation without attaching to a filemap. > > 4. What is stopping us to use reclaimable slab cache for this metadata? Josef has tried this before, the attempt failed on the shrinker part, and partly due to the size. Btrfs has very large metadata compared to all other fses, not only due to the COW nature and a larger tree block size (16K by default), but also the extra data checksum (4 bytes per 4K by default, 32 bytes per 4K maximum). On a real world system, the metadata itself can easily go hundreds of GiBs, thus a shrinker is definitely needed. Thus so far btrfs is using page cache for its metadata cache. Thanks, Qu > > thanks, > Shakeel
On Tue, Oct 01, 2024 at 07:30:38AM GMT, Qu Wenruo wrote: > > > 在 2024/10/1 02:53, Shakeel Butt 写道: > > Hi Qu, > > > > On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote: > > > [BACKGROUND] > > > The function filemap_add_folio() charges the memory cgroup, > > > as we assume all page caches are accessible by user space progresses > > > thus needs the cgroup accounting. > > > > > > However btrfs is a special case, it has a very large metadata thanks to > > > its support of data csum (by default it's 4 bytes per 4K data, and can > > > be as large as 32 bytes per 4K data). > > > This means btrfs has to go page cache for its metadata pages, to take > > > advantage of both cache and reclaim ability of filemap. > > > > > > This has a tiny problem, that all btrfs metadata pages have to go through > > > the memcgroup charge, even all those metadata pages are not > > > accessible by the user space, and doing the charging can introduce some > > > latency if there is a memory limits set. > > > > > > Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup > > > charge situation so that metadata pages won't really be limited by > > > memcgroup. > > > > > > [ENHANCEMENT] > > > Instead of relying on __GFP_NOFAIL to avoid charge failure, use root > > > memory cgroup to attach metadata pages. > > > > > > Although this needs to export the symbol mem_root_cgroup for > > > CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG. > > > > > > With root memory cgroup, we directly skip the charging part, and only > > > rely on __GFP_NOFAIL for the real memory allocation part. > > > > > > > I have a couple of questions: > > > > 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with > > oom-kills? > > The NOFAIL flag is inherited from the memory allocation for metadata tree > blocks. > > Although btrfs has error handling already for all the possible ENOMEMs, > hitting ENOMEMs for metadata may still be a big problem, thus all my > previous attempt to remove NOFAIL flag all got rejected. __GFP_NOFAIL for memcg charging is reasonable in many scenarios. Memcg oom-killer is enabled for __GFP_NOFAIL and going over limit and getting oom-killed is totally reasonable. Orthogonal to the discussion though. > > > > > 2. What the normal overhead of these metadata in real world production > > environment? I see 4 to 32 bytes per 4k but what's the most used one and > > does it depend on the data of 4k or something else? > > What did you mean by the "overhead" part? Did you mean the checksum? > To me this metadata is overhead, so yes checksum is something not the actual data stored on the storage. > If so, there is none, because btrfs store metadata checksum inside the tree > block (thus the page cache). > The first 32 bytes of a tree block are always reserved for metadata > checksum. > > The tree block size depends on the mkfs time option nodesize, is 16K by > default, and that's the most common value. Sorry I am not very familiar with btrfs. What is tree block? > > > > > 3. Most probably multiple metadata values are colocated on a single 4k > > page of the btrfs page cache even though the corresponding page cache > > might be charged to different cgroups. Is that correct? > > Not always a single 4K page, it depends on the nodesize, which is 16K by > default. > > Otherwise yes, the metadata page cache can be charged to different cgroup, > depending on the caller's context. > And we do not want to charge the metadata page cache to the caller's cgroup, > since it's really a shared resource and the caller has no way to directly > accessing the page cache. > > Not charging the metadata page cache will align btrfs more to the ext4/xfs, > which all uses regular page allocation without attaching to a filemap. > Can you point me to ext4/xfs code where they are allocating uncharged memory for their metadata? > > > > 4. What is stopping us to use reclaimable slab cache for this metadata? > > Josef has tried this before, the attempt failed on the shrinker part, and > partly due to the size. > > Btrfs has very large metadata compared to all other fses, not only due to > the COW nature and a larger tree block size (16K by default), but also the > extra data checksum (4 bytes per 4K by default, 32 bytes per 4K maximum). > > On a real world system, the metadata itself can easily go hundreds of GiBs, > thus a shrinker is definitely needed. This amount of uncharged memory is concerning which becomes part of system overhead and may impact the schedulable memory for the datacenter environment. Overall the code seems fine and no pushback from me if btrfs maintainers are ok with this. I think btrfs should move to slab+shrinker based solution for this metadata unless there is deep technical reason not to. thanks, Shakeel
在 2024/10/1 11:07, Shakeel Butt 写道: > On Tue, Oct 01, 2024 at 07:30:38AM GMT, Qu Wenruo wrote: [...] >> >> Although btrfs has error handling already for all the possible ENOMEMs, >> hitting ENOMEMs for metadata may still be a big problem, thus all my >> previous attempt to remove NOFAIL flag all got rejected. > > __GFP_NOFAIL for memcg charging is reasonable in many scenarios. Memcg > oom-killer is enabled for __GFP_NOFAIL and going over limit and getting > oom-killed is totally reasonable. Orthogonal to the discussion though. > >> >>> >>> 2. What the normal overhead of these metadata in real world production >>> environment? I see 4 to 32 bytes per 4k but what's the most used one and >>> does it depend on the data of 4k or something else? >> >> What did you mean by the "overhead" part? Did you mean the checksum? >> > > To me this metadata is overhead, so yes checksum is something not the > actual data stored on the storage. Oh, by "metadata" it means everything not data. It includes all the info like directory layout, file layout, data checksum and all the other needed info to represent a btrfs. > >> If so, there is none, because btrfs store metadata checksum inside the tree >> block (thus the page cache). >> The first 32 bytes of a tree block are always reserved for metadata >> checksum. >> >> The tree block size depends on the mkfs time option nodesize, is 16K by >> default, and that's the most common value. > > Sorry I am not very familiar with btrfs. What is tree block? A tree block of btrfs is a fixed block, containing metadata (aka, everything other than the data), organized in a B-tree structure. A tree block can be a node, containing the pointers to the next level nodes/leaves. Or a leave, contains the key and the extra info bound to that key. And btrfs uses the same tree block structure for all different kind of info. E.g. an inode is stored with (<ino> INODE_ITEM 0) as the key, with a btrfs_inode_item structure as the extra data bound to that key. And a file extent is stored with (<ino> EXTENT_DATA <file pos>) as the key, with a btrfs_file_extent_item structure bound to that key. > >> >>> >>> 3. Most probably multiple metadata values are colocated on a single 4k >>> page of the btrfs page cache even though the corresponding page cache >>> might be charged to different cgroups. Is that correct? >> >> Not always a single 4K page, it depends on the nodesize, which is 16K by >> default. >> >> Otherwise yes, the metadata page cache can be charged to different cgroup, >> depending on the caller's context. >> And we do not want to charge the metadata page cache to the caller's cgroup, >> since it's really a shared resource and the caller has no way to directly >> accessing the page cache. >> >> Not charging the metadata page cache will align btrfs more to the ext4/xfs, >> which all uses regular page allocation without attaching to a filemap. >> > > Can you point me to ext4/xfs code where they are allocating uncharged > memory for their metadata? For xfs, it's inside fs/xfs/xfs_buf.c. E.g. xfs_buf_alloc_pages(), which goes with kzalloc() to allocate needed pages. For ext4 it's using buffer header, which is I'm not familiar at all. But it looks like the bh folios are from the block device mapping, which may still be charged by cgroup. Thanks, Qu > >>> >>> 4. What is stopping us to use reclaimable slab cache for this metadata? >> >> Josef has tried this before, the attempt failed on the shrinker part, and >> partly due to the size. >> >> Btrfs has very large metadata compared to all other fses, not only due to >> the COW nature and a larger tree block size (16K by default), but also the >> extra data checksum (4 bytes per 4K by default, 32 bytes per 4K maximum). >> >> On a real world system, the metadata itself can easily go hundreds of GiBs, >> thus a shrinker is definitely needed. > > This amount of uncharged memory is concerning which becomes part of > system overhead and may impact the schedulable memory for the datacenter > environment. > > Overall the code seems fine and no pushback from me if btrfs maintainers > are ok with this. I think btrfs should move to slab+shrinker based > solution for this metadata unless there is deep technical reason not to. > > thanks, > Shakeel
On Sat, Sep 28, 2024 at 02:15:56PM +0930, Qu Wenruo wrote: > [BACKGROUND] > The function filemap_add_folio() charges the memory cgroup, > as we assume all page caches are accessible by user space progresses > thus needs the cgroup accounting. > > However btrfs is a special case, it has a very large metadata thanks to > its support of data csum (by default it's 4 bytes per 4K data, and can > be as large as 32 bytes per 4K data). > This means btrfs has to go page cache for its metadata pages, to take > advantage of both cache and reclaim ability of filemap. FYI, in general reclaims for metadata work much better with a shrinker than through the pagecache, because it can be object based and prioritized. > [ENHANCEMENT] > Instead of relying on __GFP_NOFAIL to avoid charge failure, use root > memory cgroup to attach metadata pages. > > Although this needs to export the symbol mem_root_cgroup for > CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG. > > With root memory cgroup, we directly skip the charging part, and only > rely on __GFP_NOFAIL for the real memory allocation part. This looks pretty ugly. What speaks against a version of filemap_add_folio that doesn't charge the memcg?
在 2024/10/1 18:49, Christoph Hellwig 写道: > On Sat, Sep 28, 2024 at 02:15:56PM +0930, Qu Wenruo wrote: >> [BACKGROUND] >> The function filemap_add_folio() charges the memory cgroup, >> as we assume all page caches are accessible by user space progresses >> thus needs the cgroup accounting. >> >> However btrfs is a special case, it has a very large metadata thanks to >> its support of data csum (by default it's 4 bytes per 4K data, and can >> be as large as 32 bytes per 4K data). >> This means btrfs has to go page cache for its metadata pages, to take >> advantage of both cache and reclaim ability of filemap. > > FYI, in general reclaims for metadata work much better with a shrinker > than through the pagecache, because it can be object based and > prioritized. > >> [ENHANCEMENT] >> Instead of relying on __GFP_NOFAIL to avoid charge failure, use root >> memory cgroup to attach metadata pages. >> >> Although this needs to export the symbol mem_root_cgroup for >> CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG. >> >> With root memory cgroup, we directly skip the charging part, and only >> rely on __GFP_NOFAIL for the real memory allocation part. > > This looks pretty ugly. What speaks against a version of > filemap_add_folio that doesn't charge the memcg? > Because there is so far only one caller has such requirement. Furthermore I believe the folio API doesn't prefer too many different functions doing similar things. E.g. the new folio interfaces only provides filemap_get_folio(), filemap_lock_folio(), and the more generic __filemap_get_folio(). Meanwhile there are tons of page based interfaces, find_get_page(), find_or_create_page(), find_lock_page() and flags version etc. Thus I think something like filemap_add_folio_no_memcg_charge() will be rejected. Finally, it's not feasible to go with a new GFP flag either. We already have __GFP_ACCOUNT for memcg charging purposes, but for filemap_add_folio() even if we do not pass __GFP_ACCOUNT, the memcg will still be charged. It will be even more ugly if we add a __GFP_NO_ACCOUNT, and such attempt is already rejected before IIRC. Thanks, Qu
On Tue, Oct 01, 2024 at 07:10:07PM +0930, Qu Wenruo wrote: > > This looks pretty ugly. What speaks against a version of > > filemap_add_folio that doesn't charge the memcg? > > > > Because there is so far only one caller has such requirement. That is a good argument to review the reasons for an interface, but not a killer argument. > Furthermore I believe the folio API doesn't prefer too many different > functions doing similar things. > > E.g. the new folio interfaces only provides filemap_get_folio(), > filemap_lock_folio(), and the more generic __filemap_get_folio(). > > Meanwhile there are tons of page based interfaces, find_get_page(), > find_or_create_page(), find_lock_page() and flags version etc. That's a totally different argument, tough. Those functions were trivial wrappers around a more versatile low-level function. While this is about adding clearly defined functionality, and more importantly not exporting totally random low-level data. What I'd propose is something like the patch below, plus proper documentation. Note that this now does the uncharge on the unlocked folio in the error case. From a quick look that should be fine, but someone who actually knows the code needs to confirm that. diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 68a5f1ff3301c6..70da62cf32f6c3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1284,6 +1284,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, pgoff_t index, gfp_t gfp); int filemap_add_folio(struct address_space *mapping, struct folio *folio, pgoff_t index, gfp_t gfp); +int filemap_add_folio_nocharge(struct address_space *mapping, + struct folio *folio, pgoff_t index, gfp_t gfp); void filemap_remove_folio(struct folio *folio); void __filemap_remove_folio(struct folio *folio, void *shadow); void replace_page_cache_folio(struct folio *old, struct folio *new); diff --git a/mm/filemap.c b/mm/filemap.c index 36d22968be9a1e..0a1ae841e8c10f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -958,20 +958,15 @@ noinline int __filemap_add_folio(struct address_space *mapping, } ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO); -int filemap_add_folio(struct address_space *mapping, struct folio *folio, - pgoff_t index, gfp_t gfp) +int filemap_add_folio_nocharge(struct address_space *mapping, + struct folio *folio, pgoff_t index, gfp_t gfp) { void *shadow = NULL; int ret; - ret = mem_cgroup_charge(folio, NULL, gfp); - if (ret) - return ret; - __folio_set_locked(folio); ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow); if (unlikely(ret)) { - mem_cgroup_uncharge(folio); __folio_clear_locked(folio); } else { /* @@ -989,6 +984,22 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, } return ret; } +EXPORT_SYMBOL_GPL(filemap_add_folio_nocharge); + +int filemap_add_folio(struct address_space *mapping, struct folio *folio, + pgoff_t index, gfp_t gfp) +{ + int ret; + + ret = mem_cgroup_charge(folio, NULL, gfp); + if (ret) + return ret; + + ret = filemap_add_folio_nocharge(mapping, folio, index, gfp); + if (ret) + mem_cgroup_uncharge(folio); + return ret; +} EXPORT_SYMBOL_GPL(filemap_add_folio); #ifdef CONFIG_NUMA
On Wed 02-10-24 00:41:29, Christoph Hellwig wrote: [...] > What I'd propose is something like the patch below, plus proper > documentation. Note that this now does the uncharge on the unlocked > folio in the error case. From a quick look that should be fine, but > someone who actually knows the code needs to confirm that. yes, this is a much cleaner solution. filemap_add_folio_nocharge would need documentation explaining when this is supposed to be used. > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 68a5f1ff3301c6..70da62cf32f6c3 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -1284,6 +1284,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, > pgoff_t index, gfp_t gfp); > int filemap_add_folio(struct address_space *mapping, struct folio *folio, > pgoff_t index, gfp_t gfp); > +int filemap_add_folio_nocharge(struct address_space *mapping, > + struct folio *folio, pgoff_t index, gfp_t gfp); > void filemap_remove_folio(struct folio *folio); > void __filemap_remove_folio(struct folio *folio, void *shadow); > void replace_page_cache_folio(struct folio *old, struct folio *new); > diff --git a/mm/filemap.c b/mm/filemap.c > index 36d22968be9a1e..0a1ae841e8c10f 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -958,20 +958,15 @@ noinline int __filemap_add_folio(struct address_space *mapping, > } > ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO); > > -int filemap_add_folio(struct address_space *mapping, struct folio *folio, > - pgoff_t index, gfp_t gfp) > +int filemap_add_folio_nocharge(struct address_space *mapping, > + struct folio *folio, pgoff_t index, gfp_t gfp) > { > void *shadow = NULL; > int ret; > > - ret = mem_cgroup_charge(folio, NULL, gfp); > - if (ret) > - return ret; > - > __folio_set_locked(folio); > ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow); > if (unlikely(ret)) { > - mem_cgroup_uncharge(folio); > __folio_clear_locked(folio); > } else { > /* > @@ -989,6 +984,22 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, > } > return ret; > } > +EXPORT_SYMBOL_GPL(filemap_add_folio_nocharge); > + > +int filemap_add_folio(struct address_space *mapping, struct folio *folio, > + pgoff_t index, gfp_t gfp) > +{ > + int ret; > + > + ret = mem_cgroup_charge(folio, NULL, gfp); > + if (ret) > + return ret; > + > + ret = filemap_add_folio_nocharge(mapping, folio, index, gfp); > + if (ret) > + mem_cgroup_uncharge(folio); > + return ret; > +} > EXPORT_SYMBOL_GPL(filemap_add_folio); > > #ifdef CONFIG_NUMA
在 2024/10/2 17:11, Christoph Hellwig 写道: > On Tue, Oct 01, 2024 at 07:10:07PM +0930, Qu Wenruo wrote: >>> This looks pretty ugly. What speaks against a version of >>> filemap_add_folio that doesn't charge the memcg? >>> >> >> Because there is so far only one caller has such requirement. > > That is a good argument to review the reasons for an interface, but > not a killer argument. > >> Furthermore I believe the folio API doesn't prefer too many different >> functions doing similar things. >> >> E.g. the new folio interfaces only provides filemap_get_folio(), >> filemap_lock_folio(), and the more generic __filemap_get_folio(). >> >> Meanwhile there are tons of page based interfaces, find_get_page(), >> find_or_create_page(), find_lock_page() and flags version etc. > > That's a totally different argument, tough. Those functions were > trivial wrappers around a more versatile low-level function. > > While this is about adding clearly defined functionality, and > more importantly not exporting totally random low-level data. > > What I'd propose is something like the patch below, plus proper > documentation. Note that this now does the uncharge on the unlocked > folio in the error case. From a quick look that should be fine, but > someone who actually knows the code needs to confirm that. The interface looks good to me, especially we completely skip the charging, which is even better than the current form. And since Michal is also happy with this idea, I can definite go this path. Just a little curious, would it be better to introduce a flag for address_space to indicate whether the folio needs to be charged or not? Thanks, Qu > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 68a5f1ff3301c6..70da62cf32f6c3 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -1284,6 +1284,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, > pgoff_t index, gfp_t gfp); > int filemap_add_folio(struct address_space *mapping, struct folio *folio, > pgoff_t index, gfp_t gfp); > +int filemap_add_folio_nocharge(struct address_space *mapping, > + struct folio *folio, pgoff_t index, gfp_t gfp); > void filemap_remove_folio(struct folio *folio); > void __filemap_remove_folio(struct folio *folio, void *shadow); > void replace_page_cache_folio(struct folio *old, struct folio *new); > diff --git a/mm/filemap.c b/mm/filemap.c > index 36d22968be9a1e..0a1ae841e8c10f 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -958,20 +958,15 @@ noinline int __filemap_add_folio(struct address_space *mapping, > } > ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO); > > -int filemap_add_folio(struct address_space *mapping, struct folio *folio, > - pgoff_t index, gfp_t gfp) > +int filemap_add_folio_nocharge(struct address_space *mapping, > + struct folio *folio, pgoff_t index, gfp_t gfp) > { > void *shadow = NULL; > int ret; > > - ret = mem_cgroup_charge(folio, NULL, gfp); > - if (ret) > - return ret; > - > __folio_set_locked(folio); > ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow); > if (unlikely(ret)) { > - mem_cgroup_uncharge(folio); > __folio_clear_locked(folio); > } else { > /* > @@ -989,6 +984,22 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, > } > return ret; > } > +EXPORT_SYMBOL_GPL(filemap_add_folio_nocharge); > + > +int filemap_add_folio(struct address_space *mapping, struct folio *folio, > + pgoff_t index, gfp_t gfp) > +{ > + int ret; > + > + ret = mem_cgroup_charge(folio, NULL, gfp); > + if (ret) > + return ret; > + > + ret = filemap_add_folio_nocharge(mapping, folio, index, gfp); > + if (ret) > + mem_cgroup_uncharge(folio); > + return ret; > +} > EXPORT_SYMBOL_GPL(filemap_add_folio); > > #ifdef CONFIG_NUMA >
On Thu 03-10-24 17:41:23, Qu Wenruo wrote: [...] > Just a little curious, would it be better to introduce a flag for > address_space to indicate whether the folio needs to be charged or not? I would say that an explicit interface seems better because it is easier to find (grep) and reason about. If you make this address space property then it is really hard to find all the callers.
在 2024/10/3 17:52, Michal Hocko 写道: > On Thu 03-10-24 17:41:23, Qu Wenruo wrote: > [...] >> Just a little curious, would it be better to introduce a flag for >> address_space to indicate whether the folio needs to be charged or not? > > I would say that an explicit interface seems better because it is easier > to find (grep) and reason about. If you make this address space property > then it is really hard to find all the callers. Makes sense, thanks a lot for all the help! Thanks, Qu >
On Thu, Oct 03, 2024 at 10:07:33AM GMT, Michal Hocko wrote: > On Wed 02-10-24 00:41:29, Christoph Hellwig wrote: > [...] > > What I'd propose is something like the patch below, plus proper > > documentation. Note that this now does the uncharge on the unlocked > > folio in the error case. From a quick look that should be fine, but > > someone who actually knows the code needs to confirm that. > > yes, this is a much cleaner solution. filemap_add_folio_nocharge would > need documentation explaining when this is supposed to be used. > I feel like we should not make bypassing cgroup accounting easier but rather make it more awkward :P, so folks give much more thought before opting to do so. Though I agree filemap_add_folio_nocharge() is easy to grep.
On Wed, Oct 02, 2024 at 12:41:29AM -0700, Christoph Hellwig wrote: > > > This looks pretty ugly. What speaks against a version of > > > filemap_add_folio that doesn't charge the memcg? > > What I'd propose is something like the patch below, plus proper > documentation. I like this much better as well. > Note that this now does the uncharge on the unlocked folio in the > error case. From a quick look that should be fine, but someone who > actually knows the code needs to confirm that. That's fine. For the same reason the non-atomic __folio_clear_locked() is fine in that case. The folio just has to be exclusive.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 9302fde9c464..a3a3fb825a47 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2919,6 +2919,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i, struct address_space *mapping = fs_info->btree_inode->i_mapping; const unsigned long index = eb->start >> PAGE_SHIFT; struct folio *existing_folio = NULL; + struct mem_cgroup *old_memcg; int ret; ASSERT(found_eb_ret); @@ -2927,8 +2928,16 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i, ASSERT(eb->folios[i]); retry: + /* + * Btree inode is a btrfs internal inode, and not exposed to any user. + * + * We do not want any cgroup limit on this inode, thus using + * root_mem_cgroup for metadata filemap. + */ + old_memcg = set_active_memcg(root_mem_cgroup); ret = filemap_add_folio(mapping, eb->folios[i], index + i, GFP_NOFS | __GFP_NOFAIL); + set_active_memcg(old_memcg); if (!ret) goto finish; diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0e5bf25d324f..efec74344a4d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1067,6 +1067,8 @@ void split_page_memcg(struct page *head, int old_order, int new_order); #define MEM_CGROUP_ID_SHIFT 0 +#define root_mem_cgroup (NULL) + static inline struct mem_cgroup *folio_memcg(struct folio *folio) { return NULL; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d563fb515766..2dd1f286364d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -75,6 +75,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly; EXPORT_SYMBOL(memory_cgrp_subsys); struct mem_cgroup *root_mem_cgroup __read_mostly; +EXPORT_SYMBOL(root_mem_cgroup); /* Active memory cgroup to use from an interrupt context */ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
[BACKGROUND] The function filemap_add_folio() charges the memory cgroup, as we assume all page caches are accessible by user space progresses thus needs the cgroup accounting. However btrfs is a special case, it has a very large metadata thanks to its support of data csum (by default it's 4 bytes per 4K data, and can be as large as 32 bytes per 4K data). This means btrfs has to go page cache for its metadata pages, to take advantage of both cache and reclaim ability of filemap. This has a tiny problem, that all btrfs metadata pages have to go through the memcgroup charge, even all those metadata pages are not accessible by the user space, and doing the charging can introduce some latency if there is a memory limits set. Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup charge situation so that metadata pages won't really be limited by memcgroup. [ENHANCEMENT] Instead of relying on __GFP_NOFAIL to avoid charge failure, use root memory cgroup to attach metadata pages. Although this needs to export the symbol mem_root_cgroup for CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG. With root memory cgroup, we directly skip the charging part, and only rely on __GFP_NOFAIL for the real memory allocation part. Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/extent_io.c | 9 +++++++++ include/linux/memcontrol.h | 2 ++ mm/memcontrol.c | 1 + 3 files changed, 12 insertions(+)