Message ID | 20240301-slab-memcg-v1-4-359328a46596@suse.cz (mailing list archive) |
---|---|
State | New |
Series | memcg_kmem hooks refactoring and kmem_cache_charge() |
On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> This is just an example of using the kmem_cache_charge() API. I think
> it's placed in a place that's applicable for Linus's example [1]
> although he mentions do_dentry_open() - I have followed from strace()
> showing openat(2) to path_openat() doing the alloc_empty_file().

Thanks. This is not the right patch, but yes, patches 1-3 look very
nice to me.

> The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that
> want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that
> in alloc_empty_file_noaccount() (despite the contradictory name but the
> noaccount refers to something else, right?) as IIUC it's about
> kernel-internal opens.

Yeah, the "noaccount" function is about not accounting it towards nr_files.

That said, I don't think it necessarily needs to do the memory
accounting either - it's literally for cases where we're never going
to install the file descriptor in any user space.

Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't
think it's really the right thing either, because

> Why is this unfinished:
>
> - there are other callers of alloc_empty_file() which I didn't adjust so
>   they simply became memcg-unaccounted. I haven't investigated for which
>   ones it would also make sense to separate the allocation and accounting.
>   Maybe alloc_empty_file() would need to get a parameter to control this.

Right. I think the natural and logical way to deal with this is to
just say "we account when we add the file to the fdtable".

IOW, just have fd_install() do it. That's the really natural point,
and also makes it very logical why alloc_empty_file_noaccount()
wouldn't need to do the GFP_KERNEL_ACCOUNT.

> - I don't know how to properly unwind the accounting failure case. It
>   seems like a new case because when we succeed the open, there's no
>   further error path at least in path_openat().

Yeah, let me think about this part. Because fd_install() is the right
point, but that too does not really allow for error handling.

Yes, we could close things and fail it, but it really is much too late
at this point.

What I *think* I'd want for this case is

 (a) allow the accounting to go over by a bit

 (b) make sure there's a cheap way to ask (before) about "did we go
     over the limit"

IOW, the accounting never needed to be byte-accurate to begin with,
and making it fail (cheaply and early) on the next file allocation is
fine.

Just make it really cheap. Can we do that?

For example, maybe don't bother with the whole "bytes and pages"
stuff. Just a simple "are we more than one page over?" kind of
question. Without the 'stock_lock' mess for sub-page bytes etc.

How would that look? Would it result in something that can be done
cheaply without locking and atomics and without excessive pointer
indirection through many levels of memcg data structures?

              Linus
On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote:
> On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > This is just an example of using the kmem_cache_charge() API. I think
> > it's placed in a place that's applicable for Linus's example [1]
> > although he mentions do_dentry_open() - I have followed from strace()
> > showing openat(2) to path_openat() doing the alloc_empty_file().
>
> Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me.
>
> > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that
> > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that
> > in alloc_empty_file_noaccount() (despite the contradictory name but the
> > noaccount refers to something else, right?) as IIUC it's about
> > kernel-internal opens.
>
> Yeah, the "noaccount" function is about not accounting it towards nr_files.
>
> That said, I don't think it necessarily needs to do the memory
> accounting either - it's literally for cases where we're never going
> to install the file descriptor in any user space.
>
> Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't
> think it's really the right thing either, because
>
> > Why is this unfinished:
> >
> > - there are other callers of alloc_empty_file() which I didn't adjust so
> >   they simply became memcg-unaccounted. I haven't investigated for which
> >   ones it would also make sense to separate the allocation and accounting.
> >   Maybe alloc_empty_file() would need to get a parameter to control this.
>
> Right. I think the natural and logical way to deal with this is to
> just say "we account when we add the file to the fdtable".
>
> IOW, just have fd_install() do it. That's the really natural point,
> and also makes it very logical why alloc_empty_file_noaccount()
> wouldn't need to do the GFP_KERNEL_ACCOUNT.
>
> > - I don't know how to properly unwind the accounting failure case. It
> >   seems like a new case because when we succeed the open, there's no
> >   further error path at least in path_openat().
>
> Yeah, let me think about this part. Because fd_install() is the right
> point, but that too does not really allow for error handling.
>
> Yes, we could close things and fail it, but it really is much too late
> at this point.
>
> What I *think* I'd want for this case is
>
>  (a) allow the accounting to go over by a bit
>
>  (b) make sure there's a cheap way to ask (before) about "did we go
>      over the limit"
>
> IOW, the accounting never needed to be byte-accurate to begin with,
> and making it fail (cheaply and early) on the next file allocation is
> fine.
>
> Just make it really cheap. Can we do that?
>
> For example, maybe don't bother with the whole "bytes and pages"
> stuff. Just a simple "are we more than one page over?" kind of
> question. Without the 'stock_lock' mess for sub-page bytes etc.
>
> How would that look? Would it result in something that can be done
> cheaply without locking and atomics and without excessive pointer
> indirection through many levels of memcg data structures?

I think it's possible and I'm currently looking into batching charge,
objcg refcnt management and vmstats using per-task caching. It should
speed up things for the majority of allocations.

For allocations from an irq context and targeted allocations (where the
target memcg != memcg of the current task) we'd probably need to keep
the old scheme. I hope to post some patches relatively soon.

I tried to optimize the current implementation but failed to get any
significant gains. It seems that the overhead is very evenly spread
across objcg pointer access, charge management, objcg refcnt management
and vmstats.

Thanks!
On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote:
> On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > This is just an example of using the kmem_cache_charge() API. I think
> > it's placed in a place that's applicable for Linus's example [1]
> > although he mentions do_dentry_open() - I have followed from strace()
> > showing openat(2) to path_openat() doing the alloc_empty_file().
>
> Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me.
>
> > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that
> > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that
> > in alloc_empty_file_noaccount() (despite the contradictory name but the
> > noaccount refers to something else, right?) as IIUC it's about
> > kernel-internal opens.
>
> Yeah, the "noaccount" function is about not accounting it towards nr_files.
> That said, I don't think it necessarily needs to do the memory
> accounting either - it's literally for cases where we're never going
> to install the file descriptor in any user space.

Exactly.

> Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't
> think it's really the right thing either, because
>
> > Why is this unfinished:
> >
> > - there are other callers of alloc_empty_file() which I didn't adjust so
> >   they simply became memcg-unaccounted. I haven't investigated for which
> >   ones it would also make sense to separate the allocation and accounting.
> >   Maybe alloc_empty_file() would need to get a parameter to control this.
>
> Right. I think the natural and logical way to deal with this is to
> just say "we account when we add the file to the fdtable".
> IOW, just have fd_install() do it. That's the really natural point,
> and also makes it very logical why alloc_empty_file_noaccount()
> wouldn't need to do the GFP_KERNEL_ACCOUNT.
>
> > - I don't know how to properly unwind the accounting failure case. It
> >   seems like a new case because when we succeed the open, there's no
> >   further error path at least in path_openat().
>
> Yeah, let me think about this part. Because fd_install() is the right
> point, but that too does not really allow for error handling.
>
> Yes, we could close things and fail it, but it really is much too late
> at this point.

It would also mean massaging 100+ callsites. And having a
non-subsystem-specific failure step between file allocation, fd
reservation and fd_install() would be awkward and an invitation for bugs.

> What I *think* I'd want for this case is
>
>  (a) allow the accounting to go over by a bit
>
>  (b) make sure there's a cheap way to ask (before) about "did we go
>      over the limit"
>
> IOW, the accounting never needed to be byte-accurate to begin with,
> and making it fail (cheaply and early) on the next file allocation is
> fine.

I think that's a good idea.
On 3/1/24 19:53, Roman Gushchin wrote:
> On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote:
>> What I *think* I'd want for this case is
>>
>>  (a) allow the accounting to go over by a bit
>>
>>  (b) make sure there's a cheap way to ask (before) about "did we go
>>      over the limit"
>>
>> IOW, the accounting never needed to be byte-accurate to begin with,
>> and making it fail (cheaply and early) on the next file allocation is
>> fine.
>>
>> Just make it really cheap. Can we do that?
>>
>> For example, maybe don't bother with the whole "bytes and pages"
>> stuff. Just a simple "are we more than one page over?" kind of
>> question. Without the 'stock_lock' mess for sub-page bytes etc.
>>
>> How would that look? Would it result in something that can be done
>> cheaply without locking and atomics and without excessive pointer
>> indirection through many levels of memcg data structures?
>
> I think it's possible and I'm currently looking into batching charge,
> objcg refcnt management and vmstats using per-task caching. It should
> speed up things for the majority of allocations.
> For allocations from an irq context and targeted allocations
> (where the target memcg != memcg of the current task) we'd probably need to
> keep the old scheme. I hope to post some patches relatively soon.

Do you think this will work on top of this series, i.e. patches 1+2 could
be eventually put to slab/for-next after the merge window, or would it
interfere with your changes?

> I tried to optimize the current implementation but failed to get any
> significant gains. It seems that the overhead is very evenly spread across
> objcg pointer access, charge management, objcg refcnt management and vmstats.
>
> Thanks!
On Tue, Mar 12, 2024 at 10:22:54AM +0100, Vlastimil Babka wrote:
> On 3/1/24 19:53, Roman Gushchin wrote:
> > On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote:
> >> What I *think* I'd want for this case is
> >>
> >>  (a) allow the accounting to go over by a bit
> >>
> >>  (b) make sure there's a cheap way to ask (before) about "did we go
> >>      over the limit"
> >>
> >> IOW, the accounting never needed to be byte-accurate to begin with,
> >> and making it fail (cheaply and early) on the next file allocation is
> >> fine.
> >>
> >> Just make it really cheap. Can we do that?
> >>
> >> For example, maybe don't bother with the whole "bytes and pages"
> >> stuff. Just a simple "are we more than one page over?" kind of
> >> question. Without the 'stock_lock' mess for sub-page bytes etc.
> >>
> >> How would that look? Would it result in something that can be done
> >> cheaply without locking and atomics and without excessive pointer
> >> indirection through many levels of memcg data structures?
> >
> > I think it's possible and I'm currently looking into batching charge,
> > objcg refcnt management and vmstats using per-task caching. It should
> > speed up things for the majority of allocations.
> > For allocations from an irq context and targeted allocations
> > (where the target memcg != memcg of the current task) we'd probably need to
> > keep the old scheme. I hope to post some patches relatively soon.
>
> Do you think this will work on top of this series, i.e. patches 1+2 could be
> eventually put to slab/for-next after the merge window, or would it
> interfere with your changes?

Please, go on and merge them, I'll rebase on top of it, it will be even
better for my work. I made a couple of comments there, but overall they
look very good to me, thank you for doing this work!

> > I tried to optimize the current implementation but failed to get any
> > significant gains. It seems that the overhead is very evenly spread across
> > objcg pointer access, charge management, objcg refcnt management and vmstats.

I started working on the thing, but it's a bit more complicated than I
initially thought because:

1) there are allocations made from a !in_task() context, so we need to
   handle this correctly

2) tasks can be moved between cgroups concurrently to memory allocations.
   Fortunately my recent changes provide a path here, but it adds to the
   complexity. In an alternative world where tasks couldn't move between
   cgroups, life would be so much better (and faster too, we could remove
   a ton of synchronization).

3) we do have per-numa-node per-memcg stats, which are less trivial to
   cache on struct task

I hope to resolve these issues somehow and post patches, but probably
will need a bit more time.

Thanks!
On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote:
> Right. I think the natural and logical way to deal with this is to
> just say "we account when we add the file to the fdtable".
>
> IOW, just have fd_install() do it. That's the really natural point,
> and also makes it very logical why alloc_empty_file_noaccount()
> wouldn't need to do the GFP_KERNEL_ACCOUNT.

We can have the same file occurring in many slots of many descriptor
tables, obviously. So it would have to be a flag (in ->f_mode?) set by
it, for "someone's already charged for it", or you'll end up with really
insane crap on each fork(), dup(), etc.

But there's also MAP_ANON with its setup_shmem_file(), with the
resulting file not going into descriptor tables at all, and that's not
a rare thing.

> > - I don't know how to properly unwind the accounting failure case. It
> >   seems like a new case because when we succeed the open, there's no
> >   further error path at least in path_openat().
>
> Yeah, let me think about this part. Because fd_install() is the right
> point, but that too does not really allow for error handling.
>
> Yes, we could close things and fail it, but it really is much too late
> at this point.

That as well. For things like O_CREAT even do_dentry_open() would be
too late for unrolls.

> What I *think* I'd want for this case is
>
>  (a) allow the accounting to go over by a bit
>
>  (b) make sure there's a cheap way to ask (before) about "did we go
>      over the limit"
>
> IOW, the accounting never needed to be byte-accurate to begin with,
> and making it fail (cheaply and early) on the next file allocation is
> fine.
>
> Just make it really cheap. Can we do that?

That might be reasonable, but TBH I would rather combine that with
do_dentry_open()/alloc_file() (i.e. the places where we set FMODE_OPENED)
as places to do that, rather than messing with fd_install().

How does the following sound?

* those who allocate empty files mark them if they are intended to be
  kernel-internal (see below for how to get the information there)

* memcg charge happens when we set FMODE_OPENED, provided that the
  struct file instance is not marked kernel-internal.

* exceeding the limit => pretend we'd succeeded and fail the next
  allocation.

As for how to get the information down there... We have 6 functions
where "allocate" and "mark it opened" callchains converge - alloc_file()
(pipe(2) et al., mostly), path_openat() (normal opens, but also
filp_open() et al.), dentry_open(), kernel_file_open(),
kernel_tmpfile_open(), dentry_create(). The last 3 are all
kernel-internal; dentry_open() might or might not be.

For path_openat() we can add a bit somewhere in struct open_flags; the
places where we set struct open_flags up would be the ones that might
need to be annotated. That's

	file_open_name()
	file_open_root()
	do_sys_openat2()	(definitely userland)
	io_openat2()		(ditto)
	sys_uselib()		(ditto)
	do_open_execat()	(IMO can be considered userland in all cases)

For alloc_file() it's almost always userland. IMO things like
dma_buf_export() and setup_shmem_file() should be charged.

So it's a matter of propagating the information to dentry_open(),
file_open_name() and file_open_root(), throwing filp_open() and
file_open_root_mnt() into the mix. That's about 70 callers to annotate.
<greps> 61, actually, and from the quick look it seems that most of
them are really obvious...

Comments?
[ Al, I hope your email works now ]

On Sat, 23 Mar 2024 at 19:27, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We can have the same file occurring in many slots of many descriptor
> tables, obviously. So it would have to be a flag (in ->f_mode?) set by
> it, for "someone's already charged for it", or you'll end up with really
> insane crap on each fork(), dup(), etc.

Nope. That flag already exists in the slab code itself with this patch.

The kmem_cache_charge() thing itself just sets the "I'm charged" bit in
the slab header, and you're done. Any subsequent fd_install() (with dup,
or fork, or whatever) simply is irrelevant.

In fact, dup and fork and friends won't need to worry about this,
because they only work on files that have already been installed, so
they know the file is already accounted.

So it's only the initial open() case that needs to do the
kmem_cache_charge() as it does the fd_install().

> But there's also MAP_ANON with its setup_shmem_file(), with the
> resulting file not going into descriptor tables at all, and that's not
> a rare thing.

Just making alloc_file_pseudo() do SLAB_ACCOUNT should take care of all
the normal cases. For once, the core allocator is not exposed very much,
so we can literally just look at "who does alloc_file*()" and it turns
out it's all pretty well abstracted out.

So I think it's mainly the three cases of 'alloc_empty_file()' that
would be affected and need to check that they actually do the
fd_install() (or release it). Everything else should either not account
at all (if they know they are doing temporary kernel things), or always
account (e.g. alloc_file_pseudo()).

              Linus
diff --git a/fs/file_table.c b/fs/file_table.c
index b991f90571b4..6401b6f175ae 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -223,6 +223,11 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
 	return ERR_PTR(-ENFILE);
 }
 
+int kmem_account_file(struct file *f)
+{
+	return kmem_cache_charge(filp_cachep, GFP_KERNEL_ACCOUNT, f);
+}
+
 /*
  * Variant of alloc_empty_file() that doesn't check and modify nr_files.
  *
@@ -234,7 +239,7 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred)
 	struct file *f;
 	int error;
 
-	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL_ACCOUNT);
 	if (unlikely(!f))
 		return ERR_PTR(-ENOMEM);
 
@@ -468,7 +473,7 @@ void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
 			SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN |
-			SLAB_PANIC | SLAB_ACCOUNT, NULL);
+			SLAB_PANIC, NULL);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
 
diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..06ada11b71d0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -96,6 +96,7 @@ extern void chroot_fs_refs(const struct path *, const struct path *);
 struct file *alloc_empty_file(int flags, const struct cred *cred);
 struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred);
 struct file *alloc_empty_backing_file(int flags, const struct cred *cred);
+int kmem_account_file(struct file *file);
 
 static inline void file_put_write_access(struct file *file)
 {
diff --git a/fs/namei.c b/fs/namei.c
index 4e0de939fea1..fcf3f3fcd059 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3799,8 +3799,10 @@ static struct file *path_openat(struct nameidata *nd,
 		terminate_walk(nd);
 	}
 	if (likely(!error)) {
-		if (likely(file->f_mode & FMODE_OPENED))
+		if (likely(file->f_mode & FMODE_OPENED)) {
+			kmem_account_file(file);
 			return file;
+		}
 		WARN_ON(1);
 		error = -EINVAL;
 	}