Message ID | 20220922224027.59266-1-ivan@cloudflare.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] proc: report open files as size in stat() for /proc/pid/fd | expand |
On Thu, Sep 22, 2022 at 3:40 PM Ivan Babrou <ivan@cloudflare.com> wrote: > > Many monitoring tools include open file count as a metric. Currently > the only way to get this number is to enumerate the files in /proc/pid/fd. > > The problem with the current approach is that it does many things people > generally don't care about when they need one number for a metric. > In our tests for cadvisor, which reports open file counts per cgroup, > we observed that reading the number of open files is slow. Out of 35.23% > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in > `proc_fill_cache`, which is responsible for filling dentry info. > Some of this extra time is spinlock contention, but it's a contention > for the lock we don't want to take to begin with. > > We considered putting the number of open files in /proc/pid/status. > Unfortunately, counting the number of fds involves iterating the open_files > bitmap, which has a linear complexity in proportion with the number > of open files (bitmap slots really, but it's close). We don't want > to make /proc/pid/status any slower, so instead we put this info > in /proc/pid/fd as a size member of the stat syscall result. > Previously the reported number was zero, so there's very little > risk of breaking anything, while still providing a somewhat logical > way to count the open files with a fallback if it's zero. > > RFC for this patch included iterating open fds under RCU. Thanks > to Frank Hofmann for the suggestion to use the bitmap instead. > > Previously: > > ``` > $ sudo stat /proc/1/fd | head -n2 > File: /proc/1/fd > Size: 0 Blocks: 0 IO Block: 1024 directory > ``` > > With this patch: > > ``` > $ sudo stat /proc/1/fd | head -n2 > File: /proc/1/fd > Size: 65 Blocks: 0 IO Block: 1024 directory > ``` > > Correctness check: > > ``` > $ sudo ls /proc/1/fd | wc -l > 65 > ``` > > I added the docs for /proc/<pid>/fd while I'm at it. > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > --- > v2: Added missing rcu_read_lock() / rcu_read_unlock(), > task_lock() / task_unlock() and put_task_struct(). > --- > Documentation/filesystems/proc.rst | 17 ++++++++++++ > fs/proc/fd.c | 44 ++++++++++++++++++++++++++++++ > 2 files changed, 61 insertions(+) Now that Linux 6.1-rc1 is out, should this patch be looked at for inclusion? I see that the net-next tree has opened, not sure if the same rules apply here. We've been running the v2 version of this patch in production successfully for some time now.
On Thu, 22 Sep 2022 15:40:26 -0700 Ivan Babrou <ivan@cloudflare.com> wrote: > Many monitoring tools include open file count as a metric. Currently > the only way to get this number is to enumerate the files in /proc/pid/fd. > > The problem with the current approach is that it does many things people > generally don't care about when they need one number for a metric. > In our tests for cadvisor, which reports open file counts per cgroup, > we observed that reading the number of open files is slow. Out of 35.23% > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in > `proc_fill_cache`, which is responsible for filling dentry info. > Some of this extra time is spinlock contention, but it's a contention > for the lock we don't want to take to begin with. > > We considered putting the number of open files in /proc/pid/status. > Unfortunately, counting the number of fds involves iterating the open_files > bitmap, which has a linear complexity in proportion with the number > of open files (bitmap slots really, but it's close). We don't want > to make /proc/pid/status any slower, so instead we put this info > in /proc/pid/fd as a size member of the stat syscall result. That sounds awfully logical. > Previously the reported number was zero, so there's very little > risk of breaking anything, while still providing a somewhat logical > way to count the open files with a fallback if it's zero. > > RFC for this patch included iterating open fds under RCU. Thanks > to Frank Hofmann for the suggestion to use the bitmap instead. > > Previously: > > ``` > $ sudo stat /proc/1/fd | head -n2 > File: /proc/1/fd > Size: 0 Blocks: 0 IO Block: 1024 directory > ``` > > With this patch: > > ``` > $ sudo stat /proc/1/fd | head -n2 > File: /proc/1/fd > Size: 65 Blocks: 0 IO Block: 1024 directory > ``` > > Correctness check: > > ``` > $ sudo ls /proc/1/fd | wc -l > 65 > ``` > > I added the docs for /proc/<pid>/fd while I'm at it. > Dang that's a good changelog. > index e7aafc82be99..394548d26187 100644 > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 > 3.10 /proc/<pid>/timerslack_ns - Task timerslack value > 3.11 /proc/<pid>/patch_state - Livepatch patch operation state > 3.12 /proc/<pid>/arch_status - Task architecture specific information > + 3.13 /proc/<pid>/fd - List of symlinks to open files > > 4 Configuring procfs > 4.1 Mount options > @@ -2145,6 +2146,22 @@ AVX512_elapsed_ms > the task is unlikely an AVX512 user, but depends on the workload and the > scheduling scenario, it also could be a false negative mentioned above. > > +3.13 /proc/<pid>/fd - List of symlinks to open files > +------------------------------------------------------- > +This directory contains symbolic links which represent open files > +the process is maintaining. Example output:: > + > + lr-x------ 1 root root 64 Sep 20 17:53 0 -> /dev/null > + l-wx------ 1 root root 64 Sep 20 17:53 1 -> /dev/null > + lrwx------ 1 root root 64 Sep 20 17:53 10 -> 'socket:[12539]' > + lrwx------ 1 root root 64 Sep 20 17:53 11 -> 'socket:[12540]' > + lrwx------ 1 root root 64 Sep 20 17:53 12 -> 'socket:[12542]' > + > +The number of open files for the process is stored in 'size' member > +of stat() output for /proc/<pid>/fd for fast access. > +------------------------------------------------------- > + > + > Chapter 4: Configuring procfs > ============================= > > diff --git a/fs/proc/fd.c b/fs/proc/fd.c > index 913bef0d2a36..ff526dfc5faa 100644 > --- a/fs/proc/fd.c > +++ b/fs/proc/fd.c > @@ -279,6 +279,34 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx, > return 0; > } > > +static int proc_readfd_count(struct inode *inode) > +{ > + struct task_struct *p = get_proc_task(inode); > + struct fdtable *fdt; > + unsigned int i, size, open_fds = 0; > + > + if (!p) > + return -ENOENT; > + > + task_lock(p); > + if (p->files) { > + rcu_read_lock(); > + > + fdt = files_fdtable(p->files); > + size = fdt->max_fds; > + > + for (i = size / BITS_PER_LONG; i > 0;) > + open_fds += hweight64(fdt->open_fds[--i]); Could BITMAP_WEIGHT() or __bitmap_weight() or bitmap_weight() be used here? > + rcu_read_unlock(); > + } > + task_unlock(p); > + > + put_task_struct(p); > + > + return open_fds; > +} > + > static int proc_readfd(struct file *file, struct dir_context *ctx) > { > return proc_readfd_common(file, ctx, proc_fd_instantiate); > @@ -319,9 +347,25 @@ int proc_fd_permission(struct user_namespace *mnt_userns, > return rv; > } > > +static int proc_fd_getattr(struct user_namespace *mnt_userns, > + const struct path *path, struct kstat *stat, > + u32 request_mask, unsigned int query_flags) > +{ > + struct inode *inode = d_inode(path->dentry); > + > + generic_fillattr(&init_user_ns, inode, stat); > + > + /* If it's a directory, put the number of open fds there */ > + if (S_ISDIR(inode->i_mode)) > + stat->size = proc_readfd_count(inode); > + > + return 0; > +} > + > const struct inode_operations proc_fd_inode_operations = { > .lookup = proc_lookupfd, > .permission = proc_fd_permission, > + .getattr = proc_fd_getattr, > .setattr = proc_setattr, > }; > > -- > 2.37.2
On Mon, Oct 17, 2022 at 6:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > + > > + fdt = files_fdtable(p->files); > > + size = fdt->max_fds; > > + > > + for (i = size / BITS_PER_LONG; i > 0;) > > + open_fds += hweight64(fdt->open_fds[--i]); > > Could BITMAP_WEIGHT() or __bitmap_weight() or bitmap_weight() be used here? That's a great suggestion. I tested it with bitmap_weight() and it looks much cleaner while providing the same result. I just sent the v3 with this suggestion applied.
On Tue, Oct 18, 2022 at 6:02 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Mon, Oct 17, 2022 at 6:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > + > > > + fdt = files_fdtable(p->files); > > > + size = fdt->max_fds; > > > + > > > + for (i = size / BITS_PER_LONG; i > 0;) > > > + open_fds += hweight64(fdt->open_fds[--i]); > > > > Could BITMAP_WEIGHT() or __bitmap_weight() or bitmap_weight() be used here? > > That's a great suggestion. I tested it with bitmap_weight() and it > looks much cleaner while providing the same result. > > I just sent the v3 with this suggestion applied. +1 from me on using bitmap_weight() - good spotting that. FrankH.
From: Frank Hofmann > Sent: 18 October 2022 09:13 > > On Tue, Oct 18, 2022 at 6:02 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > > > On Mon, Oct 17, 2022 at 6:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > + > > > > + fdt = files_fdtable(p->files); > > > > + size = fdt->max_fds; > > > > + > > > > + for (i = size / BITS_PER_LONG; i > 0;) > > > > + open_fds += hweight64(fdt->open_fds[--i]); > > > > > > Could BITMAP_WEIGHT() or __bitmap_weight() or bitmap_weight() be used here? > > > > That's a great suggestion. I tested it with bitmap_weight() and it > > looks much cleaner while providing the same result. > > > > I just sent the v3 with this suggestion applied. > > +1 from me on using bitmap_weight() - good spotting that. Does that have the optimisations for the value being 0, ~0u or 2**n-1 all of which are likely for the fd table. (Especially if there is no 'popcnt' instruction.) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, Oct 18, 2022 at 9:25 AM David Laight <David.Laight@aculab.com> wrote: > > From: Frank Hofmann > > Sent: 18 October 2022 09:13 > > > > On Tue, Oct 18, 2022 at 6:02 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > > > > > On Mon, Oct 17, 2022 at 6:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > + > > > > > + fdt = files_fdtable(p->files); > > > > > + size = fdt->max_fds; > > > > > + > > > > > + for (i = size / BITS_PER_LONG; i > 0;) > > > > > + open_fds += hweight64(fdt->open_fds[--i]); > > > > > > > > Could BITMAP_WEIGHT() or __bitmap_weight() or bitmap_weight() be used here? > > > > > > That's a great suggestion. I tested it with bitmap_weight() and it > > > looks much cleaner while providing the same result. > > > > > > I just sent the v3 with this suggestion applied. > > > > +1 from me on using bitmap_weight() - good spotting that. > > Does that have the optimisations for the value being 0, ~0u > or 2**n-1 all of which are likely for the fd table. > (Especially if there is no 'popcnt' instruction.) > > David bitmap_weight() uses hweight_*() under the hood, which then falls through to platform-specific popcnt where available. Re, lib/bitmap.c and arch/.../asm/bitops.h or arch/.../hweight.S, for the impl details. FrankH. > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index e7aafc82be99..394548d26187 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 3.10 /proc/<pid>/timerslack_ns - Task timerslack value 3.11 /proc/<pid>/patch_state - Livepatch patch operation state 3.12 /proc/<pid>/arch_status - Task architecture specific information + 3.13 /proc/<pid>/fd - List of symlinks to open files 4 Configuring procfs 4.1 Mount options @@ -2145,6 +2146,22 @@ AVX512_elapsed_ms the task is unlikely an AVX512 user, but depends on the workload and the scheduling scenario, it also could be a false negative mentioned above. +3.13 /proc/<pid>/fd - List of symlinks to open files +------------------------------------------------------- +This directory contains symbolic links which represent open files +the process is maintaining. Example output:: + + lr-x------ 1 root root 64 Sep 20 17:53 0 -> /dev/null + l-wx------ 1 root root 64 Sep 20 17:53 1 -> /dev/null + lrwx------ 1 root root 64 Sep 20 17:53 10 -> 'socket:[12539]' + lrwx------ 1 root root 64 Sep 20 17:53 11 -> 'socket:[12540]' + lrwx------ 1 root root 64 Sep 20 17:53 12 -> 'socket:[12542]' + +The number of open files for the process is stored in 'size' member +of stat() output for /proc/<pid>/fd for fast access. +------------------------------------------------------- + + Chapter 4: Configuring procfs ============================= diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 913bef0d2a36..ff526dfc5faa 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -279,6 +279,34 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx, return 0; } +static int proc_readfd_count(struct inode *inode) +{ + struct task_struct *p = get_proc_task(inode); + struct fdtable *fdt; + unsigned int i, size, open_fds = 0; + + if (!p) + return -ENOENT; + + task_lock(p); + if (p->files) { + rcu_read_lock(); + + fdt = files_fdtable(p->files); + size = fdt->max_fds; + + for (i = size / BITS_PER_LONG; i > 0;) + open_fds += hweight64(fdt->open_fds[--i]); + + rcu_read_unlock(); + } + task_unlock(p); + + put_task_struct(p); + + return open_fds; +} + static int proc_readfd(struct file *file, struct dir_context *ctx) { return proc_readfd_common(file, ctx, proc_fd_instantiate); @@ -319,9 +347,25 @@ int proc_fd_permission(struct user_namespace *mnt_userns, return rv; } +static int proc_fd_getattr(struct user_namespace *mnt_userns, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int query_flags) +{ + struct inode *inode = d_inode(path->dentry); + + generic_fillattr(&init_user_ns, inode, stat); + + /* If it's a directory, put the number of open fds there */ + if (S_ISDIR(inode->i_mode)) + stat->size = proc_readfd_count(inode); + + return 0; +} + const struct inode_operations proc_fd_inode_operations = { .lookup = proc_lookupfd, .permission = proc_fd_permission, + .getattr = proc_fd_getattr, .setattr = proc_setattr, };
Many monitoring tools include open file count as a metric. Currently the only way to get this number is to enumerate the files in /proc/pid/fd. The problem with the current approach is that it does many things people generally don't care about when they need one number for a metric. In our tests for cadvisor, which reports open file counts per cgroup, we observed that reading the number of open files is slow. Out of 35.23% of CPU time spent in `proc_readfd_common`, we see 29.43% spent in `proc_fill_cache`, which is responsible for filling dentry info. Some of this extra time is spinlock contention, but it's a contention for the lock we don't want to take to begin with. We considered putting the number of open files in /proc/pid/status. Unfortunately, counting the number of fds involves iterating the open_files bitmap, which has a linear complexity in proportion with the number of open files (bitmap slots really, but it's close). We don't want to make /proc/pid/status any slower, so instead we put this info in /proc/pid/fd as a size member of the stat syscall result. Previously the reported number was zero, so there's very little risk of breaking anything, while still providing a somewhat logical way to count the open files with a fallback if it's zero. RFC for this patch included iterating open fds under RCU. Thanks to Frank Hofmann for the suggestion to use the bitmap instead. Previously: ``` $ sudo stat /proc/1/fd | head -n2 File: /proc/1/fd Size: 0 Blocks: 0 IO Block: 1024 directory ``` With this patch: ``` $ sudo stat /proc/1/fd | head -n2 File: /proc/1/fd Size: 65 Blocks: 0 IO Block: 1024 directory ``` Correctness check: ``` $ sudo ls /proc/1/fd | wc -l 65 ``` I added the docs for /proc/<pid>/fd while I'm at it. Signed-off-by: Ivan Babrou <ivan@cloudflare.com> --- v2: Added missing rcu_read_lock() / rcu_read_unlock(), task_lock() / task_unlock() and put_task_struct(). --- Documentation/filesystems/proc.rst | 17 ++++++++++++ fs/proc/fd.c | 44 ++++++++++++++++++++++++++++++ 2 files changed, 61 insertions(+)