Message ID: 20220305160424.1040102-1-amir73il@gmail.com
Series: Generic per-sb io stats
On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
>
> Dave Chinner asked why the io stats should not be enabled for all
> filesystems. That change seems too bold for me so instead, I included
> an extra patch to auto-enable per-sb io stats for blockdev filesystems.

Perhaps something to consider is allowing users to enable or disable
I/O stats on a per-mount basis?

Consider if a potential future user of this feature has servers with
one or two 256-core AMD Epyc chips, and suppose that they have several
thousand iSCSI-mounted file systems containing various software
packages for use by Kubernetes jobs.  (Or even several thousand
mounted overlay file systems.....)

The size of the percpu counter is going to be *big* on a large CPU
count machine, and the iostats structure has 5 of these per-cpu
counters, so if you have one for every single mounted file system,
even if the CPU slowdown isn't significant, the non-swappable kernel
memory overhead might be quite large.

So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
some kind of global option indicating whether the default should be
iostats enabled or disabled?  Bonus points if iostats can be enabled
or disabled after the initial mount via a remount operation.

I could imagine some people being interested in enabling iostats only
on certain file systems, or certain classes of block devices --- so
they might want it enabled on some ext4 file systems which are
attached to physical devices, but not on the N thousand iSCSI or nbd
mounts that also happen to be using ext4.

Cheers,

						- Ted
On Sun, Mar 6, 2022 at 6:18 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> >
> > Dave Chinner asked why the io stats should not be enabled for all
> > filesystems. That change seems too bold for me so instead, I included
> > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
>
> Perhaps something to consider is allowing users to enable or disable
> I/O stats on a per-mount basis?
>
> Consider if a potential future user of this feature has servers with
> one or two 256-core AMD Epyc chips, and suppose that they have several
> thousand iSCSI-mounted file systems containing various software
> packages for use by Kubernetes jobs.  (Or even several thousand
> mounted overlay file systems.....)
>
> The size of the percpu counter is going to be *big* on a large CPU
> count machine, and the iostats structure has 5 of these per-cpu
> counters, so if you have one for every single mounted file system,
> even if the CPU slowdown isn't significant, the non-swappable kernel
> memory overhead might be quite large.
>
> So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> some kind of global option indicating whether the default should be
> iostats enabled or disabled?  Bonus points if iostats can be enabled
> or disabled after the initial mount via a remount operation.
>
> I could imagine some people being interested in enabling iostats only
> on certain file systems, or certain classes of block devices --- so
> they might want it enabled on some ext4 file systems which are
> attached to physical devices, but not on the N thousand iSCSI or nbd
> mounts that also happen to be using ext4.
>

Those were my thoughts as well.
As a matter of fact, I started to have a go at implementing
"iostats"/"noiostats", and then I realized I have no clue how the
designers of the new mount option parser API intended for new generic
mount options like these to be added, so I ended up reusing
SB_MAND_LOCK for the test patch.

Was I supposed to extend the struct fs_context fields
sb_flags/sb_flags_mask to unsigned long and add new common SB_ flags
in the high 32 bits, which could only be set via fsopen()/fsconfig()
on a 64bit arch?

Or did the designers have something completely different in mind?
Perhaps the scope of the new mount API was never to deal with running
out of space for common SB_ flags?

Thanks,
Amir.
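[For context, a toy model of the bit-space question raised above. The
SB_ flag values are the real ones from include/linux/fs.h, but
SB_IOSTATS is purely hypothetical, invented here to illustrate why a
new common flag would spill past 32 bits:]

```python
# sb_flags/sb_flags_mask in struct fs_context are 32-bit today; a new
# common flag placed in the "high 32 bits" would require widening them
# to unsigned long, and could only be set on a 64-bit arch.
SB_RDONLY    = 1 << 0   # real flag (include/linux/fs.h)
SB_MAND_LOCK = 1 << 6   # real flag, the one reused by the test patch
SB_LAZYTIME  = 1 << 25  # real flag, highest common SB_ flag in use

SB_IOSTATS   = 1 << 32  # hypothetical new flag in the high 32 bits

def fits_in_u32(flag):
    """Would this flag fit in the current 32-bit sb_flags field?"""
    return flag < (1 << 32)

assert fits_in_u32(SB_LAZYTIME)      # existing flags fit
assert not fits_in_u32(SB_IOSTATS)   # a high-bits flag would not
```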
On Sat, Mar 05, 2022 at 11:18:34PM -0500, Theodore Ts'o wrote:
> On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> >
> > Dave Chinner asked why the io stats should not be enabled for all
> > filesystems. That change seems too bold for me so instead, I included
> > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
>
> Perhaps something to consider is allowing users to enable or disable
> I/O stats on a per-mount basis?
>
> Consider if a potential future user of this feature has servers with
> one or two 256-core AMD Epyc chips, and suppose that they have several
> thousand iSCSI-mounted file systems containing various software
> packages for use by Kubernetes jobs.  (Or even several thousand
> mounted overlay file systems.....)
>
> The size of the percpu counter is going to be *big* on a large CPU
> count machine, and the iostats structure has 5 of these per-cpu
> counters, so if you have one for every single mounted file system,
> even if the CPU slowdown isn't significant, the non-swappable kernel
> memory overhead might be quite large.

A percpu counter on a 256 core machine is ~1kB. Adding 5kB to the
struct superblock isn't a big deal for a machine of this size, even
if you have thousands of superblocks - we're talking a few
*megabytes* of extra memory in a machine that would typically have
hundreds of GB of RAM. Seriously, the memory overhead of the per-cpu
counters is noise compared to the memory footprint of, say, the
stacks needing to be allocated for every background worker thread
the filesystem needs.

Yeah, I know, we have ~175 per-cpu stats counters per XFS superblock
(we already cover the 4 counters Amir is proposing to add as generic
SB counters), and we have half a dozen dedicated worker threads per
mount. Yet systems still function just fine when there are thousands
of XFS filesystems and thousands of CPUs.
Seriously, a small handful of per-cpu counters that capture
information for all superblocks is not a big deal. Small systems will
have relatively little overhead, large systems have the memory to
handle it.

> So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> some kind of global option indicating whether the default should be
> iostats enabled or disabled?  Bonus points if iostats can be enabled
> or disabled after the initial mount via a remount operation.

Can we please just avoid mount options for stuff like this? It'll
just never get tested unless it defaults to on, and then almost
no-one will ever turn it off, because why would you bother tweaking
something that has no noticeable impact but can give useful insights
into the workload that is running?

I don't care one way or another here, because this is essentially
duplicating something we've had in XFS for 20+ years. What I want to
avoid is blowing out the test matrix even further. Adding optional
features has a cost in terms of testing time, so if it's a feature
that is only rarely going to be turned on then we shouldn't add it
at all. If it's only rarely going to be turned off, OTOH, then we
should just make it ubiquitous and available for everything so it's
always tested.

Hence, AFAICT, the only real option for yes/no support is the
Kconfig option. If the kernel builder turns it on, it is on for
everything, otherwise it is off for everything.

> I could imagine some people being interested in enabling iostats only
> on certain file systems, or certain classes of block devices --- so
> they might want it enabled on some ext4 file systems which are
> attached to physical devices, but not on the N thousand iSCSI or nbd
> mounts that also happen to be using ext4.

That seems ... fairly contrived. Block device IO stats are not turned
on and off based on the block device type - they are generic. Network
device stats are not turned on and off based on the network device -
they are generic.
Why should per-filesystem IO stats be special and different to
everything else?

Cheers,

Dave.
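[The back-of-the-envelope numbers in this message check out; a quick
sketch. The 4-byte slot matches the s32 per-CPU slots that struct
percpu_counter allocates; the superblock count is illustrative, and
the few dozen bytes of per-counter bookkeeping (lock, s64 count, list
head) are ignored as noise:]

```python
# Rough memory cost of per-sb percpu counters on a large machine.
cpus = 256                # e.g. dual 128-core / 256-thread Epyc box
slot_bytes = 4            # one s32 slot per possible CPU
counters_per_sb = 5       # the proposed iostats structure has 5 counters
superblocks = 1000        # "several thousand" iSCSI/overlay mounts

per_counter = cpus * slot_bytes           # ~1 KiB, matching "~1kB" above
per_sb = counters_per_sb * per_counter    # ~5 KiB added per superblock
total = superblocks * per_sb              # a few MiB across all mounts

print(per_counter, per_sb, total)  # 1024 5120 5120000
```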
On Mon, Mar 7, 2022 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Mar 05, 2022 at 11:18:34PM -0500, Theodore Ts'o wrote:
> > On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> > >
> > > Dave Chinner asked why the io stats should not be enabled for all
> > > filesystems. That change seems too bold for me so instead, I included
> > > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
> >
> > Perhaps something to consider is allowing users to enable or disable
> > I/O stats on a per-mount basis?
> >
> > Consider if a potential future user of this feature has servers with
> > one or two 256-core AMD Epyc chips, and suppose that they have several
> > thousand iSCSI-mounted file systems containing various software
> > packages for use by Kubernetes jobs.  (Or even several thousand
> > mounted overlay file systems.....)
> >
> > The size of the percpu counter is going to be *big* on a large CPU
> > count machine, and the iostats structure has 5 of these per-cpu
> > counters, so if you have one for every single mounted file system,
> > even if the CPU slowdown isn't significant, the non-swappable kernel
> > memory overhead might be quite large.
>
> A percpu counter on a 256 core machine is ~1kB. Adding 5kB to the
> struct superblock isn't a big deal for a machine of this size, even
> if you have thousands of superblocks - we're talking a few
> *megabytes* of extra memory in a machine that would typically have
> hundreds of GB of RAM. Seriously, the memory overhead of the per-cpu
> counters is noise compared to the memory footprint of, say, the
> stacks needing to be allocated for every background worker thread
> the filesystem needs.
>
> Yeah, I know, we have ~175 per-cpu stats counters per XFS superblock
> (we already cover the 4 counters Amir is proposing to add as generic
> SB counters), and we have half a dozen dedicated worker threads per
> mount.
> Yet systems still function just fine when there are thousands
> of XFS filesystems and thousands of CPUs.
>
> Seriously, a small handful of per-cpu counters that capture
> information for all superblocks is not a big deal. Small systems
> will have relatively little overhead, large systems have the memory
> to handle it.
>
> > So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> > some kind of global option indicating whether the default should be
> > iostats enabled or disabled?  Bonus points if iostats can be enabled
> > or disabled after the initial mount via a remount operation.
>
> Can we please just avoid mount options for stuff like this? It'll
> just never get tested unless it defaults to on, and then almost
> no-one will ever turn it off, because why would you bother tweaking
> something that has no noticeable impact but can give useful insights
> into the workload that is running?
>
> I don't care one way or another here, because this is essentially
> duplicating something we've had in XFS for 20+ years. What I want to
> avoid is blowing out the test matrix even further. Adding optional
> features has a cost in terms of testing time, so if it's a feature
> that is only rarely going to be turned on then we shouldn't add it
> at all. If it's only rarely going to be turned off, OTOH, then we
> should just make it ubiquitous and available for everything so it's
> always tested.
>
> Hence, AFAICT, the only real option for yes/no support is the
> Kconfig option. If the kernel builder turns it on, it is on for
> everything, otherwise it is off for everything.
>

I agree with this sentiment, and I also share Ted's concerns that we
may be overlooking some aspect, so my preference would be that Miklos
takes the sb_iostats infra patches through his tree to enable iostats
for fuse/overlayfs (I argued in the commit message why I think they
deserve special treatment).
Regarding the last patch - Ted, would you be more comfortable if it
came with yet another Kconfig option (e.g. FS_IOSTATS_DEFAULT)? Or
perhaps with a /proc/sys/fs/ fail-safe off switch (like
protected_symlinks)? That gives more options to distros.

Thanks,
Amir.
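[A sketch of what the Kconfig shape floated above might look like.
FS_IOSTATS_DEFAULT is the name mentioned in the message; the rest of
the fragment is hypothetical and not taken from the actual series:]

```kconfig
config FS_IOSTATS
	bool "Generic per-superblock I/O statistics"
	help
	  Maintain per-superblock percpu counters of I/O activity,
	  available to all filesystems that opt in. If disabled, no
	  counters are allocated and no runtime overhead is incurred.

config FS_IOSTATS_DEFAULT
	bool "Enable per-superblock I/O statistics by default"
	depends on FS_IOSTATS
	help
	  Whether per-sb iostats start out enabled. The default could
	  still be overridden at runtime via a /proc/sys/fs/ knob,
	  similar in spirit to protected_symlinks.
```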