
[v4,0/9] Generic per-sb io stats

Message ID 20220305160424.1040102-1-amir73il@gmail.com (mailing list archive)

Message

Amir Goldstein March 5, 2022, 4:04 p.m. UTC
Miklos,

I ran some micro benchmarks on the v3 patch [1], which demonstrated up to
20% slowdown for some workloads (many small reads/writes in a small VM).
This revision adds "relaxed" percpu counter helpers to mitigate the
overhead of the iostats counters.

With the relaxed counters, the micro benchmarks that I ran did not
demonstrate any measurable overhead on xfs, on overlayfs over xfs,
or on overlayfs over tmpfs.
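
To give a rough idea of the "relaxed" approach (this is only an
illustration, not the helpers from patch 1 -- the names and the batch
value below are made up): the write side stays a cheap per-cpu add and
readers pay the cost of summing:

#include <linux/percpu_counter.h>

static inline void percpu_counter_add_relaxed(struct percpu_counter *fbc,
                                              s64 amount)
{
        /* Huge batch: the hot path almost never takes fbc->lock. */
        percpu_counter_add_batch(fbc, amount, 1 << 30);
}

static inline s64 percpu_counter_sum_relaxed(struct percpu_counter *fbc)
{
        /* Readers fold all the per-cpu deltas; this is the slow path. */
        return percpu_counter_sum(fbc);
}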

Dave Chinner asked why the io stats should not be enabled for all
filesystems.  That change seems too bold for me, so instead I included
an extra patch to auto-enable per-sb io stats for blockdev filesystems.
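
The opt-in itself is just a bit in file_system_type->fs_flags, roughly
as in the sketch below for overlayfs (the FS_IOSTATS name here is only
illustrative; see the patches for the real flag):

static struct file_system_type ovl_fs_type = {
        .owner          = THIS_MODULE,
        .name           = "overlay",
        .fs_flags       = FS_USERNS_MOUNT | FS_IOSTATS, /* opt in to per-sb io stats */
        .mount          = ovl_mount,
        .kill_sb        = kill_anon_super,
};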

Should you decide to take the patches for enabling io stats for
overlayfs and/or fuse through your tree, it is up to you whether you
want to take this patch as well or leave it out until more people have
had a chance to test it and run more performance tests on their setups.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20220301184221.371853-1-amir73il@gmail.com/

Changes since v3:
- Use "relaxed" counters to reduce performance overhead
- Opt-in to per-sb io stats via fs_flags (dchinner)
- Add patch to auto-enable io stats for all blockdev fs (dchinner)

Changes since v2:
- Change from per-mount to per-sb io stats (szeredi)
- Avoid percpu loop when reading mountstats (dchinner)

Changes since v1:
- Opt-in for per-mount io stats for overlayfs and fuse

Amir Goldstein (9):
  lib/percpu_counter: add helpers for "relaxed" counters
  lib/percpu_counter: add helpers for arrays of counters
  fs: tidy up fs_flags definitions
  fs: add optional iostats counters to struct super_block
  fs: collect per-sb io stats
  fs: report per-sb io stats
  ovl: opt-in for per-sb io stats
  fuse: opt-in for per-sb io stats
  fs: enable per-sb io stats for all blockdev filesystems

 fs/Kconfig                     |   8 ++
 fs/fuse/inode.c                |   3 +-
 fs/nfsd/export.c               |  10 ++-
 fs/nfsd/nfscache.c             |   5 +-
 fs/nfsd/stats.c                |  37 +---------
 fs/nfsd/stats.h                |   3 -
 fs/overlayfs/super.c           |   3 +-
 fs/proc_namespace.c            |  16 ++++
 fs/read_write.c                |  88 ++++++++++++++++------
 fs/super.c                     |  11 +++
 include/linux/fs.h             |  25 ++++---
 include/linux/fs_iostats.h     | 130 +++++++++++++++++++++++++++++++++
 include/linux/percpu_counter.h |  48 ++++++++++++
 lib/percpu_counter.c           |  27 +++++++
 14 files changed, 337 insertions(+), 77 deletions(-)
 create mode 100644 include/linux/fs_iostats.h
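
For reviewers who prefer not to open the patches, the gist of the new
include/linux/fs_iostats.h is a small optional array of percpu counters
attached to struct super_block, roughly along these lines (a sketch;
the exact counter set and names are in the patches):

#include <linux/percpu_counter.h>

enum {
        SB_IOSTATS_CHARS_RD,    /* bytes requested by read syscalls */
        SB_IOSTATS_CHARS_WR,    /* bytes requested by write syscalls */
        SB_IOSTATS_SYSCALLS_RD, /* number of read syscalls */
        SB_IOSTATS_SYSCALLS_WR, /* number of write syscalls */
        SB_IOSTATS_COUNTERS_NUM
};

struct sb_iostats {
        struct percpu_counter   counter[SB_IOSTATS_COUNTERS_NUM];
};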

Comments

Theodore Ts'o March 6, 2022, 4:18 a.m. UTC | #1
On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> 
> Dave Chinner asked why the io stats should not be enabled for all
> filesystems.  That change seems too bold for me so instead, I included
> an extra patch to auto-enable per-sb io stats for blockdev filesystems.

Perhaps something to consider is allowing users to enable or disable
I/O stats on a per-mount basis?

Consider if a potential future user of this feature has servers with
one or two 256-core AMD Epyc chips, and suppose that they have
several thousand iSCSI-mounted file systems containing various
software packages for use by Kubernetes jobs.  (Or even several
thousand mounted overlay file systems.....)

The size of the percpu counter is going to be *big* on a large CPU
count machine, and the iostats structure has 5 of these per-cpu
counters, so if you have one for every single mounted file system,
even if the CPU slowdown isn't significant, the non-swappable kernel
memory overhead might be quite large.

So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
some kind of global option indicating whether the default should be
iostats being enabled or disabled?  Bonus points if iostats can be
enabled or disabled after the initial mount via remount operation.

I could imagine some people only being interested in enabling iostats on
certain file systems, or certain classes of block devices --- so they
might want it enabled on some ext4 file systems which are attached to
physical devices, but not on the N thousand iSCSI or nbd mounts that
also happen to be using ext4.

Cheers,

						- Ted
Amir Goldstein March 6, 2022, 7:55 a.m. UTC | #2
On Sun, Mar 6, 2022 at 6:18 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> >
> > Dave Chinner asked why the io stats should not be enabled for all
> > filesystems.  That change seems too bold for me so instead, I included
> > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
>
> Perhaps something to consider is allowing users to enable or disable
> I/O stats on a per-mount basis?
>
> Consider if a potential future user of this feature has servers with
> one or two 256-core AMD Epyc chips, and suppose that they have
> several thousand iSCSI-mounted file systems containing various
> software packages for use by Kubernetes jobs.  (Or even several
> thousand mounted overlay file systems.....)
>
> The size of the percpu counter is going to be *big* on a large CPU
> count machine, and the iostats structure has 5 of these per-cpu
> counters, so if you have one for every single mounted file system,
> even if the CPU slowdown isn't significant, the non-swappable kernel
> memory overhead might be quite large.
>
> So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> some kind of global option indicating whether the default should be
> iostats being enabled or disabled?  Bonus points if iostats can be
> enabled or disabled after the initial mount via remount operation.
>
> I could imagine some people only being interested in enabling iostats on
> certain file systems, or certain classes of block devices --- so they
> might want it enabled on some ext4 file systems which are attached to
> physical devices, but not on the N thousand iSCSI or nbd mounts that
> also happen to be using ext4.
>

Those were my thoughts as well.

As a matter of fact, I started to have a go at implementing
"iostats"/"noiostats", and then I realized that I have no clue how the
designers of the new mount option parser API intended new generic mount
options like these to be added, so I ended up reusing SB_MAND_LOCK for
the test patch.

Was I supposed to extend the struct fs_context fields sb_flags/sb_flags_mask
to unsigned long and add new common SB_ flags in the high 32 bits, which can
only be set via fsopen()/fsconfig() on a 64-bit arch?

Or did the designers have something completely different in mind?

Perhaps the scope of the new mount API was never to deal with running out of
space for common SB_ flags?
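
(For reference, the userspace side I am talking about looks roughly
like the sketch below -- raw syscalls, since there are no glibc
wrappers -- and the generic "iostats" key is exactly the part that
does not exist today:)

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

static int mount_with_iostats(const char *fstype, const char *source)
{
        int fsfd = syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);

        if (fsfd < 0)
                return -1;
        syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", source, 0);
        /* Hypothetical generic option -- no kernel accepts this today. */
        syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FLAG, "iostats", NULL, 0);
        syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
        return syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
}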

Thanks,
Amir.
Dave Chinner March 7, 2022, 12:14 a.m. UTC | #3
On Sat, Mar 05, 2022 at 11:18:34PM -0500, Theodore Ts'o wrote:
> On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> > 
> > Dave Chinner asked why the io stats should not be enabled for all
> > filesystems.  That change seems too bold for me so instead, I included
> > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
> 
> Perhaps something to consider is allowing users to enable or disable
> I/O stats on a per-mount basis?
> 
> Consider if a potential future user of this feature has servers with
> one or two 256-core AMD Epyc chips, and suppose that they have
> several thousand iSCSI-mounted file systems containing various
> software packages for use by Kubernetes jobs.  (Or even several
> thousand mounted overlay file systems.....)
> 
> The size of the percpu counter is going to be *big* on a large CPU
> count machine, and the iostats structure has 5 of these per-cpu
> counters, so if you have one for every single mounted file system,
> even if the CPU slowdown isn't significant, the non-swappable kernel
> memory overhead might be quite large.

A percpu counter on a 256-core machine is ~1kB. Adding 5kB to the
struct superblock isn't a big deal for a machine of this size, even
if you have thousands of superblocks - we're talking a few
*megabytes* of extra memory in a machine that would typically have
hundreds of GB of RAM. Seriously, the memory overhead of the per-cpu
counters is noise compared to the memory footprint of, say, the
stacks needing to be allocated for every background worker thread
the filesystem needs.
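
To spell the arithmetic out (one s32 slot per possible CPU per
counter, ignoring percpu allocator slack):

   256 CPUs * 4 bytes (s32)      ~= 1 KiB per percpu counter
   5 counters * 1 KiB            ~= 5 KiB per superblock
   2000 superblocks * 5 KiB      ~= 10 MiB total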

Yeah, I know, we have ~175 per-cpu stats counters per XFS superblock
(we already cover the 4 counters Amir is proposing to add as generic
SB counters), and we have half a dozen dedicated worker threads per
mount. Yet systems still function just fine when there are thousands
of XFS filesystems and thousands of CPUs.

Seriously, a small handful of per-cpu counters that capture
information for all superblocks is not a big deal. Small systems
will have relatively little overhead, large systems have the memory
to handle it.

> So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> some kind of global option indicating whether the default should be
> iostats being enabled or disabled?  Bonus points if iostats can be
> enabled or disabled after the initial mount via remount operation.

Can we please just avoid mount options for stuff like this? It'll
just never get tested unless it defaults to on, and then almost
no-one will ever turn it off because why would you bother tweaking
something that has no noticeable impact but can give useful insights
into the workload that is running?

I don't care one way or another here because this is essentially
duplicating something we've had in XFS for 20+ years. What I want to
avoid is blowing out the test matrix even further. Adding optional
features has a cost in terms of testing time, so if it's a feature
that is only rarely going to be turned on then we shouldn't add it
at all. If it's only rarely going to be turned off, OTOH, then we
should just make it ubiquitous and available for everything so it's
always tested.

Hence, AFAICT, the only real option for yes/no support is the
Kconfig option. If the kernel builder turns it on, it is on for
everything, otherwise it is off for everything.
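
i.e. the usual pattern of a config symbol plus static inline stubs,
something like the sketch below (the symbol and helper names are
illustrative, not taken from the series):

#include <linux/fs.h>

#ifdef CONFIG_FS_IOSTATS
void sb_iostats_counter_add(struct super_block *sb, int id, s64 amount);
#else
/* Compiles away entirely when the option is off. */
static inline void sb_iostats_counter_add(struct super_block *sb, int id,
                                          s64 amount)
{
}
#endif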

> I could imagine some people only being interested in enabling iostats on
> certain file systems, or certain classes of block devices --- so they
> might want it enabled on some ext4 file systems which are attached to
> physical devices, but not on the N thousand iSCSI or nbd mounts that
> also happen to be using ext4.

That seems ... fairly contrived. Block device IO stats are not turned
on and off based on the block device type - they are generic.
Network device stats are not turned on and off based on the network
device - they are generic. Why should per-filesystem IO stats be
special and different to everything else?

Cheers,

Dave.
Amir Goldstein March 7, 2022, 10:04 a.m. UTC | #4
On Mon, Mar 7, 2022 at 2:14 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Mar 05, 2022 at 11:18:34PM -0500, Theodore Ts'o wrote:
> > On Sat, Mar 05, 2022 at 06:04:15PM +0200, Amir Goldstein wrote:
> > >
> > > Dave Chinner asked why the io stats should not be enabled for all
> > > filesystems.  That change seems too bold for me so instead, I included
> > > an extra patch to auto-enable per-sb io stats for blockdev filesystems.
> >
> > Perhaps something to consider is allowing users to enable or disable
> > I/O stats on a per-mount basis?
> >
> > Consider if a potential future user of this feature has servers with
> > one or two 256-core AMD Epyc chips, and suppose that they have
> > several thousand iSCSI-mounted file systems containing various
> > software packages for use by Kubernetes jobs.  (Or even several
> > thousand mounted overlay file systems.....)
> >
> > The size of the percpu counter is going to be *big* on a large CPU
> > count machine, and the iostats structure has 5 of these per-cpu
> > counters, so if you have one for every single mounted file system,
> > even if the CPU slowdown isn't significant, the non-swappable kernel
> > memory overhead might be quite large.
>
> A percpu counter on a 256-core machine is ~1kB. Adding 5kB to the
> struct superblock isn't a big deal for a machine of this size, even
> if you have thousands of superblocks - we're talking a few
> *megabytes* of extra memory in a machine that would typically have
> hundreds of GB of RAM. Seriously, the memory overhead of the per-cpu
> counters is noise compared to the memory footprint of, say, the
> stacks needing to be allocated for every background worker thread
> the filesystem needs.
>
> Yeah, I know, we have ~175 per-cpu stats counters per XFS superblock
> (we already cover the 4 counters Amir is proposing to add as generic
> SB counters), and we have half a dozen dedicated worker threads per
> mount. Yet systems still function just fine when there are thousands
> of XFS filesystems and thousands of CPUs.
>
> Seriously, a small handful of per-cpu counters that capture
> information for all superblocks is not a big deal. Small systems
> will have relatively little overhead, large systems have the memory
> to handle it.
>
> > So maybe a VFS-level mount option, say, "iostats" and "noiostats", and
> > some kind of global option indicating whether the default should be
> > iostats being enabled or disabled?  Bonus points if iostats can be
> > enabled or disabled after the initial mount via remount operation.
>
> Can we please just avoid mount options for stuff like this? It'll
> just never get tested unless it defaults to on, and then almost
> no-one will ever turn it off because why would you bother tweaking
> something that has no noticeable impact but can give useful insights
> into the workload that is running?
>
> I don't care one way or another here because this is essentially
> duplicating something we've had in XFS for 20+ years. What I want to
> avoid is blowing out the test matrix even further. Adding optional
> features has a cost in terms of testing time, so if it's a feature
> that is only rarely going to be turned on then we shouldn't add it
> at all. If it's only rarely going to be turned off, OTOH, then we
> should just make it ubiquitous and available for everything so it's
> always tested.
>
> Hence, AFAICT, the only real option for yes/no support is the
> Kconfig option. If the kernel builder turns it on, it is on for
> everything, otherwise it is off for everything.
>

I agree with this sentiment, and I also share Ted's concern
that we may be overlooking some aspect, so my preference would
be that Miklos takes the sb_iostats infra patches through his tree
to enable iostats for fuse/overlayfs (I argued in the commit message
why I think they deserve special treatment).

Regarding the last patch -
Ted, would you be more comfortable if it came with yet another
Kconfig option (e.g. FS_IOSTATS_DEFAULT)? Or perhaps with a /proc/sys/fs/
fail-safe off switch (like protected_symlinks)?
That would give distros more options.
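
Something along the lines of the sketch below is what I have in mind
for the sysctl variant, modelled on protected_symlinks (all names here
are hypothetical, nothing like this is in the series yet):

#include <linux/sysctl.h>

int sysctl_fs_iostats __read_mostly = 1;        /* default, could come from Kconfig */

static struct ctl_table fs_iostats_sysctls[] = {
        {
                .procname       = "iostats",
                .data           = &sysctl_fs_iostats,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = SYSCTL_ONE,
        },
        { }
};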

Thanks,
Amir.