[00/27,RFC,WIP] xfsprogs: xfs_buf unification and AIO

Message ID	20201015072155.1631135-1-david@fromorbit.com (mailing list archive)
Headers	show Return-Path: <SRS0=m1l6=DW=vger.kernel.org=linux-xfs-owner@kernel.org> From: Dave Chinner <david@fromorbit.com> To: linux-xfs@vger.kernel.org Subject: [PATCH 00/27] [RFC, WIP] xfsprogs: xfs_buf unification and AIO Date: Thu, 15 Oct 2020 18:21:28 +1100 Message-Id: <20201015072155.1631135-1-david@fromorbit.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	xfsprogs: xfs_buf unification and AIO \| expand [00/27,RFC,WIP] xfsprogs: xfs_buf unification and AIO [01/27] xfsprogs: remove unused buffer tracing code [02/27] xfsprogs: remove unused IO_DEBUG functionality [03/27] libxfs: get rid of b_bcount from xfs_buf [04/27] libxfs: rename buftarg->dev to btdev [05/27] xfsprogs: get rid of ancient btree tracing fragments [06/27] xfsprogs: remove xfs_buf_t typedef [07/27] xfsprogs: introduce liburcu support [08/27] libxfs: add spinlock_t wrapper [09/27] atomic: convert to uatomic [10/27] libxfs: add kernel-compatible completion API [11/27] libxfs: add wrappers for kernel semaphores [12/27] xfsprogs: convert use-once buffer reads to uncached IO [13/27] libxfs: introduce userspace buftarg infrastructure [14/27] xfs: rename libxfs_buftarg_init to libxfs_open_devices() [15/27] libxfs: introduce userspace buftarg infrastructure [16/27] libxfs: add a synchronous IO engine to the buftarg [17/27] xfsprogs: convert libxfs_readbufr to libxfs_buf_read_uncached [18/27] libxfs: convert libxfs_bwrite to buftarg IO [19/27] libxfs: add cache infrastructure to buftarg [20/27] libxfs: add internal lru to btcache [21/27] libxfs: Add kernel list_lru wrapper [22/27] libxfs: introduce new buffer cache infrastructure [23/27] libxfs: use PSI information to detect memory pressure [24/27] libxfs: add a buftarg cache shrinker implementation [25/27] libxfs: switch buffer cache implementations [26/27] build: set platform_defs.h.in dependency correctly [27/27] libxfs: convert sync IO buftarg engine to AIO

Message ID

20201015072155.1631135-1-david@fromorbit.com (mailing list archive)

Headers

From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 00/27] [RFC, WIP] xfsprogs: xfs_buf unification and AIO
Date: Thu, 15 Oct 2020 18:21:28 +1100
Message-Id: <20201015072155.1631135-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

xfsprogs: xfs_buf unification and AIO | expand

Message

Dave Chinner Oct. 15, 2020, 7:21 a.m. UTC

Hi folks,

Big rework. Aimed at unifying the user and kernel buffer caches so
the first thing to do is port the kernel buffer cache code and hack
it to pieces to use userspace AIO.

Overview:

This patchset introduces the kernel buffer caching mechanism to
userspace libxfs. THe core of this is per-AG indexing of the buffer
cache along with global scope LRU based reclaiming. Like the kernel,
it supports both cached and uncached buffers, and the implementation
of these APIs are identical between userspace and kernel space.

To allow the buffer cache implementation to be shared between kernel
and userspace, it needs to be split into two parts. The first is
the domain specific "target" abstraction that implements all the
kernel or userspace specific functions (e.g. memory allocation,
the per-AG cache indexing, the IO engine, etc). The second part is
the shared buffer cache and buffer IO API implementation.

TO use the kernel code, however, we need a bunch of infrastructure
userspace doesn't currently provide. These are atomic variables,
locking primitives, completions, memory barriers, LRU lists, memory
reclaim triggers, and so on.

The simplest, most portable choice for atomic variables and
memory barriers is to use the Userspace RCU library, which is
provides many of the same semantics and interfaces we use within the
kernel for atomic, memory barriers and RCU algorithms. Hence
xfsprogs grows a hard dependency on liburcu6, but this is packaged
on all major distros these days, so that's not a huge problem.

Locking APIs are fleshed out simply by using pthread locks. Spin
locks are not a reliable solution in userspace, so all the kernel
locking mechanisms wrap sleeping locks. We can even use a
pthread_mutex_t as a kernel semaphore replacement because, by
default, pthread_mutex_unlock() does not check the owner matches the
unlocking and hence, unlike kernel mutexes, can be used safely in
non-owner release situations.

For reclaiming the buffer cache when memory usage goes too high, we
make use of the kernel Pressure Stall Information (PSI) monitoring
API via /proc/pressure/memory. THis was introduced in kernel v5.2
and provides dynamic notification of the kernel memory allocation
stalling to reclaim memory. When we get such a notification, we
trigger the buffer cache shrinker to run to free cached buffers,
just like the kernel does. Hence we no longer need to care about how
much RAM we need to reserve for the buffer cache.

The patch series is structured in a way that:

- guts old tracing and debug interfaces 
- adds support infrastructure for atomics, locking, etc.
- creates a new buftarg abstraction
- restructures buftarg implemenation and usage
- moves uncached IO into new buftarg abstraction
- reworks userspace specific buffer read/write interfaces
- implements new functions the buftarg abstraction eeds to provide
- adds the kernel buffer cache implementation
- switches the implementation and removes the old rd/wr and caching
  code
- converts the buftarg IO engine to use AIO

The AIO engine is not fully functional yet - it does not support
discontiguous buffers correctly yet (needs an iocb per buffer map) -
but otherwise it largely works just fine.

There are -lots- of loose ends that still need to be worked on,
especially w.r.t. xfs_repair. Stuff that comes to mind:

- xfs_buf_mark_dirty() just currently calls xfs_bwrite() and does
  synchronous IO. This needs to be converted to a buftarg internal
  delwri buffer list and a flushing mechanism needs to be added.

- AIO engine does not support discontiguous buffers.

- work out best way to handle IOCBs for AIO - is embedding them in
  the xfs_buf the only way to do this efficiently?

- rationalise the xfs_buf/xfs_buftarg split. Work out where to define
  the stuff that is different between kernel and userspace (e.g. the
  struct xfs_buf) vs the stuff that remains the same (e.g. the XBF
  buffer flags)

- lots of code cleanup

- xfs_repair does not cache between phases right now - it
  unconditionally purges the AG cache at the end of AG processing.
  This keeps memory usage way down, and for filesystems that have
  too much metadata to cache entirely in RAM, it's *much* faster
  than xfs_repair v5.6.0. On smaller filesystems, however, hitting
  RAM caches is much more desirable. Hence there is work to be done
  here determining how to manage the caches effectively.

- async buffer readahead works in userspace now, but unless you have
  a very high IOP device (think in multiples of 100kiops) the
  xfs_repair prefetch mechanism that trades off bandwidth for iops is
  still faster. More investigative work is needed here.

- xfs_db could likely use readahead for btree traversals and other
  things.

- More memory pressure trigger testing needs to be done to determine
  if the trigger settings are sufficient to prevent OOM kills on
  different hardware and filesystem configurations.

- Replace AIO engine with io_uring engine?

- start working on splitting the kernel xfs_buf.[ch] code the same
  way as the userspace code and moving xfs_buf.[ch] to fs/xfs/libxfs
  so that they can be updated in sync.

There's plenty of other stuff, too, but this like is a good
start....

Cheers,

Dave.


Dave Chinner (27):
  xfsprogs: remove unused buffer tracing code
  xfsprogs: remove unused IO_DEBUG functionality
  libxfs: get rid of b_bcount from xfs_buf
  libxfs: rename buftarg->dev to btdev
  xfsprogs: get rid of ancient btree tracing fragments
  xfsprogs: remove xfs_buf_t typedef
  xfsprogs: introduce liburcu support
  libxfs: add spinlock_t wrapper
  atomic: convert to uatomic
  libxfs: add kernel-compatible completion API
  libxfs: add wrappers for kernel semaphores
  xfsprogs: convert use-once buffer reads to uncached IO
  libxfs: introduce userspace buftarg infrastructure
  xfs: rename libxfs_buftarg_init to libxfs_open_devices()
  libxfs: introduce userspace buftarg infrastructure
  libxfs: add a synchronous IO engine to the buftarg
  xfsprogs: convert libxfs_readbufr to libxfs_buf_read_uncached
  libxfs: convert libxfs_bwrite to buftarg IO
  libxfs: add cache infrastructure to buftarg
  libxfs: add internal lru to btcache
  libxfs: Add kernel list_lru wrapper
  libxfs: introduce new buffer cache infrastructure
  libxfs: use PSI information to detect memory pressure
  libxfs: add a buftarg cache shrinker implementation
  libxfs: switch buffer cache implementations
  build: set platform_defs.h.in dependency correctly
  libxfs: convert sync IO buftarg engine to AIO

 Makefile                   |    2 +-
 configure.ac               |    3 +
 copy/Makefile              |    3 +-
 copy/xfs_copy.c            |   15 +-
 db/Makefile                |    3 +-
 db/init.c                  |    4 +-
 db/io.c                    |   33 +-
 db/metadump.c              |    2 +-
 db/sb.c                    |    2 +-
 debian/control             |    2 +-
 growfs/Makefile            |    3 +-
 include/Makefile           |    6 +-
 include/atomic.h           |   63 +-
 include/builddefs.in       |    6 +-
 include/cache.h            |  133 ----
 include/completion.h       |   61 ++
 include/libxfs.h           |   30 +-
 include/libxlog.h          |    6 +-
 include/list_lru.h         |   69 ++
 include/platform_defs.h.in |    2 +
 include/sema.h             |   35 +
 include/spinlock.h         |   25 +
 include/xfs.h              |   23 +
 include/xfs_btree_trace.h  |   87 ---
 include/xfs_inode.h        |    3 +-
 include/xfs_mount.h        |   13 +
 include/xfs_trace.h        |   24 +
 include/xfs_trans.h        |    1 +
 libfrog/linux.c            |   14 +-
 libfrog/workqueue.c        |    3 +
 libxfs/Makefile            |    7 +-
 libxfs/buftarg.c           |  953 +++++++++++++++++++++++++
 libxfs/cache.c             |  724 -------------------
 libxfs/init.c              |  220 +++---
 libxfs/libxfs_api_defs.h   |    7 +
 libxfs/libxfs_io.h         |  185 +----
 libxfs/libxfs_priv.h       |   74 +-
 libxfs/logitem.c           |   10 +-
 libxfs/rdwr.c              | 1234 +-------------------------------
 libxfs/trans.c             |   21 +-
 libxfs/util.c              |    8 +-
 libxfs/xfs_alloc.c         |   16 +-
 libxfs/xfs_bmap.c          |    6 +-
 libxfs/xfs_btree.c         |   10 +-
 libxfs/xfs_buf.c           | 1357 ++++++++++++++++++++++++++++++++++++
 libxfs/xfs_buf.h           |  242 +++++++
 libxfs/xfs_buftarg.h       |  176 +++++
 libxfs/xfs_ialloc.c        |    4 +-
 libxfs/xfs_rtbitmap.c      |   22 +-
 libxlog/xfs_log_recover.c  |   23 +-
 logprint/Makefile          |    3 +-
 logprint/log_print_all.c   |    2 +-
 logprint/logprint.c        |    2 +-
 m4/Makefile                |    1 +
 m4/package_urcu.m4         |   22 +
 mdrestore/Makefile         |    3 +-
 mkfs/Makefile              |    2 +-
 mkfs/proto.c               |   11 +-
 mkfs/xfs_mkfs.c            |   61 +-
 repair/Makefile            |    2 +-
 repair/agheader.c          |    2 +-
 repair/attr_repair.c       |   14 +-
 repair/da_util.c           |    2 +-
 repair/da_util.h           |    2 +-
 repair/dino_chunks.c       |   14 +-
 repair/dinode.c            |    4 +-
 repair/incore.h            |    2 +-
 repair/phase3.c            |    7 +-
 repair/phase4.c            |    5 +-
 repair/phase5.c            |    2 +-
 repair/phase6.c            |    4 +-
 repair/prefetch.c          |  118 ++--
 repair/progress.c          |   16 +-
 repair/progress.h          |    4 +-
 repair/rt.c                |    4 +-
 repair/scan.c              |   10 +-
 repair/xfs_repair.c        |  192 +++--
 scrub/Makefile             |    3 +-
 scrub/progress.c           |    2 +
 79 files changed, 3644 insertions(+), 2847 deletions(-)
 delete mode 100644 include/cache.h
 create mode 100644 include/completion.h
 create mode 100644 include/list_lru.h
 create mode 100644 include/sema.h
 create mode 100644 include/spinlock.h
 delete mode 100644 include/xfs_btree_trace.h
 create mode 100644 libxfs/buftarg.c
 delete mode 100644 libxfs/cache.c
 create mode 100644 libxfs/xfs_buf.c
 create mode 100644 libxfs/xfs_buf.h
 create mode 100644 libxfs/xfs_buftarg.h
 create mode 100644 m4/package_urcu.m4

Comments

Dave Chinner Oct. 15, 2020, 7:29 a.m. UTC | #1

On Thu, Oct 15, 2020 at 06:21:28PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> Big rework. Aimed at unifying the user and kernel buffer caches so
> the first thing to do is port the kernel buffer cache code and hack
> it to pieces to use userspace AIO.

Just a quick addition - this seems to be smoke testing on fstests
just fine at the moment. It looks like there is nothing
catastrophically broken in the conversion (apart from discontiguous
buffers w/ AIO).

The xfsprogs base is the current for-next tree (5.9.0-rc1, IIRC),
and it applies on top of the config file patchset I posted earlier.

Cheers,

Dave.

Darrick J. Wong Oct. 15, 2020, 6:37 p.m. UTC | #2

On Thu, Oct 15, 2020 at 06:21:28PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> Big rework. Aimed at unifying the user and kernel buffer caches so
> the first thing to do is port the kernel buffer cache code and hack
> it to pieces to use userspace AIO.

Hurrah!

> Overview:
> 
> This patchset introduces the kernel buffer caching mechanism to
> userspace libxfs. THe core of this is per-AG indexing of the buffer
> cache along with global scope LRU based reclaiming. Like the kernel,
> it supports both cached and uncached buffers, and the implementation
> of these APIs are identical between userspace and kernel space.
> 
> To allow the buffer cache implementation to be shared between kernel
> and userspace, it needs to be split into two parts. The first is
> the domain specific "target" abstraction that implements all the
> kernel or userspace specific functions (e.g. memory allocation,
> the per-AG cache indexing, the IO engine, etc). The second part is
> the shared buffer cache and buffer IO API implementation.
> 
> TO use the kernel code, however, we need a bunch of infrastructure
> userspace doesn't currently provide. These are atomic variables,
> locking primitives, completions, memory barriers, LRU lists, memory
> reclaim triggers, and so on.
> 
> The simplest, most portable choice for atomic variables and
> memory barriers is to use the Userspace RCU library, which is
> provides many of the same semantics and interfaces we use within the
> kernel for atomic, memory barriers and RCU algorithms. Hence
> xfsprogs grows a hard dependency on liburcu6, but this is packaged
> on all major distros these days, so that's not a huge problem.
> 
> Locking APIs are fleshed out simply by using pthread locks. Spin
> locks are not a reliable solution in userspace, so all the kernel
> locking mechanisms wrap sleeping locks. We can even use a
> pthread_mutex_t as a kernel semaphore replacement because, by
> default, pthread_mutex_unlock() does not check the owner matches the
> unlocking and hence, unlike kernel mutexes, can be used safely in
> non-owner release situations.
> 
> For reclaiming the buffer cache when memory usage goes too high, we
> make use of the kernel Pressure Stall Information (PSI) monitoring
> API via /proc/pressure/memory. THis was introduced in kernel v5.2
> and provides dynamic notification of the kernel memory allocation
> stalling to reclaim memory. When we get such a notification, we
> trigger the buffer cache shrinker to run to free cached buffers,
> just like the kernel does. Hence we no longer need to care about how
> much RAM we need to reserve for the buffer cache.

<nod> I have a few comments about that, but they're all in the PSI
patch. :)

> The patch series is structured in a way that:
> 
> - guts old tracing and debug interfaces 
> - adds support infrastructure for atomics, locking, etc.
> - creates a new buftarg abstraction
> - restructures buftarg implemenation and usage
> - moves uncached IO into new buftarg abstraction
> - reworks userspace specific buffer read/write interfaces
> - implements new functions the buftarg abstraction eeds to provide
> - adds the kernel buffer cache implementation
> - switches the implementation and removes the old rd/wr and caching
>   code
> - converts the buftarg IO engine to use AIO
> 
> The AIO engine is not fully functional yet - it does not support
> discontiguous buffers correctly yet (needs an iocb per buffer map) -
> but otherwise it largely works just fine.
> 
> There are -lots- of loose ends that still need to be worked on,
> especially w.r.t. xfs_repair. Stuff that comes to mind:
> 
> - xfs_buf_mark_dirty() just currently calls xfs_bwrite() and does
>   synchronous IO. This needs to be converted to a buftarg internal
>   delwri buffer list and a flushing mechanism needs to be added.

<nod>

> - AIO engine does not support discontiguous buffers.
> 
> - work out best way to handle IOCBs for AIO - is embedding them in
>   the xfs_buf the only way to do this efficiently?

The only other way I can think of is to embed MAX_AIO_EVENTS iocbs in
the buftarg and (I guess) track which ones are free with a bitmap.  That
would cut down the iocb memory overhead to the IOs that we're actually
running at a cost of more bookkeepping and potential starvation issues
for the yuuuge buffer that takes a long time to collect all NR iocbs.

> - rationalise the xfs_buf/xfs_buftarg split. Work out where to define
>   the stuff that is different between kernel and userspace (e.g. the
>   struct xfs_buf) vs the stuff that remains the same (e.g. the XBF
>   buffer flags)

Ow my brain.

> - lots of code cleanup
> 
> - xfs_repair does not cache between phases right now - it
>   unconditionally purges the AG cache at the end of AG processing.
>   This keeps memory usage way down, and for filesystems that have
>   too much metadata to cache entirely in RAM, it's *much* faster

Why is it faster to blow out the cache?  Is it the internal overhead of
digging through a huge buftarg hash table to find the buffer that the
caller wants and/or whacking enough of the other stale buffers to free
up memory to avoid the buffer cache limits?

>   than xfs_repair v5.6.0. On smaller filesystems, however, hitting
>   RAM caches is much more desirable. Hence there is work to be done
>   here determining how to manage the caches effectively.
> 
> - async buffer readahead works in userspace now, but unless you have
>   a very high IOP device (think in multiples of 100kiops) the
>   xfs_repair prefetch mechanism that trades off bandwidth for iops is
>   still faster. More investigative work is needed here.

(No idea what this means.)

> - xfs_db could likely use readahead for btree traversals and other
>   things.
> 
> - More memory pressure trigger testing needs to be done to determine
>   if the trigger settings are sufficient to prevent OOM kills on
>   different hardware and filesystem configurations.
> 
> - Replace AIO engine with io_uring engine?

Hm, is Al actually going to review io_uring as he's been threatening to
do?  I'd hold off until that happens, or one of us goes and tests it in
anger to watch all the smoke leak out, etc...

> - start working on splitting the kernel xfs_buf.[ch] code the same
>   way as the userspace code and moving xfs_buf.[ch] to fs/xfs/libxfs
>   so that they can be updated in sync.

Aha, so yes that answers my question in ... whichever patch that was
somewhere around #17.

--D

> There's plenty of other stuff, too, but this like is a good
> start....
> 
> Cheers,
> 
> Dave.
> 
> 
> Dave Chinner (27):
>   xfsprogs: remove unused buffer tracing code
>   xfsprogs: remove unused IO_DEBUG functionality
>   libxfs: get rid of b_bcount from xfs_buf
>   libxfs: rename buftarg->dev to btdev
>   xfsprogs: get rid of ancient btree tracing fragments
>   xfsprogs: remove xfs_buf_t typedef
>   xfsprogs: introduce liburcu support
>   libxfs: add spinlock_t wrapper
>   atomic: convert to uatomic
>   libxfs: add kernel-compatible completion API
>   libxfs: add wrappers for kernel semaphores
>   xfsprogs: convert use-once buffer reads to uncached IO
>   libxfs: introduce userspace buftarg infrastructure
>   xfs: rename libxfs_buftarg_init to libxfs_open_devices()
>   libxfs: introduce userspace buftarg infrastructure
>   libxfs: add a synchronous IO engine to the buftarg
>   xfsprogs: convert libxfs_readbufr to libxfs_buf_read_uncached
>   libxfs: convert libxfs_bwrite to buftarg IO
>   libxfs: add cache infrastructure to buftarg
>   libxfs: add internal lru to btcache
>   libxfs: Add kernel list_lru wrapper
>   libxfs: introduce new buffer cache infrastructure
>   libxfs: use PSI information to detect memory pressure
>   libxfs: add a buftarg cache shrinker implementation
>   libxfs: switch buffer cache implementations
>   build: set platform_defs.h.in dependency correctly
>   libxfs: convert sync IO buftarg engine to AIO
> 
>  Makefile                   |    2 +-
>  configure.ac               |    3 +
>  copy/Makefile              |    3 +-
>  copy/xfs_copy.c            |   15 +-
>  db/Makefile                |    3 +-
>  db/init.c                  |    4 +-
>  db/io.c                    |   33 +-
>  db/metadump.c              |    2 +-
>  db/sb.c                    |    2 +-
>  debian/control             |    2 +-
>  growfs/Makefile            |    3 +-
>  include/Makefile           |    6 +-
>  include/atomic.h           |   63 +-
>  include/builddefs.in       |    6 +-
>  include/cache.h            |  133 ----
>  include/completion.h       |   61 ++
>  include/libxfs.h           |   30 +-
>  include/libxlog.h          |    6 +-
>  include/list_lru.h         |   69 ++
>  include/platform_defs.h.in |    2 +
>  include/sema.h             |   35 +
>  include/spinlock.h         |   25 +
>  include/xfs.h              |   23 +
>  include/xfs_btree_trace.h  |   87 ---
>  include/xfs_inode.h        |    3 +-
>  include/xfs_mount.h        |   13 +
>  include/xfs_trace.h        |   24 +
>  include/xfs_trans.h        |    1 +
>  libfrog/linux.c            |   14 +-
>  libfrog/workqueue.c        |    3 +
>  libxfs/Makefile            |    7 +-
>  libxfs/buftarg.c           |  953 +++++++++++++++++++++++++
>  libxfs/cache.c             |  724 -------------------
>  libxfs/init.c              |  220 +++---
>  libxfs/libxfs_api_defs.h   |    7 +
>  libxfs/libxfs_io.h         |  185 +----
>  libxfs/libxfs_priv.h       |   74 +-
>  libxfs/logitem.c           |   10 +-
>  libxfs/rdwr.c              | 1234 +-------------------------------
>  libxfs/trans.c             |   21 +-
>  libxfs/util.c              |    8 +-
>  libxfs/xfs_alloc.c         |   16 +-
>  libxfs/xfs_bmap.c          |    6 +-
>  libxfs/xfs_btree.c         |   10 +-
>  libxfs/xfs_buf.c           | 1357 ++++++++++++++++++++++++++++++++++++
>  libxfs/xfs_buf.h           |  242 +++++++
>  libxfs/xfs_buftarg.h       |  176 +++++
>  libxfs/xfs_ialloc.c        |    4 +-
>  libxfs/xfs_rtbitmap.c      |   22 +-
>  libxlog/xfs_log_recover.c  |   23 +-
>  logprint/Makefile          |    3 +-
>  logprint/log_print_all.c   |    2 +-
>  logprint/logprint.c        |    2 +-
>  m4/Makefile                |    1 +
>  m4/package_urcu.m4         |   22 +
>  mdrestore/Makefile         |    3 +-
>  mkfs/Makefile              |    2 +-
>  mkfs/proto.c               |   11 +-
>  mkfs/xfs_mkfs.c            |   61 +-
>  repair/Makefile            |    2 +-
>  repair/agheader.c          |    2 +-
>  repair/attr_repair.c       |   14 +-
>  repair/da_util.c           |    2 +-
>  repair/da_util.h           |    2 +-
>  repair/dino_chunks.c       |   14 +-
>  repair/dinode.c            |    4 +-
>  repair/incore.h            |    2 +-
>  repair/phase3.c            |    7 +-
>  repair/phase4.c            |    5 +-
>  repair/phase5.c            |    2 +-
>  repair/phase6.c            |    4 +-
>  repair/prefetch.c          |  118 ++--
>  repair/progress.c          |   16 +-
>  repair/progress.h          |    4 +-
>  repair/rt.c                |    4 +-
>  repair/scan.c              |   10 +-
>  repair/xfs_repair.c        |  192 +++--
>  scrub/Makefile             |    3 +-
>  scrub/progress.c           |    2 +
>  79 files changed, 3644 insertions(+), 2847 deletions(-)
>  delete mode 100644 include/cache.h
>  create mode 100644 include/completion.h
>  create mode 100644 include/list_lru.h
>  create mode 100644 include/sema.h
>  create mode 100644 include/spinlock.h
>  delete mode 100644 include/xfs_btree_trace.h
>  create mode 100644 libxfs/buftarg.c
>  delete mode 100644 libxfs/cache.c
>  create mode 100644 libxfs/xfs_buf.c
>  create mode 100644 libxfs/xfs_buf.h
>  create mode 100644 libxfs/xfs_buftarg.h
>  create mode 100644 m4/package_urcu.m4
> 
> -- 
> 2.28.0
>

Dave Chinner Oct. 15, 2020, 10:35 p.m. UTC | #3

On Thu, Oct 15, 2020 at 11:37:56AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 15, 2020 at 06:21:28PM +1100, Dave Chinner wrote:
> > - AIO engine does not support discontiguous buffers.
> > 
> > - work out best way to handle IOCBs for AIO - is embedding them in
> >   the xfs_buf the only way to do this efficiently?
> 
> The only other way I can think of is to embed MAX_AIO_EVENTS iocbs in
> the buftarg and (I guess) track which ones are free with a bitmap.

I originally had an array embedded in the buftarg, and when I
realised that I had to rtack exactly which ones were still in use I
trashed it and moved the iocb to the struct xfs_buf so no tracking
is necessary. All I need to do is change the buffer allocation code
to allocate an iocb array when it allocates the map array so we have
a direct b_maps -> b_iocbs relationship at all times.

> That
> would cut down the iocb memory overhead to the IOs that we're actually
> running at a cost of more bookkeepping and potential starvation issues
> for the yuuuge buffer that takes a long time to collect all NR iocbs.

We've already got a huge amount of per-buffer state, so adding iocbs
to that isn't a huge deal...

> > - rationalise the xfs_buf/xfs_buftarg split. Work out where to define
> >   the stuff that is different between kernel and userspace (e.g. the
> >   struct xfs_buf) vs the stuff that remains the same (e.g. the XBF
> >   buffer flags)
> 
> Ow my brain.

Yeah. :(

> > - lots of code cleanup
> > 
> > - xfs_repair does not cache between phases right now - it
> >   unconditionally purges the AG cache at the end of AG processing.
> >   This keeps memory usage way down, and for filesystems that have
> >   too much metadata to cache entirely in RAM, it's *much* faster
> 
> Why is it faster to blow out the cache?  Is it the internal overhead of
> digging through a huge buftarg hash table to find the buffer that the
> caller wants and/or whacking enough of the other stale buffers to free
> up memory to avoid the buffer cache limits?

Lots of reasons. The size of the hash table/chain length is largely
irrelevant. The biggest issue seems to be memory allocation for the
buffers - when we allocate the buffer, malloc does a mprotect() call
on the new heap space for some reason and that takes the mmap_sem in
write mode. Which serialises all the page faults being done while
reading those buffers as they take the mmap_sem shared. Hence while
we are allocating new heap space, there's massive numbers of context
switches on the mmap_sem contention. COntext switch profile while
running at ~150,000 context switches a second:

   64.45%    64.45%  xfs_repair       [kernel.kallsyms]         [k] schedule
            |          
            |--41.02%--__memmove_sse2_unaligned_erms
            |          |          
            |           --40.95%--asm_exc_page_fault
            |                     exc_page_fault
            |                     |          
            |                      --40.95%--down_read
            |                                rwsem_down_read_slowpath
            |                                schedule
            |                                schedule
            |          
            |--7.88%--sysmalloc
            |          |          
            |           --7.87%--asm_exc_page_fault
            |                     exc_page_fault
            |                     |          
            |                      --7.86%--down_read
            |                                rwsem_down_read_slowpath
            |                                schedule
            |                                schedule
....
            |          
             --1.68%--__mprotect
                       |          
                        --1.68%--entry_SYSCALL_64_after_hwframe
                                  do_syscall_64
                                  |          
                                   --1.68%--__x64_sys_mprotect
                                             do_mprotect_pkey
                                             |          
                                              --1.68%--down_write_killable
                                                        rwsem_down_write_slowpath
                                                        schedule
                                                        schedule

The __memmove_sse2_unaligned_erms() function is from the memcpy()
loop in repair/prefetch.c where it is copying metadata from the
large IO buffer into the individual xfs_bufs (i.e. read TLB faults)
and AFAICT the mprotect() syscalls are coming from malloc heap
expansion. About 7% of the context switches are from pthread_mutex
locks, and only 2.5% of the context switches are from the pread() IO
that is being done.

IOWs, while memory footprint is growing/changing, repair performance
is largely limited by mmap_sem concurrency issues.

So my reading of this is that by bounding the memory footprint of
repair by freeing the buffers back to the heap regularly, we largely
stay out of the heap grow mprotect path and avoid this mmap_sem
contention. That allows the CPU intensive parts of prefetch and
metadata verification to run more efficiently....

> >   than xfs_repair v5.6.0. On smaller filesystems, however, hitting
> >   RAM caches is much more desirable. Hence there is work to be done
> >   here determining how to manage the caches effectively.
> > 
> > - async buffer readahead works in userspace now, but unless you have
> >   a very high IOP device (think in multiples of 100kiops) the
> >   xfs_repair prefetch mechanism that trades off bandwidth for iops is
> >   still faster. More investigative work is needed here.
> 
> (No idea what this means.)

The SSD I've been testing on peaks at about 70,000kiops for reads.
If I drive prefetch by xfs_buf_readahead() only (got patches that do
that) then, compared to 5.6.0, prefetch bandwidth drops to ~400MB/s
but the device is iops bound and so prefetch is slower than when
5.6.0 uses 2MB IOs, does ~5-10kiops and consumes 900MB/s of bandwidth.

This patchset runs the existing prefetch code at 15-20kiops and
1.8-2.1GB/s using 2MB IOs - we still get better throughput on these
SSDs by trading off iops for bandwidth.

The original patchset I wrote (but never published) had similar
performance on the above hardware, but I also ran it on faster SSDs.
On those, repair could push them to around 200-250kiops with
prefetch by xfs_buf_readahead() and that was much faster than the
2-2.5GB/s that the existing prefetch could get before being limited
by the pcie 3.0 4x bus the SSD is on. I'll have some numbers from
this patchset on my faster hardware next week, but I don't expect
them to be much different...

So, essentially, if you've got more small read IOPS capacity than
you have "sparse metadata population optimised" large IO bandwidth
(or metadata too sparse to trigger the large IO optimisations) then
using xfs_buf_readahead() appears to more efficient than using the
existing prefetch code.

That said, there's a lot of testing, observation and analysis needed
to determine what will be the best general approach here. Signs
are pointing towards "existing prefetch for rotational storage and
low iops ssds" and miminal caching and on-demand prefetch for high
end SSDs, but we'll see how that stands up...

> > - xfs_db could likely use readahead for btree traversals and other
> >   things.
> > 
> > - More memory pressure trigger testing needs to be done to determine
> >   if the trigger settings are sufficient to prevent OOM kills on
> >   different hardware and filesystem configurations.
> > 
> > - Replace AIO engine with io_uring engine?
> 
> Hm, is Al actually going to review io_uring as he's been threatening to
> do?  I'd hold off until that happens, or one of us goes and tests it in
> anger to watch all the smoke leak out, etc...

Yeah, I'm in no hurry to do this. Just making sure people understand
that it's a potential future direction.

> > - start working on splitting the kernel xfs_buf.[ch] code the same
> >   way as the userspace code and moving xfs_buf.[ch] to fs/xfs/libxfs
> >   so that they can be updated in sync.
> 
> Aha, so yes that answers my question in ... whichever patch that was
> somewhere around #17.

Right. I had to start somewhere, and given that userspace
requirements largely define how the split needs to occur, I decided
to start by making userspace work and then, once that is done,
change the kernel to match the same structure that userspace
needed....

Cheers,

Dave.