[RFC,0/3] xfs: use large folios for buffers

Message ID 20240118222216.4131379-1-david@fromorbit.com (mailing list archive)

Message

Dave Chinner Jan. 18, 2024, 10:19 p.m. UTC
The XFS buffer cache supports metadata buffers up to 64kB, and it does so by
aggregating multiple pages into a single contiguous memory region using
vmapping. This is expensive (both the setup and the runtime TLB mapping cost),
and would be unnecessary if we could allocate large contiguous memory regions
for the buffers in the first place.

Enter multi-page folios.

This patchset converts the buffer cache to use the folio API, then enhances it
to optimistically use large folios where possible. It retains the old "vmap an
array of single page folios" functionality as a fallback when large folio
allocation fails. This means that, like page cache support for large folios, we
aren't dependent on large folio allocation succeeding all the time.

This relegates the single page array allocation mechanism to a "slow path"
whose performance we don't have to care about so much anymore. That might
allow us to simplify it a bit in future.
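
To sketch the shape of that strategy, the allocation path ends up looking
roughly like this (a minimal sketch only, not the actual patch code; the
fallback helper name is hypothetical):

/*
 * Rough sketch of the allocation strategy described above: optimistically
 * try one physically contiguous large folio, and fall back to the existing
 * "array of single page folios + vmap" mechanism when that fails.
 */
static int
xfs_buf_alloc_mem_sketch(
	struct xfs_buf	*bp,
	size_t		size)
{
	unsigned int	order = get_order(size);
	struct folio	*folio;

	folio = folio_alloc(GFP_KERNEL | __GFP_NOWARN, order);
	if (folio) {
		bp->b_addr = folio_address(folio);
		return 0;
	}

	/* Slow path: single page folios that get vmapped together. */
	return xfs_buf_alloc_folio_array(bp, size);	/* hypothetical */
}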

One of the issues with the folio conversion is that we use a couple of APIs that
take struct page ** (i.e. pointers to page pointer arrays) and there aren't
folio counterparts. These are the bulk page allocator and vm_map_ram(). In the
cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing
that this array will only contain single page folios and that single page folios
and struct page are the same structure and so have the same address. This is a
bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio
versions of these interfaces right now. We don't need to use the bulk page
allocator so much any more, because that's now a slow path and we could probably
just call folio_alloc() in a loop like we used to. What to do about vm_map_ram()
is a little less clear....
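
To make the cast concrete, the fallback mapping path looks something like
this (again a sketch only; b_folio_count is an assumed field name):

/*
 * The array only ever holds order-0 folios here, and an order-0 folio
 * overlays its struct page at the same address, so the cast to
 * struct page ** works even though it is ugly.
 */
static void *
xfs_buf_map_single_folios_sketch(
	struct xfs_buf	*bp,
	gfp_t		gfp_mask)
{
	unsigned int	i;

	/* The bulk page allocator isn't needed here; a simple loop will do. */
	for (i = 0; i < bp->b_folio_count; i++) {
		bp->b_folios[i] = folio_alloc(gfp_mask, 0);
		if (!bp->b_folios[i])
			return NULL;	/* caller unwinds the partial allocation */
	}

	/* vm_map_ram() has no folio-based variant, hence the cast. */
	return vm_map_ram((struct page **)bp->b_folios, bp->b_folio_count,
			NUMA_NO_NODE);
}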

The other issue I tripped over in doing this conversion is that the
discontiguous buffer straddling code in the buf log item dirty region tracking
is broken. We don't actually exercise that code on existing configurations, and
I tripped over it when tracking down a bug in the folio conversion. I fixed it
and short-circuited the check for contiguous buffers, but that didn't fix the
failure I was seeing (which was not handling bp->b_offset and large folios
properly when building bios).

Apart from those issues, the conversion and enhancement is relatively
straightforward.  It passes fstests on both 512 and 4096 byte sector size storage (512
byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values)
and doesn't appear to cause any problems with large directory buffers, though I
haven't done any real testing on those yet. Large folio allocations are
definitely being exercised, though, as all the inode cluster buffers are 16kB on
a 512 byte inode V5 filesystem.

Thoughts, comments, etc?

Note: this patchset is on top of the NOFS removal patchset I sent a
few days ago. That can be pulled from this git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup

Comments

Darrick J. Wong Jan. 19, 2024, 1:05 a.m. UTC | #1
On Fri, Jan 19, 2024 at 09:19:38AM +1100, Dave Chinner wrote:
> The XFS buffer cache supports metadata buffers up to 64kB, and it does so by
> aggregating multiple pages into a single contiguous memory region using
> vmapping. This is expensive (both the setup and the runtime TLB mapping cost),
> and would be unnecessary if we could allocate large contiguous memory regions
> for the buffers in the first place.
> 
> Enter multi-page folios.

LOL, hch and I just wrapped up making the xfbtree buffer cache work with
large folios coming from tmpfs.  Though the use case there is simpler
because we require blocksize==PAGE_SIZE, forbid the use of highmem, and
don't need discontig buffers.  Hence we sidestep vm_map_ram. :)

> This patchset converts the buffer cache to use the folio API, then enhances it
> to optimistically use large folios where possible. It retains the old "vmap an
> array of single page folios" functionality as a fallback when large folio
> allocation fails. This means that, like page cache support for large folios, we
> aren't dependent on large folio allocation succeeding all the time.
> 
> This relegates the single page array allocation mechanism to a "slow path"
> whose performance we don't have to care about so much anymore. That might
> allow us to simplify it a bit in future.
> 
> One of the issues with the folio conversion is that we use a couple of APIs that
> take struct page ** (i.e. pointers to page pointer arrays) and there aren't
> folio counterparts. These are the bulk page allocator and vm_map_ram(). In the
> cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing
> that this array will only contain single page folios and that single page folios
> and struct page are the same structure and so have the same address. This is a
> bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio
> versions of these interfaces right now. We don't need to use the bulk page
> allocator so much any more, because that's now a slow path and we could probably
> just call folio_alloc() in a loop like we used to. What to do about vm_map_ram()
> is a little less clear....

Yeah, that's what I suspected.

> The other issue I tripped over in doing this conversion is that the
> discontiguous buffer straddling code in the buf log item dirty region tracking
> is broken. We don't actually exercise that code on existing configurations, and
> I tripped over it when tracking down a bug in the folio conversion. I fixed it
> and short-circuited the check for contiguous buffers, but that didn't fix the
> failure I was seeing (which was not handling bp->b_offset and large folios
> properly when building bios).

Yikes.

> Apart from those issues, the conversion and enhancement is relatively
> straightforward.  It passes fstests on both 512 and 4096 byte sector size storage (512
> byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values)
> and doesn't appear to cause any problems with large directory buffers, though I
> haven't done any real testing on those yet. Large folio allocations are
> definitely being exercised, though, as all the inode cluster buffers are 16kB on
> a 512 byte inode V5 filesystem.
> 
> Thoughts, comments, etc?

Not yet.

> Note: this patchset is on top of the NOFS removal patchset I sent a
> few days ago. That can be pulled from this git branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup

Oooh a branch link, thank you.  It's so much easier if I can pull a
branch while picking through commits over gitweb.

--D

Dave Chinner Jan. 22, 2024, 11:53 a.m. UTC | #2
On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Thoughts, comments, etc?
> 
> The interesting part is whether it will cause additional tail latencies
> when allocating under fragmentation, with direct reclaim, compaction
> etc. being triggered before it falls back to the base page path.

It's not like I don't know these problems exist with memory
allocation. Go have a look at xlog_kvmalloc(), which is an open-coded
kvmalloc() that allows the high-order kmalloc allocations to
fail fast without triggering all the expensive and unnecessary
direct reclaim overhead (e.g. compaction!) because we can fall back
to vmalloc without huge concerns. When high-order allocations start
to fail, we fall back to vmalloc and then hit the long-standing
vmalloc scalability problems before anything else in XFS or
the IO path becomes a bottleneck.
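
For reference, the fail-fast pattern being described looks roughly like this
(a sketch of the idea, not a verbatim copy of xlog_kvmalloc()):

static inline void *
failfast_kvmalloc(size_t size)
{
	void	*p;

	/*
	 * Let the high-order kmalloc fail quickly instead of kicking off
	 * compaction and heavy direct reclaim, then fall back to vmalloc.
	 */
	p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
	if (p)
		return p;
	return vmalloc(size);
}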

IOWs, we already know that fail-fast high-order allocation is a more
efficient and effective fast path than using vmalloc/vm_map_ram() all
the time. As this is an RFC, I haven't implemented stuff like this
yet - I haven't seen anything in the profiles indicating that high
order folio allocation is failing and causing lots of reclaim
overhead, so I simply haven't added fail-fast behaviour yet...

> In fact it is highly likely it will; the question is just how bad it is.
> 
> Unfortunately benchmarking for that isn't that easy: it needs artificial
> memory fragmentation, then some high stress workload, and then
> instrumenting the transactions for individual latencies.

I stress test and measure XFS metadata performance under sustained
memory pressure all the time. This change has not caused any
obvious regressions in the short time I've been testing it.

I still need to do perf testing on large directory block sizes. That
is where high-order allocations will get stressed - that's where
xlog_kvmalloc() starts dominating the profiles as it trips over
vmalloc scalability issues...

> I would in any case add a tunable for it in case people run into this.

No tunables. It either works or it doesn't. If we can't make
it work reliably by default, we throw it in the dumpster, light it
on fire and walk away.

> Tail latencies are a common concern on many IO workloads.

Yes, for user data operations it's a common concern. For metadata,
not so much - there's so many far worse long tail latencies in
metadata operations (like waiting for journal space) that memory
allocation latencies in the metadata IO path are largely noise....

-Dave.
Andi Kleen Jan. 22, 2024, 1:34 p.m. UTC | #3
[fixed the subject, not sure what happened there]

FWIW I'm not sure fail-fast is always the right strategy here;
in many cases, even with some reclaim, compaction may win. Just not if you're
on a tight budget for the latencies.

> I stress test and measure XFS metadata performance under sustained
> memory pressure all the time. This change has not caused any
> obvious regressions in the short time I've been testing it.

Did you test for tail latencies?

There are some relatively simple ways to trigger memory fragmentation;
the standard way is to allocate a very large THP-backed file and then
punch a lot of holes in it.
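
As a rough userspace sketch of that recipe (the tmpfs-with-huge-pages mount
at /mnt/thp, the file size and the hole layout are all assumptions; error
handling is minimal):

/*
 * Fill a large THP-backed file, then punch out most of every 2MB region so
 * that only scattered 4kB remainders stay resident, fragmenting physical
 * memory. Illustrative sketch only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const off_t fsize = 32LL << 30;		/* 32GB file */
	const off_t huge = 2L << 20;		/* 2MB huge page */
	off_t off;
	int fd;

	fd = open("/mnt/thp/frag", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;
	fallocate(fd, 0, 0, fsize);		/* populate with huge pages */
	for (off = 0; off < fsize; off += huge)
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  off, huge - 4096);	/* leave a 4kB tail per 2MB */
	close(fd);
	return 0;
}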

> 
> I still need to do perf testing on large directory block sizes. That
> is where high-order allocations will get stressed - that's where
> xlog_kvmalloc() starts dominating the profiles as it trips over
> vmalloc scalability issues...

Yes that's true. vmalloc has many issues, although with the recent
patches to split the rbtrees with separate locks it may now look
quite different than before.

> 
> > I would in any case add a tunable for it in case people run into this.
> 
> No tunables. It either works or it doesn't. If we can't make
> it work reliably by default, we throw it in the dumpster, light it
> on fire and walk away.

I'm not sure there is a single definition of "reliably" here -- for
many workloads tail latencies don't matter, so it's always reliable,
as long as you have good aggregate throughput.

Others have very high expectations for them.

Forcing those high expectations on everyone is probably not a good
general strategy though, as there are trade-offs involved.

I could see that having lots of small tunables for every use might not be a
good idea. Perhaps there would be a case for a single general tunable
that controls higher order folios for everyone.

> 
> > Tail latencies are a common concern on many IO workloads.
> 
> Yes, for user data operations it's a common concern. For metadata,
> not so much - there's so many far worse long tail latencies in
> metadata operations (like waiting for journal space) that memory
> allocation latencies in the metadata IO path are largely noise....

I've seen pretty long stalls in the past.

The difference from the journal is also that it is local to the file system,
while memory is normally shared with everyone on the node or system. So the
scope of noisy neighbour impact can be quite different, especially on a large
machine.

-Andi
Dave Chinner Jan. 23, 2024, 12:19 a.m. UTC | #4
On Mon, Jan 22, 2024 at 05:34:12AM -0800, Andi Kleen wrote:
> [fixed the subject, not sure what happened there]
> 
> FWIW I'm not sure fail-fast is always the right strategy here;
> in many cases, even with some reclaim, compaction may win. Just not if you're
> on a tight budget for the latencies.
> 
> > I stress test and measure XFS metadata performance under sustained
> > memory pressure all the time. This change has not caused any
> > obvious regressions in the short time I've been testing it.
> 
> Did you test for tail latencies?

No, it's an RFC, and I mostly don't care about memory allocation
tail latencies because it is highly unlikely to be a *new issue* we
need to care about.

The fact is that we already do so much memory allocation and high
order memory allocation (e.g. through slub, xlog_kvmalloc(), user
data IO through the page cache, etc.) that if there's a long tail
latency problem with high order memory allocation then it will
already be noticeably affecting the XFS data and metadata IO
latencies.

Nobody is reporting problems about excessive long tail latencies
when using XFS, so my care factor about long tail latencies in this
specific memory allocation case is close to zero.

> There are some relatively simple ways to trigger memory fragmentation;
> the standard way is to allocate a very large THP-backed file and then
> punch a lot of holes in it.

Or just run a filesystem with lots of variable sized high order
allocations and varying cached object life times under sustained
memory pressure for a significant period of time....

> > > I would in any case add a tunable for it in case people run into this.
> > 
> > No tunables. It either works or it doesn't. If we can't make
> > it work reliably by default, we throw it in the dumpster, light it
> > on fire and walk away.
> 
> I'm not sure there is a single definition of "reliably" here -- for
> many workloads tail latencies don't matter, so it's always reliable,
> as long as you have good aggregate throughput.
> 
> Others have very high expectations for them.
> 
> Forcing the high expectations on everyone is probably not a good
> general strategy though, as there are general trade offs.

Yup, and we have to make those tradeoffs because filesystems need to
be good at "general purpose" workloads as their primary focus.
Minimising long tail latencies is really just "best effort only"
because we are intimately aware of the fact that there are global
resource limitations that cause long tail latencies in filesystem
implementations that simply cannot be worked around.

> I could see that having lots of small tunables for every use might not be a
> good idea. Perhaps there would be a case for a single general tunable
> that controls higher order folios for everyone.

You're not listening: no tunables. Code that has tunables because
you think it *may not work* is code that is not ready to be merged.

> > > Tail latencies are a common concern on many IO workloads.
> > 
> > Yes, for user data operations it's a common concern. For metadata,
> > not so much - there's so many far worse long tail latencies in
> > metadata operations (like waiting for journal space) that memory
> > allocation latencies in the metadata IO path are largely noise....
> 
> I've seen pretty long stalls in the past.
> 
> The difference from the journal is also that it is local to the file system,
> while memory is normally shared with everyone on the node or system. So the
> scope of noisy neighbour impact can be quite different, especially on a large
> machine.

Most systems run everything on a single filesystem. That makes the
filesystem just as global a resource as memory allocation.  If the journal
bottlenecks, everyone on the system suffers the performance degradation,
not just the user who caused it.

-Dave.