[04/26] xfs: Improve metadata buffer reclaim accountability

Message ID 20191009032124.10541-5-david@fromorbit.com
State New
Series
  • mm, xfs: non-blocking inode reclaim

Commit Message

Dave Chinner Oct. 9, 2019, 3:21 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

The buffer cache shrinker frees more than just the xfs_buf slab
objects - it also frees the pages attached to the buffers. Make sure
the memory reclaim code accounts for this memory being freed
correctly, similar to how the inode shrinker accounts for pages
freed from the page cache due to mapping invalidation.

We also need to make sure that the mm subsystem knows these are
reclaimable objects. We provide the memory reclaim subsystem with
a shrinker to reclaim xfs_bufs, so we should really mark the slab
that way.

We also have a lot of xfs_bufs in a busy system, spread them around
like we do inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Brian Foster Oct. 11, 2019, 12:39 p.m. UTC | #1
On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, spread them around
> like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Seems reasonable, but for inodes we also spread the ili zone. Should we
not be consistent with bli's as well?

Brian

>  fs/xfs/xfs_buf.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e484f6bead53..45b470f55ad7 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,6 +324,9 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;
>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> @@ -2064,7 +2067,8 @@ int __init
>  xfs_buf_init(void)
>  {
>  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> -						KM_ZONE_HWALIGN, NULL);
> +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> +			NULL);
>  	if (!xfs_buf_zone)
>  		goto out;
>  
> -- 
> 2.23.0.rc1
>
Christoph Hellwig Oct. 11, 2019, 12:57 p.m. UTC | #2
On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote:
> Seems reasonable, but for inodes we also spread the ili zone. Should we
> not be consistent with bli's as well?

Btw, can we please kill off the stupid KM_ZONE_* flags?  We only use
each of them in less than a handful of places, and they make reading the
code much harder.
Dave Chinner Oct. 11, 2019, 11:13 p.m. UTC | #3
On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote:
> On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The buffer cache shrinker frees more than just the xfs_buf slab
> > objects - it also frees the pages attached to the buffers. Make sure
> > the memory reclaim code accounts for this memory being freed
> > correctly, similar to how the inode shrinker accounts for pages
> > freed from the page cache due to mapping invalidation.
> > 
> > We also need to make sure that the mm subsystem knows these are
> > reclaimable objects. We provide the memory reclaim subsystem with
> > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > that way.
> > 
> > We also have a lot of xfs_bufs in a busy system, spread them around
> > like we do inodes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Seems reasonable, but for inodes we also spread the ili zone. Should we
> not be consistent with bli's as well?

bli's are reclaimed when the buffer is cleaned. ili's live for the
life of the inode in cache. Hence bli's are short term allocations
(much shorter than the xfs_bufs they attach to) and are reclaimed much
faster than inodes and their ilis. There are also a lot fewer blis than
ili's, so the spread of their footprint across memory nodes doesn't
matter that much. Local access for the memcpy during formatting is
probably more important than spreading the memory usage of them
these days, anyway.

Cheers,

Dave.
Dave Chinner Oct. 11, 2019, 11:14 p.m. UTC | #4
On Fri, Oct 11, 2019 at 05:57:35AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote:
> > Seems reasonable, but for inodes we also spread the ili zone. Should we
> > not be consistent with bli's as well?
> 
> Btw, can we please kill off the stupid KM_ZONE_* flags?  We only use
> each of them in less than a handful of places, and they make reading the
> code much harder.

Separate patchset - it's not relevant to this one which is already
complex enough...

Cheers,

Dave.
Brian Foster Oct. 12, 2019, 12:05 p.m. UTC | #5
On Sat, Oct 12, 2019 at 10:13:23AM +1100, Dave Chinner wrote:
> On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote:
> > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > objects - it also frees the pages attached to the buffers. Make sure
> > > the memory reclaim code accounts for this memory being freed
> > > correctly, similar to how the inode shrinker accounts for pages
> > > freed from the page cache due to mapping invalidation.
> > > 
> > > We also need to make sure that the mm subsystem knows these are
> > > reclaimable objects. We provide the memory reclaim subsystem with
> > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > that way.
> > > 
> > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > like we do inodes.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > Seems reasonable, but for inodes we also spread the ili zone. Should we
> > not be consistent with bli's as well?
> 
> bli's are reclaimed when the buffer is cleaned. ili's live for the
> life of the inode in cache. Hence bli's are short term allocations
> (much shorter than the xfs_bufs they attach to) and are reclaimed much
> faster than inodes and their ilis. There are also a lot fewer blis than
> ili's, so the spread of their footprint across memory nodes doesn't
> matter that much. Local access for the memcpy during formatting is
> probably more important than spreading the memory usage of them
> these days, anyway.
> 

Yes, the buffer/inode lifecycle difference is why I presume bli
zones are not ZONE_RECLAIM like ili zones. This doesn't tell me anything
about why buffers should be spread around as such and buffer log items
not, though..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 13, 2019, 3:14 a.m. UTC | #6
On Sat, Oct 12, 2019 at 08:05:58AM -0400, Brian Foster wrote:
> On Sat, Oct 12, 2019 at 10:13:23AM +1100, Dave Chinner wrote:
> > On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote:
> > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > > objects - it also frees the pages attached to the buffers. Make sure
> > > > the memory reclaim code accounts for this memory being freed
> > > > correctly, similar to how the inode shrinker accounts for pages
> > > > freed from the page cache due to mapping invalidation.
> > > > 
> > > > We also need to make sure that the mm subsystem knows these are
> > > > reclaimable objects. We provide the memory reclaim subsystem with
> > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > > that way.
> > > > 
> > > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > > like we do inodes.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > 
> > > Seems reasonable, but for inodes we also spread the ili zone. Should we
> > > not be consistent with bli's as well?
> > 
> > bli's are reclaimed when the buffer is cleaned. ili's live for the
> > life of the inode in cache. Hence bli's are short term allocations
> > (much shorter than the xfs_bufs they attach to) and are reclaimed much
> > faster than inodes and their ilis. There are also a lot fewer blis than
> > ili's, so the spread of their footprint across memory nodes doesn't
> > matter that much. Local access for the memcpy during formatting is
> > probably more important than spreading the memory usage of them
> > these days, anyway.
> > 
> 
> Yes, the buffer/inode lifecycle difference is why I presume bli
> zones are not ZONE_RECLAIM like ili zones.

No, that is not the case. IO completion cleaning the buffer is what
frees the bli. The ili can only be freed by reclaiming the inode, so
it's memory that can only be returned to the free pool by running a
shrinker. Hence ilis are ZONE_RECLAIM to account them as memory that
can be reclaimed through shrinker invocation, while BLIs are not
because memory reclaim can't directly cause them to be freed.

> This doesn't tell me anything about why buffers should be spread
> around as such and buffer log items not, though..

xfs_bufs are long lived, are global structures, and can accumulate
in the millions if the workload requires it. IOWs, we should spread
xfs_bufs for exactly the same reasons inodes are spread.

As for BLIs, they are short term structures - a single xfs_buf might
have thousands of different blis attached to it over its life in
the cache because the BLI is freed when the buffer is cleaned.

We don't need to spread small short term structures around NUMA
memory nodes because they don't present a long term memory imbalance
vector. In general it is better to have them allocated local to the
process that is using them where the memory access latency is
lowest, knowing that they will be freed shortly and not contribute
to long term memory usage.

Cheers,

Dave.
Brian Foster Oct. 14, 2019, 1:05 p.m. UTC | #7
On Sun, Oct 13, 2019 at 02:14:50PM +1100, Dave Chinner wrote:
> On Sat, Oct 12, 2019 at 08:05:58AM -0400, Brian Foster wrote:
> > On Sat, Oct 12, 2019 at 10:13:23AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote:
> > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > > > objects - it also frees the pages attached to the buffers. Make sure
> > > > > the memory reclaim code accounts for this memory being freed
> > > > > correctly, similar to how the inode shrinker accounts for pages
> > > > > freed from the page cache due to mapping invalidation.
> > > > > 
> > > > > We also need to make sure that the mm subsystem knows these are
> > > > > reclaimable objects. We provide the memory reclaim subsystem with
> > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > > > that way.
> > > > > 
> > > > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > > > like we do inodes.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > 
> > > > Seems reasonable, but for inodes we also spread the ili zone. Should we
> > > > not be consistent with bli's as well?
> > > 
> > > bli's are reclaimed when the buffer is cleaned. ili's live for the
> > > life of the inode in cache. Hence bli's are short term allocations
> > > (much shorter than the xfs_bufs they attach to) and are reclaimed much
> > > faster than inodes and their ilis. There are also a lot fewer blis than
> > > ili's, so the spread of their footprint across memory nodes doesn't
> > > matter that much. Local access for the memcpy during formatting is
> > > probably more important than spreading the memory usage of them
> > > these days, anyway.
> > > 
> > 
> > Yes, the buffer/inode lifecycle difference is why I presume bli
> > zones are not ZONE_RECLAIM like ili zones.
> 
> No, that is not the case. IO completion cleaning the buffer is what
> frees the bli. The ili can only be freed by reclaiming the inode, so
> it's memory that can only be returned to the free pool by running a
> shrinker. Hence ilis are ZONE_RECLAIM to account them as memory that
> can be reclaimed through shrinker invocation, while BLIs are not
> because memory reclaim can't directly cause them to be freed.
> 

That is pretty much what I said. I think we're in agreement.

> > This doesn't tell me anything about why buffers should be spread
> > around as such and buffer log items not, though..
> 
> xfs_bufs are long lived, are global structures, and can accumulate
> in the millions if the workload requires it. IOWs, we should spread
> xfs_bufs for exactly the same reasons inodes are spread.
> 
> As for BLIs, they are short term structures - a single xfs_buf might
> have thousands of different blis attached to it over its life in
> the cache because the BLI is freed when the buffer is cleaned.
> 

Short term relative to the ILI perhaps, but these are still memory
allocations that outlive the allocating task returning to userspace, are
reused (across tasks) commonly enough and have an I/O bound life cycle.
That's also not considering page/slab buildup in the kmem cache beyond
the lifetime of individual allocations..

> We don't need to spread small short term structures around NUMA
> memory nodes because they don't present a long term memory imbalance
> vector. In general it is better to have them allocated local to the
> process that is using them where the memory access latency is
> lowest, knowing that they will be freed shortly and not contribute
> to long term memory usage.
> 

Hmm.. doesn't this all depend on enablement of a cgroup knob in the
first place? It looks to me that this behavior is tied to a per-task
state (not a per-mount or zone setting, which just allows such behavior
on the zone) where the controller has explicitly requested us to not
perform sustained allocations in the local node if possible. Instead,
spread slab allocations around at the cost of bypassing this local
allocation heuristic, presumably because $application wants prioritized
access to that memory. What am I missing?

BTW, it also looks like this is only relevant for slab. I don't see any
references in slub (or slob), but I haven't dug too deeply..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Oct. 30, 2019, 5:25 p.m. UTC | #8
On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
> 
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
> 
> We also have a lot of xfs_bufs in a busy system, spread them around
> like we do inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e484f6bead53..45b470f55ad7 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,6 +324,9 @@ xfs_buf_free(
>  
>  			__free_page(page);
>  		}
> +		if (current->reclaim_state)
> +			current->reclaim_state->reclaimed_slab +=
> +							bp->b_page_count;

Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together.

>  	} else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  	_xfs_buf_free_pages(bp);
> @@ -2064,7 +2067,8 @@ int __init
>  xfs_buf_init(void)
>  {
>  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> -						KM_ZONE_HWALIGN, NULL);
> +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,

I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply
to a particular "use another node" memory policy when slab is in use.
Was that your intent?

--D

> +			NULL);
>  	if (!xfs_buf_zone)
>  		goto out;
>  
> -- 
> 2.23.0.rc1
>
Dave Chinner Oct. 30, 2019, 9:43 p.m. UTC | #9
On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The buffer cache shrinker frees more than just the xfs_buf slab
> > objects - it also frees the pages attached to the buffers. Make sure
> > the memory reclaim code accounts for this memory being freed
> > correctly, similar to how the inode shrinker accounts for pages
> > freed from the page cache due to mapping invalidation.
> > 
> > We also need to make sure that the mm subsystem knows these are
> > reclaimable objects. We provide the memory reclaim subsystem with
> > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > that way.
> > 
> > We also have a lot of xfs_bufs in a busy system, spread them around
> > like we do inodes.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index e484f6bead53..45b470f55ad7 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -324,6 +324,9 @@ xfs_buf_free(
> >  
> >  			__free_page(page);
> >  		}
> > +		if (current->reclaim_state)
> > +			current->reclaim_state->reclaimed_slab +=
> > +							bp->b_page_count;
> 
> Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together.
> 
> >  	} else if (bp->b_flags & _XBF_KMEM)
> >  		kmem_free(bp->b_addr);
> >  	_xfs_buf_free_pages(bp);
> > @@ -2064,7 +2067,8 @@ int __init
> >  xfs_buf_init(void)
> >  {
> >  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> > -						KM_ZONE_HWALIGN, NULL);
> > +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> 
> I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply
> to a particular "use another node" memory policy when slab is in use.
> Was that your intent?

It's more documentation than anything - that we shouldn't be piling
these structures all on to one node because that can have severe
issues with NUMA memory reclaim algorithms. i.e. the xfs-buf
shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single
node can reclaim all the xfs-bufs on that node without touching any
other node.

That means, for example, if we instantiate all the AG header buffers
on a single node (e.g. like we do at mount time) then memory
pressure on that one node will generate IO stalls across the entire
filesystem as other nodes doing work have to repopulate the buffer
cache for any allocation or freeing of space/inodes..

IOWs, for large NUMA systems using cpusets this cache should be
spread around all of memory, especially as it has NUMA aware
reclaim. For everyone else, it's just documentation that improper
cgroup or NUMA memory policy could cause you all sorts of problems
with this cache.

It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in
filesystems for inode caches largely because, at the time (~2006),
the only reclaimable cache that could grow to any size large enough
to cause problems was the inode cache. It's been cargo-culted ever
since, whether it is needed or not (e.g. ceph).

In the case of the xfs_bufs, I've been running workloads recently
that cache several million xfs_bufs and only a handful of inodes
rather than the other way around. If we spread inodes because
caching millions on a single node can cause problems on large NUMA
machines, then we also need to spread xfs_bufs...

Cheers,

Dave.
Darrick J. Wong Oct. 31, 2019, 3:06 a.m. UTC | #10
On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote:
> On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > objects - it also frees the pages attached to the buffers. Make sure
> > > the memory reclaim code accounts for this memory being freed
> > > correctly, similar to how the inode shrinker accounts for pages
> > > freed from the page cache due to mapping invalidation.
> > > 
> > > We also need to make sure that the mm subsystem knows these are
> > > reclaimable objects. We provide the memory reclaim subsystem with
> > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > that way.
> > > 
> > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > like we do inodes.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_buf.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index e484f6bead53..45b470f55ad7 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -324,6 +324,9 @@ xfs_buf_free(
> > >  
> > >  			__free_page(page);
> > >  		}
> > > +		if (current->reclaim_state)
> > > +			current->reclaim_state->reclaimed_slab +=
> > > +							bp->b_page_count;
> > 
> > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together.
> > 
> > >  	} else if (bp->b_flags & _XBF_KMEM)
> > >  		kmem_free(bp->b_addr);
> > >  	_xfs_buf_free_pages(bp);
> > > @@ -2064,7 +2067,8 @@ int __init
> > >  xfs_buf_init(void)
> > >  {
> > >  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> > > -						KM_ZONE_HWALIGN, NULL);
> > > +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> > 
> > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply
> > to a particular "use another node" memory policy when slab is in use.
> > Was that your intent?
> 
> It's more documentation than anything - that we shouldn't be piling
> these structures all on to one node because that can have severe
> issues with NUMA memory reclaim algorithms. i.e. the xfs-buf
> shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single
> node can reclaim all the xfs-bufs on that node without touching any
> other node.
> 
> That means, for example, if we instantiate all the AG header buffers
> on a single node (e.g. like we do at mount time) then memory
> pressure on that one node will generate IO stalls across the entire
> filesystem as other nodes doing work have to repopulate the buffer
> cache for any allocation or freeing of space/inodes..
> 
> IOWs, for large NUMA systems using cpusets this cache should be
> spread around all of memory, especially as it has NUMA aware
> reclaim. For everyone else, it's just documentation that improper
> cgroup or NUMA memory policy could cause you all sorts of problems
> with this cache.
> 
> It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in
> filesystems for inode caches largely because, at the time (~2006),
> the only reclaimable cache that could grow to any size large enough
> to cause problems was the inode cache. It's been cargo-culted ever
> since, whether it is needed or not (e.g. ceph).
> 
> In the case of the xfs_bufs, I've been running workloads recently
> that cache several million xfs_bufs and only a handful of inodes
> rather than the other way around. If we spread inodes because
> caching millions on a single node can cause problems on large NUMA
> machines, then we also need to spread xfs_bufs...

Hmm, could we capture this as a comment somewhere?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Oct. 31, 2019, 8:50 p.m. UTC | #11
On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote:
> > On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > > objects - it also frees the pages attached to the buffers. Make sure
> > > > the memory reclaim code accounts for this memory being freed
> > > > correctly, similar to how the inode shrinker accounts for pages
> > > > freed from the page cache due to mapping invalidation.
> > > > 
> > > > We also need to make sure that the mm subsystem knows these are
> > > > reclaimable objects. We provide the memory reclaim subsystem with
> > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > > that way.
> > > > 
> > > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > > like we do inodes.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_buf.c | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index e484f6bead53..45b470f55ad7 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -324,6 +324,9 @@ xfs_buf_free(
> > > >  
> > > >  			__free_page(page);
> > > >  		}
> > > > +		if (current->reclaim_state)
> > > > +			current->reclaim_state->reclaimed_slab +=
> > > > +							bp->b_page_count;
> > > 
> > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together.
> > > 
> > > >  	} else if (bp->b_flags & _XBF_KMEM)
> > > >  		kmem_free(bp->b_addr);
> > > >  	_xfs_buf_free_pages(bp);
> > > > @@ -2064,7 +2067,8 @@ int __init
> > > >  xfs_buf_init(void)
> > > >  {
> > > >  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> > > > -						KM_ZONE_HWALIGN, NULL);
> > > > +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> > > 
> > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply
> > > to a particular "use another node" memory policy when slab is in use.
> > > Was that your intent?
> > 
> > It's more documentation than anything - that we shouldn't be piling
> > these structures all on to one node because that can have severe
> > issues with NUMA memory reclaim algorithms. i.e. the xfs-buf
> > shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single
> > node can reclaim all the xfs-bufs on that node without touching any
> > other node.
> > 
> > That means, for example, if we instantiate all the AG header buffers
> > on a single node (e.g. like we do at mount time) then memory
> > pressure on that one node will generate IO stalls across the entire
> > filesystem as other nodes doing work have to repopulate the buffer
> > cache for any allocation or freeing of space/inodes..
> > 
> > IOWs, for large NUMA systems using cpusets this cache should be
> > spread around all of memory, especially as it has NUMA aware
> > reclaim. For everyone else, it's just documentation that improper
> > cgroup or NUMA memory policy could cause you all sorts of problems
> > with this cache.
> > 
> > It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in
> > filesystems for inode caches largely because, at the time (~2006),
> > the only reclaimable cache that could grow to any size large enough
> > to cause problems was the inode cache. It's been cargo-culted ever
> > since, whether it is needed or not (e.g. ceph).
> > 
> > In the case of the xfs_bufs, I've been running workloads recently
> > that cache several million xfs_bufs and only a handful of inodes
> > rather than the other way around. If we spread inodes because
> > caching millions on a single node can cause problems on large NUMA
> > machines, then we also need to spread xfs_bufs...
> 
> Hmm, could we capture this as a comment somewhere?

Sure, but where? We're planning on getting rid of the KM_ZONE flags
in the near future, and most of this is specific to the impacts on
XFS. I could put it in xfs_super.c above where we initialise all the
slabs, I guess. Probably a separate patch, though....

Cheers,

Dave.
Darrick J. Wong Oct. 31, 2019, 9:05 p.m. UTC | #12
On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote:
> On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > > > objects - it also frees the pages attached to the buffers. Make sure
> > > > > the memory reclaim code accounts for this memory being freed
> > > > > correctly, similar to how the inode shrinker accounts for pages
> > > > > freed from the page cache due to mapping invalidation.
> > > > > 
> > > > > We also need to make sure that the mm subsystem knows these are
> > > > > reclaimable objects. We provide the memory reclaim subsystem with
> > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > > > that way.
> > > > > 
> > > > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > > > like we do inodes.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_buf.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index e484f6bead53..45b470f55ad7 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -324,6 +324,9 @@ xfs_buf_free(
> > > > >  
> > > > >  			__free_page(page);
> > > > >  		}
> > > > > +		if (current->reclaim_state)
> > > > > +			current->reclaim_state->reclaimed_slab +=
> > > > > +							bp->b_page_count;
> > > > 
> > > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together.
> > > > 
> > > > >  	} else if (bp->b_flags & _XBF_KMEM)
> > > > >  		kmem_free(bp->b_addr);
> > > > >  	_xfs_buf_free_pages(bp);
> > > > > @@ -2064,7 +2067,8 @@ int __init
> > > > >  xfs_buf_init(void)
> > > > >  {
> > > > >  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> > > > > -						KM_ZONE_HWALIGN, NULL);
> > > > > +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> > > > 
> > > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply
> > > > to a particular "use another node" memory policy when slab is in use.
> > > > Was that your intent?
> > > 
> > > It's more documentation than anything - that we shouldn't be piling
> > > these structures all on to one node because that can have severe
> > > issues with NUMA memory reclaim algorithms. i.e. the xfs-buf
> > > shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single
> > > node can reclaim all the xfs-bufs on that node without touching any
> > > other node.
> > > 
> > > That means, for example, if we instantiate all the AG header buffers
> > > on a single node (e.g. like we do at mount time) then memory
> > > pressure on that one node will generate IO stalls across the entire
> > > filesystem as other nodes doing work have to repopulate the buffer
> > > cache for any allocation or freeing of space/inodes..
> > > 
> > > IOWs, for large NUMA systems using cpusets this cache should be
> > > spread around all of memory, especially as it has NUMA aware
> > > reclaim. For everyone else, it's just documentation that improper
> > > cgroup or NUMA memory policy could cause you all sorts of problems
> > > with this cache.
> > > 
> > > It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in
> > > filesystems for inode caches largely because, at the time (~2006),
> > > the only reclaimable cache that could grow to any size large enough
> > > to cause problems was the inode cache. It's been cargo-culted ever
> > > since, whether it is needed or not (e.g. ceph).
> > > 
> > > In the case of the xfs_bufs, I've been running workloads recently
> > > that cache several million xfs_bufs and only a handful of inodes
> > > rather than the other way around. If we spread inodes because
> > > caching millions on a single node can cause problems on large NUMA
> > > machines, then we also need to spread xfs_bufs...
> > 
> > Hmm, could we capture this as a comment somewhere?
> 
> Sure, but where? We're planning on getting rid of the KM_ZONE flags
> in the near future, and most of this is specific to the impacts on
> XFS. I could put it in xfs-super.c above where we initialise all the
> slabs, I guess. Probably a separate patch, though....

Sounds like a reasonable place (to me) to record the fact that we want
inodes and metadata buffers not to end up concentrating on a single node.

At least until we start having NUMA systems with a separate "IO node" in
which to confine all the IO threads and whatnot <shudder>. :P

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig Oct. 31, 2019, 9:22 p.m. UTC | #13
On Thu, Oct 31, 2019 at 02:05:51PM -0700, Darrick J. Wong wrote:
> Sounds like a reasonable place (to me) to record the fact that we want
> inodes and metadata buffers not to end up concentrating on a single node.
> 
> At least until we start having NUMA systems with a separate "IO node" in
> which to confine all the IO threads and whatnot <shudder>. :P

Wait until someone does a Linux port to the Cray T3E ;-)
Dave Chinner Nov. 3, 2019, 9:26 p.m. UTC | #14
On Thu, Oct 31, 2019 at 02:05:51PM -0700, Darrick J. Wong wrote:
> On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote:
> > On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote:
> > > > In the case of the xfs_bufs, I've been running workloads recently
> > > > that cache several million xfs_bufs and only a handful of inodes
> > > > rather than the other way around. If we spread inodes because
> > > > caching millions on a single node can cause problems on large NUMA
> > > > machines, then we also need to spread xfs_bufs...
> > > 
> > > Hmm, could we capture this as a comment somewhere?
> > 
> > Sure, but where? We're planning on getting rid of the KM_ZONE flags
> > in the near future, and most of this is specific to the impacts on
> > XFS. I could put it in xfs-super.c above where we initialise all the
> > slabs, I guess. Probably a separate patch, though....
> 
> Sounds like a reasonable place (to me) to record the fact that we want
> inodes and metadata buffers not to end up concentrating on a single node.

Ok. I'll add yet another patch to the preliminary part of the
series. Any plans to take any of these first few patches in this
cycle?

> At least until we start having NUMA systems with a separate "IO node" in
> which to confine all the IO threads and whatnot <shudder>. :P

Been there, done that, got the t-shirt and wore it out years ago.

IO-only nodes (either via software configuration, or real
cpu/memory-less IO nodes) are one of the reasons we don't want
node-local allocation behaviour for large NUMA configs...

Cheers,

Dave.
Darrick J. Wong Nov. 4, 2019, 11:08 p.m. UTC | #15
On Mon, Nov 04, 2019 at 08:26:50AM +1100, Dave Chinner wrote:
> On Thu, Oct 31, 2019 at 02:05:51PM -0700, Darrick J. Wong wrote:
> > On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote:
> > > > > In the case of the xfs_bufs, I've been running workloads recently
> > > > > that cache several million xfs_bufs and only a handful of inodes
> > > > > rather than the other way around. If we spread inodes because
> > > > > caching millions on a single node can cause problems on large NUMA
> > > > > machines, then we also need to spread xfs_bufs...
> > > > 
> > > > Hmm, could we capture this as a comment somewhere?
> > > 
> > > Sure, but where? We're planning on getting rid of the KM_ZONE flags
> > > in the near future, and most of this is specific to the impacts on
> > > XFS. I could put it in xfs-super.c above where we initialise all the
> > > slabs, I guess. Probably a separate patch, though....
> > 
> > Sounds like a reasonable place (to me) to record the fact that we want
> > inodes and metadata buffers not to end up concentrating on a single node.
> 
> Ok. I'll add yet another patch to the preliminary part of the
> series. Any plans to take any of these first few patches in this
> cycle?

I think I have time to review patches this week. :)

(As it occurs to me that the most recent submission of this series
predates this reply, and that's why he couldn't find the patch with
a longer description...)

> > At least until we start having NUMA systems with a separate "IO node" in
> > which to confine all the IO threads and whatnot <shudder>. :P
> 
> Been there, done that, got the t-shirt and wore it out years ago.
> 
> IO-only nodes (either via software configuration, or real
> cpu/memory-less IO nodes) are one of the reasons we don't want
> node-local allocation behaviour for large NUMA configs...

Same, though purely by accident -- CPU mounting bracket cracks, CPU falls
out of machine, but the IO complex is still running, so when the machine
restarts it has a weird NUMA node containing IO but no CPU...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e484f6bead53..45b470f55ad7 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -324,6 +324,9 @@  xfs_buf_free(
 
 			__free_page(page);
 		}
+		if (current->reclaim_state)
+			current->reclaim_state->reclaimed_slab +=
+							bp->b_page_count;
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
@@ -2064,7 +2067,8 @@  int __init
 xfs_buf_init(void)
 {
 	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
-						KM_ZONE_HWALIGN, NULL);
+			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
+			NULL);
 	if (!xfs_buf_zone)
 		goto out;