diff mbox series

[4/4] vfs: keep inodes with page cache off the inode shrinker LRU

Message ID 20210614211904.14420-4-hannes@cmpxchg.org (mailing list archive)
State New
Headers show
Series [1/4] mm: remove irqsave/restore locking from contexts with irqs enabled | expand

Commit Message

Johannes Weiner June 14, 2021, 9:19 p.m. UTC
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This
caused problems on highmem hosts: struct inode could put fill lowmem
zones before the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a
large git tree. It further doesn't respect cgroup memory protection
settings and can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted
cache pages in order to detect refaults. We've come to rely heavily on
this data inside reclaim for protecting the cache workingset and
driving swap behavior. We also use it to quantify and report workload
health through psi. The latter in turn is used for fleet health
monitoring, as well as driving automated memory sizing of workloads
and containers, proactive reclaim and memory offloading schemes.

The consequences of dropping page cache prematurely is that we're
seeing subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")). The issue is mostly that shrinker pools attract
pressure based on their size, and when objects get skipped the
shrinkers remember this as deferred reclaim work. This accumulates
excessive pressure on the remaining inodes, and we can quickly eat
into heavily used ones, or dirty ones that require IO to reclaim, when
there potentially is plenty of cold, clean cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An
otherwise clean and unused inode then gets queued when the last cache
entry disappears. This solves the problem without reintroducing the
reclaim issues, and generally is a bit more scalable than having to
wade through potentially hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if
find_inode() or iput() win, the shrinker will bail on the elevated
i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
inode, it will set I_FREEING and inhibit further igets(), which will
cause the other side to create a new instance of the inode instead.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c              | 46 +++++++++++++++++++++----------------
 fs/internal.h           |  1 -
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h | 50 +++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c            |  8 +++++++
 mm/truncate.c           | 19 ++++++++++++++--
 mm/vmscan.c             |  7 ++++++
 mm/workingset.c         | 10 +++++++++
 8 files changed, 120 insertions(+), 22 deletions(-)

Comments

Andrew Morton June 14, 2021, 9:59 p.m. UTC | #1
On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Historically (pre-2.5), the inode shrinker used to reclaim only empty
> inodes and skip over those that still contained page cache. This
> caused problems on highmem hosts: struct inode could put fill lowmem
> zones before the cache was getting reclaimed in the highmem zones.
> 
> To address this, the inode shrinker started to strip page cache to
> facilitate reclaiming lowmem. However, this comes with its own set of
> problems: the shrinkers may drop actively used page cache just because
> the inodes are not currently open or dirty - think working with a
> large git tree. It further doesn't respect cgroup memory protection
> settings and can cause priority inversions between containers.
> 
> Nowadays, the page cache also holds non-resident info for evicted
> cache pages in order to detect refaults. We've come to rely heavily on
> this data inside reclaim for protecting the cache workingset and
> driving swap behavior. We also use it to quantify and report workload
> health through psi. The latter in turn is used for fleet health
> monitoring, as well as driving automated memory sizing of workloads
> and containers, proactive reclaim and memory offloading schemes.
> 
> The consequences of dropping page cache prematurely is that we're
> seeing subtle and not-so-subtle failures in all of the above-mentioned
> scenarios, with the workload generally entering unexpected thrashing
> states while losing the ability to reliably detect it.
> 
> To fix this on non-highmem systems at least, going back to rotating
> inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> ("mm: don't reclaim inodes with many attached pages")) and failed
> (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> attached pages"")). The issue is mostly that shrinker pools attract
> pressure based on their size, and when objects get skipped the
> shrinkers remember this as deferred reclaim work. This accumulates
> excessive pressure on the remaining inodes, and we can quickly eat
> into heavily used ones, or dirty ones that require IO to reclaim, when
> there potentially is plenty of cold, clean cache around still.
> 
> Instead, this patch keeps populated inodes off the inode LRU in the
> first place - just like an open file or dirty state would. An
> otherwise clean and unused inode then gets queued when the last cache
> entry disappears. This solves the problem without reintroducing the
> reclaim issues, and generally is a bit more scalable than having to
> wade through potentially hundreds of thousands of busy inodes.
> 
> Locking is a bit tricky because the locks protecting the inode state
> (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> serialized through i_lock, taken before the i_pages lock, to make sure
> depopulated inodes are queued reliably. Additions may race with
> deletions, but we'll check again in the shrinker. If additions race
> with the shrinker itself, we're protected by the i_lock: if
> find_inode() or iput() win, the shrinker will bail on the elevated
> i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> inode, it will set I_FREEING and inhibit further igets(), which will
> cause the other side to create a new instance of the inode instead.
> 

And what hitherto unexpected problems will this one cause, sigh.

How exhaustively has this approach been tested?
Johannes Weiner June 14, 2021, 10:41 p.m. UTC | #2
On Mon, Jun 14, 2021 at 02:59:12PM -0700, Andrew Morton wrote:
> On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Historically (pre-2.5), the inode shrinker used to reclaim only empty
> > inodes and skip over those that still contained page cache. This
> > caused problems on highmem hosts: struct inode could put fill lowmem
> > zones before the cache was getting reclaimed in the highmem zones.
> > 
> > To address this, the inode shrinker started to strip page cache to
> > facilitate reclaiming lowmem. However, this comes with its own set of
> > problems: the shrinkers may drop actively used page cache just because
> > the inodes are not currently open or dirty - think working with a
> > large git tree. It further doesn't respect cgroup memory protection
> > settings and can cause priority inversions between containers.
> > 
> > Nowadays, the page cache also holds non-resident info for evicted
> > cache pages in order to detect refaults. We've come to rely heavily on
> > this data inside reclaim for protecting the cache workingset and
> > driving swap behavior. We also use it to quantify and report workload
> > health through psi. The latter in turn is used for fleet health
> > monitoring, as well as driving automated memory sizing of workloads
> > and containers, proactive reclaim and memory offloading schemes.
> > 
> > The consequences of dropping page cache prematurely is that we're
> > seeing subtle and not-so-subtle failures in all of the above-mentioned
> > scenarios, with the workload generally entering unexpected thrashing
> > states while losing the ability to reliably detect it.
> > 
> > To fix this on non-highmem systems at least, going back to rotating
> > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> > ("mm: don't reclaim inodes with many attached pages")) and failed
> > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> > attached pages"")). The issue is mostly that shrinker pools attract
> > pressure based on their size, and when objects get skipped the
> > shrinkers remember this as deferred reclaim work. This accumulates
> > excessive pressure on the remaining inodes, and we can quickly eat
> > into heavily used ones, or dirty ones that require IO to reclaim, when
> > there potentially is plenty of cold, clean cache around still.
> > 
> > Instead, this patch keeps populated inodes off the inode LRU in the
> > first place - just like an open file or dirty state would. An
> > otherwise clean and unused inode then gets queued when the last cache
> > entry disappears. This solves the problem without reintroducing the
> > reclaim issues, and generally is a bit more scalable than having to
> > wade through potentially hundreds of thousands of busy inodes.
> > 
> > Locking is a bit tricky because the locks protecting the inode state
> > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> > serialized through i_lock, taken before the i_pages lock, to make sure
> > depopulated inodes are queued reliably. Additions may race with
> > deletions, but we'll check again in the shrinker. If additions race
> > with the shrinker itself, we're protected by the i_lock: if
> > find_inode() or iput() win, the shrinker will bail on the elevated
> > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> > inode, it will set I_FREEING and inhibit further igets(), which will
> > cause the other side to create a new instance of the inode instead.
> > 
> 
> And what hitherto unexpected problems will this one cause, sigh.

Yeah, I wish we could have stuck with simple rotations, but I can see
how they can become problematic - even though we haven't been able to
reproduce the issue which led to the revert in the FB fleet.

> How exhaustively has this approach been tested?

This specific patch I've put through various inode / cache loads on my
local test rig, drgn-inspected the sb inode lists to check for leaked
inodes etc. - they've held up.

We've had the previous version (which was more complicated) running on
the entire Facebook fleet for a while. I did switch us back to simple
rotations in a recent rebase, though, because the one-liner is easier
to carry out of tree and (luckily) isn't causing problems for us.

It would be good to have a fix upstream that works for everybody,
though. The above-mentioned containerization and pressure detection
failures aren't theoretical, we can't run upstream in our fleet.

We're also trying to publish containerization software that showcases
how the different cgroup controllers play together, how automated
container sizing works etc, but many of those scenarios are broken in
the upstream kernel. A kernel patch dependency is a painful one for a
project like this. Tejun can probably provide more context on this.
Dave Chinner June 15, 2021, 6:26 a.m. UTC | #3
On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> Historically (pre-2.5), the inode shrinker used to reclaim only empty
> inodes and skip over those that still contained page cache. This
> caused problems on highmem hosts: struct inode could put fill lowmem
> zones before the cache was getting reclaimed in the highmem zones.
> 
> To address this, the inode shrinker started to strip page cache to
> facilitate reclaiming lowmem. However, this comes with its own set of
> problems: the shrinkers may drop actively used page cache just because
> the inodes are not currently open or dirty - think working with a
> large git tree. It further doesn't respect cgroup memory protection
> settings and can cause priority inversions between containers.
> 
> Nowadays, the page cache also holds non-resident info for evicted
> cache pages in order to detect refaults. We've come to rely heavily on
> this data inside reclaim for protecting the cache workingset and
> driving swap behavior. We also use it to quantify and report workload
> health through psi. The latter in turn is used for fleet health
> monitoring, as well as driving automated memory sizing of workloads
> and containers, proactive reclaim and memory offloading schemes.
> 
> The consequences of dropping page cache prematurely is that we're
> seeing subtle and not-so-subtle failures in all of the above-mentioned
> scenarios, with the workload generally entering unexpected thrashing
> states while losing the ability to reliably detect it.
> 
> To fix this on non-highmem systems at least, going back to rotating
> inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> ("mm: don't reclaim inodes with many attached pages")) and failed
> (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> attached pages"")). The issue is mostly that shrinker pools attract
> pressure based on their size, and when objects get skipped the
> shrinkers remember this as deferred reclaim work. This accumulates
> excessive pressure on the remaining inodes, and we can quickly eat
> into heavily used ones, or dirty ones that require IO to reclaim, when
> there potentially is plenty of cold, clean cache around still.
> 
> Instead, this patch keeps populated inodes off the inode LRU in the
> first place - just like an open file or dirty state would. An
> otherwise clean and unused inode then gets queued when the last cache
> entry disappears. This solves the problem without reintroducing the
> reclaim issues, and generally is a bit more scalable than having to
> wade through potentially hundreds of thousands of busy inodes.
> 
> Locking is a bit tricky because the locks protecting the inode state
> (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> serialized through i_lock, taken before the i_pages lock, to make sure
> depopulated inodes are queued reliably. Additions may race with
> deletions, but we'll check again in the shrinker. If additions race
> with the shrinker itself, we're protected by the i_lock: if
> find_inode() or iput() win, the shrinker will bail on the elevated
> i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> inode, it will set I_FREEING and inhibit further igets(), which will
> cause the other side to create a new instance of the inode instead.

This makes an awful mess of the inode locking and, to me, is a
serious layering violation....

> index e89df447fae3..c9956fac640e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping)
>  	return xa_empty(&mapping->i_pages);
>  }
>  
> +/*
> + * mapping_shrinkable - test if page cache state allows inode reclaim
> + * @mapping: the page cache mapping
> + *
> + * This checks the mapping's cache state for the pupose of inode
> + * reclaim and LRU management.
> + *
> + * The caller is expected to hold the i_lock, but is not required to
> + * hold the i_pages lock, which usually protects cache state. That's
> + * because the i_lock and the list_lru lock that protect the inode and
> + * its LRU state don't nest inside the irq-safe i_pages lock.
> + *
> + * Cache deletions are performed under the i_lock, which ensures that
> + * when an inode goes empty, it will reliably get queued on the LRU.
> + *
> + * Cache additions do not acquire the i_lock and may race with this
> + * check, in which case we'll report the inode as shrinkable when it
> + * has cache pages. This is okay: the shrinker also checks the
> + * refcount and the referenced bit, which will be elevated or set in
> + * the process of adding new cache pages to an inode.
> + */

.... because you're expanding the inode->i_lock to now
cover the page cache additions and removal and that massively
increases the scope of the i_lock. The ilock is for internal inode
serialisation purposes, not serialisation with external subsystems
that inodes interact with like mappings.

> +static inline bool mapping_shrinkable(struct address_space *mapping)
> +{
> +	void *head;
> +
> +	/*
> +	 * On highmem systems, there could be lowmem pressure from the
> +	 * inodes before there is highmem pressure from the page
> +	 * cache. Make inodes shrinkable regardless of cache state.
> +	 */
> +	if (IS_ENABLED(CONFIG_HIGHMEM))
> +		return true;
> +
> +	/* Cache completely empty? Shrink away. */
> +	head = rcu_access_pointer(mapping->i_pages.xa_head);
> +	if (!head)
> +		return true;
> +
> +	/*
> +	 * The xarray stores single offset-0 entries directly in the
> +	 * head pointer, which allows non-resident page cache entries
> +	 * to escape the shadow shrinker's list of xarray nodes. The
> +	 * inode shrinker needs to pick them up under memory pressure.
> +	 */
> +	if (!xa_is_node(head) && xa_is_value(head))
> +		return true;
> +
> +	return false;
> +}

It just occurred to me: mapping_shrinkable() == available for
shrinking, not "mapping can be shrunk by having pages freed from
it".

If this lives, it really needs a better name and API - this isn't
applying a check of whether the mapping can be shrunk, it indicates
whether the inode hosting the mapping is considered freeable or not.

>  /*
>   * Bits in mapping->flags.
>   */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 819d2589abef..0d0d72ced961 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page)
>  	struct address_space *mapping = page_mapping(page);
>  
>  	BUG_ON(!PageLocked(page));
> +	spin_lock(&mapping->host->i_lock);
>  	xa_lock_irq(&mapping->i_pages);
>  	__delete_from_page_cache(page, NULL);
>  	xa_unlock_irq(&mapping->i_pages);
> +	if (mapping_shrinkable(mapping))
> +		inode_add_lru(mapping->host);
> +	spin_unlock(&mapping->host->i_lock);
>  

This, to me is an example of the layering problesm here. No mapping
specific function should be using locks that belong to the mapping
host for internal mapping serialisation.

Especially considering that inode_add_lru() is defined in
fs/internal.h - nothing outside the core fs code should really be
using inode_add_lru(), just lik enothing outside of fs code should
be using the inode->i_lock for anything.

This just doesn't seem like a robust, maintainable solution to the
problem you're trying to address....

> @@ -73,8 +77,10 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>  		return;
>  
>  	dax = dax_mapping(mapping);
> -	if (!dax)
> +	if (!dax) {
> +		spin_lock(&mapping->host->i_lock);
>  		xa_lock_irq(&mapping->i_pages);
> +	}
>  
>  	for (i = j; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> @@ -93,8 +99,12 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>  		__clear_shadow_entry(mapping, index, page);
>  	}
>  
> -	if (!dax)
> +	if (!dax) {
>  		xa_unlock_irq(&mapping->i_pages);
> +		if (mapping_shrinkable(mapping))
> +			inode_add_lru(mapping->host);
> +		spin_unlock(&mapping->host->i_lock);
> +	}
>  	pvec->nr = j;
>  }

No. Just no.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc5d7cd75935..6dd5ef8a11bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(mapping != page_mapping(page));
>  
> +	if (!PageSwapCache(page))
> +		spin_lock(&mapping->host->i_lock);
>  	xa_lock_irq(&mapping->i_pages);
>  	/*
>  	 * The non racy check for a busy page.
> @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  			shadow = workingset_eviction(page, target_memcg);
>  		__delete_from_page_cache(page, shadow);
>  		xa_unlock_irq(&mapping->i_pages);
> +		if (mapping_shrinkable(mapping))
> +			inode_add_lru(mapping->host);
> +		spin_unlock(&mapping->host->i_lock);
>  

No. Inode locks have absolutely no place serialising core vmscan
algorithms.

Really, all this complexity because:

| The issue is mostly that shrinker pools attract pressure based on
| their size, and when objects get skipped the shrinkers remember this
| as deferred reclaim work.

And so you want inodes with reclaimable mappings to avoid being
considered deferred work for the inode shrinker. That's what is
occuring because we are currently returning LRU_RETRY to them, or
would also happen if we returned LRU_SKIP or LRU_ROTATE just to
ignore them.

However, it's trivial to change inode_lru_isolate() to do something
like:

	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
		if (!IS_ENABLED(CONFIG_HIGHMEM)) {
			spin_unlock(&inode->i_lock);
			return LRU_ROTATE_NODEFER;
		}
		.....
	}

And in __list_lru_walk_one() just add:

		case LRU_ROTATE_NODEFER:
			isolated++;
			/* fallthrough */
		case LRU_ROTATE:
			list_move_tail(item, &l->list);
			break;

And now inodes with active page cache  rotated to the tail of the
list and are considered to have had work done on them. Hence they
don't add to the work accumulation that the shrinker infrastructure
defers, and so will allow the page reclaim to do it's stuff with
page reclaim before such inodes will get reclaimed.

That's *much* simpler than your proposed patch and should get you
pretty much the same result.

Cheers,

Dave.
Johannes Weiner June 15, 2021, 6:50 p.m. UTC | #4
On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > Historically (pre-2.5), the inode shrinker used to reclaim only empty
> > inodes and skip over those that still contained page cache. This
> > caused problems on highmem hosts: struct inode could put fill lowmem
> > zones before the cache was getting reclaimed in the highmem zones.
> > 
> > To address this, the inode shrinker started to strip page cache to
> > facilitate reclaiming lowmem. However, this comes with its own set of
> > problems: the shrinkers may drop actively used page cache just because
> > the inodes are not currently open or dirty - think working with a
> > large git tree. It further doesn't respect cgroup memory protection
> > settings and can cause priority inversions between containers.
> > 
> > Nowadays, the page cache also holds non-resident info for evicted
> > cache pages in order to detect refaults. We've come to rely heavily on
> > this data inside reclaim for protecting the cache workingset and
> > driving swap behavior. We also use it to quantify and report workload
> > health through psi. The latter in turn is used for fleet health
> > monitoring, as well as driving automated memory sizing of workloads
> > and containers, proactive reclaim and memory offloading schemes.
> > 
> > The consequences of dropping page cache prematurely is that we're
> > seeing subtle and not-so-subtle failures in all of the above-mentioned
> > scenarios, with the workload generally entering unexpected thrashing
> > states while losing the ability to reliably detect it.
> > 
> > To fix this on non-highmem systems at least, going back to rotating
> > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> > ("mm: don't reclaim inodes with many attached pages")) and failed
> > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> > attached pages"")). The issue is mostly that shrinker pools attract
> > pressure based on their size, and when objects get skipped the
> > shrinkers remember this as deferred reclaim work. This accumulates
> > excessive pressure on the remaining inodes, and we can quickly eat
> > into heavily used ones, or dirty ones that require IO to reclaim, when
> > there potentially is plenty of cold, clean cache around still.
> > 
> > Instead, this patch keeps populated inodes off the inode LRU in the
> > first place - just like an open file or dirty state would. An
> > otherwise clean and unused inode then gets queued when the last cache
> > entry disappears. This solves the problem without reintroducing the
> > reclaim issues, and generally is a bit more scalable than having to
> > wade through potentially hundreds of thousands of busy inodes.
> > 
> > Locking is a bit tricky because the locks protecting the inode state
> > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> > serialized through i_lock, taken before the i_pages lock, to make sure
> > depopulated inodes are queued reliably. Additions may race with
> > deletions, but we'll check again in the shrinker. If additions race
> > with the shrinker itself, we're protected by the i_lock: if
> > find_inode() or iput() win, the shrinker will bail on the elevated
> > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> > inode, it will set I_FREEING and inhibit further igets(), which will
> > cause the other side to create a new instance of the inode instead.
> 
> This makes an awful mess of the inode locking and, to me, is a
> serious layering violation....

The direction through the layers is no different than set_page_dirty()
calling mark_inode_dirty() from under the page lock.

This is simply where the state change of the inode originates. What's
the benefit of having the code pretend otherwise?

> > index e89df447fae3..c9956fac640e 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping)
> >  	return xa_empty(&mapping->i_pages);
> >  }
> >  
> > +/*
> > + * mapping_shrinkable - test if page cache state allows inode reclaim
> > + * @mapping: the page cache mapping
> > + *
> > + * This checks the mapping's cache state for the pupose of inode
> > + * reclaim and LRU management.
> > + *
> > + * The caller is expected to hold the i_lock, but is not required to
> > + * hold the i_pages lock, which usually protects cache state. That's
> > + * because the i_lock and the list_lru lock that protect the inode and
> > + * its LRU state don't nest inside the irq-safe i_pages lock.
> > + *
> > + * Cache deletions are performed under the i_lock, which ensures that
> > + * when an inode goes empty, it will reliably get queued on the LRU.
> > + *
> > + * Cache additions do not acquire the i_lock and may race with this
> > + * check, in which case we'll report the inode as shrinkable when it
> > + * has cache pages. This is okay: the shrinker also checks the
> > + * refcount and the referenced bit, which will be elevated or set in
> > + * the process of adding new cache pages to an inode.
> > + */
> 
> .... because you're expanding the inode->i_lock to now cover the
> page cache additions and removal and that massively increases the
> scope of the i_lock. The ilock is for internal inode serialisation
> purposes, not serialisation with external subsystems that inodes
> interact with like mappings.

I'm expanding it to cover another state change in the inode that
happens to originate in the mapping.

Yes, it would have been slightly easier if the inode locks nested
under the page cache locks, like they do for dirty state. This way we
could have updated the inode from the innermost context where the
mapping state changes, which would have been a bit more compact.

Unfortunately, the i_pages lock is irq-safe - to deal with writeback
completion changing cache state from IRQ context. We'd have to make
the i_lock and the list_lru lock irq-safe as well, and I'm not sure
that is more desirable than ordering the locks the other way.

Doing it this way means annotating a few more deletion entry points.
But there are a limited number of ways for pages to leave the page
cache - reclaim and truncation - so it's within reason.

The i_lock isn't expanded to cover additions, if you take a look at
the code. That would have been a much larger surface area indeed.

> > +static inline bool mapping_shrinkable(struct address_space *mapping)
> > +{
> > +	void *head;
> > +
> > +	/*
> > +	 * On highmem systems, there could be lowmem pressure from the
> > +	 * inodes before there is highmem pressure from the page
> > +	 * cache. Make inodes shrinkable regardless of cache state.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_HIGHMEM))
> > +		return true;
> > +
> > +	/* Cache completely empty? Shrink away. */
> > +	head = rcu_access_pointer(mapping->i_pages.xa_head);
> > +	if (!head)
> > +		return true;
> > +
> > +	/*
> > +	 * The xarray stores single offset-0 entries directly in the
> > +	 * head pointer, which allows non-resident page cache entries
> > +	 * to escape the shadow shrinker's list of xarray nodes. The
> > +	 * inode shrinker needs to pick them up under memory pressure.
> > +	 */
> > +	if (!xa_is_node(head) && xa_is_value(head))
> > +		return true;
> > +
> > +	return false;
> > +}
> 
> It just occurred to me: mapping_shrinkable() == available for
> shrinking, not "mapping can be shrunk by having pages freed from
> it".
> 
> If this lives, it really needs a better name and API - this isn't
> applying a check of whether the mapping can be shrunk, it indicates
> whether the inode hosting the mapping is considered freeable or not.

Right, it indicates whether the state of the mapping translates to the
inode being eligigble for shrinker reclaim or not.

I've used mapping_populated() in a previous version, but this function
here captures additional reclaim policy (highmem setups e.g.), so it's
not a suitable name for it.

This is the best I could come up with, so I'm open to suggestions.

> > @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page)
> >  	struct address_space *mapping = page_mapping(page);
> >  
> >  	BUG_ON(!PageLocked(page));
> > +	spin_lock(&mapping->host->i_lock);
> >  	xa_lock_irq(&mapping->i_pages);
> >  	__delete_from_page_cache(page, NULL);
> >  	xa_unlock_irq(&mapping->i_pages);
> > +	if (mapping_shrinkable(mapping))
> > +		inode_add_lru(mapping->host);
> > +	spin_unlock(&mapping->host->i_lock);
> >  
> 
> This, to me is an example of the layering problesm here. No mapping
> specific function should be using locks that belong to the mapping
> host for internal mapping serialisation.
> 
> Especially considering that inode_add_lru() is defined in
> fs/internal.h - nothing outside the core fs code should really be
> using inode_add_lru(), just lik enothing outside of fs code should
> be using the inode->i_lock for anything.

Again, it's no different than dirty state propagation. That's simply
the direction in which this information flows through the stack.

We can encapsulate it all into wrapper/API functions if you prefer,
but it doesn't really change the direction in which the information
travels, nor the lock scoping.

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc5d7cd75935..6dd5ef8a11bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> >  	BUG_ON(!PageLocked(page));
> >  	BUG_ON(mapping != page_mapping(page));
> >  
> > +	if (!PageSwapCache(page))
> > +		spin_lock(&mapping->host->i_lock);
> >  	xa_lock_irq(&mapping->i_pages);
> >  	/*
> >  	 * The non racy check for a busy page.
> > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> >  			shadow = workingset_eviction(page, target_memcg);
> >  		__delete_from_page_cache(page, shadow);
> >  		xa_unlock_irq(&mapping->i_pages);
> > +		if (mapping_shrinkable(mapping))
> > +			inode_add_lru(mapping->host);
> > +		spin_unlock(&mapping->host->i_lock);
> >  
> 
> No. Inode locks have absolutely no place serialising core vmscan
> algorithms.

What if, and hear me out on this one, core vmscan algorithms change
the state of the inode?

The alternative to propagating this change from where it occurs is
simply that the outer layer now has to do stupid things to infer it -
essentially needing to busy poll. See below.

> Really, all this complexity because:
> 
> | The issue is mostly that shrinker pools attract pressure based on
> | their size, and when objects get skipped the shrinkers remember this
> | as deferred reclaim work.
> 
> And so you want inodes with reclaimable mappings to avoid being
> considered deferred work for the inode shrinker. That's what is
> occuring because we are currently returning LRU_RETRY to them, or
> would also happen if we returned LRU_SKIP or LRU_ROTATE just to
> ignore them.
> 
> However, it's trivial to change inode_lru_isolate() to do something
> like:
> 
> 	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> 		if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> 			spin_unlock(&inode->i_lock);
> 			return LRU_ROTATE_NODEFER;
> 		}
> 		.....
> 	}

Yeah, that's busy polling. It's simple and inefficient.

We can rotate the busy inodes to the end of the list when we see them,
but all *new* inodes get placed ahead of them and push them out the
back. If you have a large set of populated inodes paired with an
ongoing stream of metadata operations creating one-off inodes, you end
up continuously shoveling through hundreds of thousand of the same
pinned inodes to get to the few reclaimable ones mixed in.

Meanwhile, the MM is standing by, going "You know, I could tell you
exactly when those actually become reclaimable..."

If you think this is the more elegant implementation, simply because
it follows a made-up information hierarchy that doesn't actually map
to the real world, then I don't know what to tell you.

We take i_count and dirty inodes off the list because it's silly to
rotate them over and over, when we could just re-add them the moment
the pin goes away. It's not a stretch to do the same for populated
inodes, which usually make up a much bigger share of the pool.

> And in __list_lru_walk_one() just add:
> 
> 		case LRU_ROTATE_NODEFER:
> 			isolated++;
> 			/* fallthrough */
> 		case LRU_ROTATE:
> 			list_move_tail(item, &l->list);
> 			break;
> 
> And now inodes with active page cache  rotated to the tail of the
> list and are considered to have had work done on them. Hence they
> don't add to the work accumulation that the shrinker infrastructure
> defers, and so will allow the page reclaim to do it's stuff with
> page reclaim before such inodes will get reclaimed.
> 
> That's *much* simpler than your proposed patch and should get you
> pretty much the same result.

It solves the deferred work buildup, but it's absurdly inefficient.

So no, unfortunately I disagree with your assessment of this patch and
the arguments you've made in support of that. There is precedent for
propagating state change from the page cache to the inode, and doing
so is vastly more elegant and efficient than having the outer layer
abuse its own aging machinery to busy poll that state.
Dave Chinner June 16, 2021, 1:20 a.m. UTC | #5
On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > Historically (pre-2.5), the inode shrinker used to reclaim only empty
> > > inodes and skip over those that still contained page cache. This
> > > caused problems on highmem hosts: struct inode could put fill lowmem
> > > zones before the cache was getting reclaimed in the highmem zones.
> > > 
> > > To address this, the inode shrinker started to strip page cache to
> > > facilitate reclaiming lowmem. However, this comes with its own set of
> > > problems: the shrinkers may drop actively used page cache just because
> > > the inodes are not currently open or dirty - think working with a
> > > large git tree. It further doesn't respect cgroup memory protection
> > > settings and can cause priority inversions between containers.
> > > 
> > > Nowadays, the page cache also holds non-resident info for evicted
> > > cache pages in order to detect refaults. We've come to rely heavily on
> > > this data inside reclaim for protecting the cache workingset and
> > > driving swap behavior. We also use it to quantify and report workload
> > > health through psi. The latter in turn is used for fleet health
> > > monitoring, as well as driving automated memory sizing of workloads
> > > and containers, proactive reclaim and memory offloading schemes.
> > > 
> > > The consequences of dropping page cache prematurely is that we're
> > > seeing subtle and not-so-subtle failures in all of the above-mentioned
> > > scenarios, with the workload generally entering unexpected thrashing
> > > states while losing the ability to reliably detect it.
> > > 
> > > To fix this on non-highmem systems at least, going back to rotating
> > > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> > > ("mm: don't reclaim inodes with many attached pages")) and failed
> > > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> > > attached pages"")). The issue is mostly that shrinker pools attract
> > > pressure based on their size, and when objects get skipped the
> > > shrinkers remember this as deferred reclaim work. This accumulates
> > > excessive pressure on the remaining inodes, and we can quickly eat
> > > into heavily used ones, or dirty ones that require IO to reclaim, when
> > > there potentially is plenty of cold, clean cache around still.
> > > 
> > > Instead, this patch keeps populated inodes off the inode LRU in the
> > > first place - just like an open file or dirty state would. An
> > > otherwise clean and unused inode then gets queued when the last cache
> > > entry disappears. This solves the problem without reintroducing the
> > > reclaim issues, and generally is a bit more scalable than having to
> > > wade through potentially hundreds of thousands of busy inodes.
> > > 
> > > Locking is a bit tricky because the locks protecting the inode state
> > > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> > > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> > > serialized through i_lock, taken before the i_pages lock, to make sure
> > > depopulated inodes are queued reliably. Additions may race with
> > > deletions, but we'll check again in the shrinker. If additions race
> > > with the shrinker itself, we're protected by the i_lock: if
> > > find_inode() or iput() win, the shrinker will bail on the elevated
> > > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> > > inode, it will set I_FREEING and inhibit further igets(), which will
> > > cause the other side to create a new instance of the inode instead.
> > 
> > This makes an awful mess of the inode locking and, to me, is a
> > serious layering violation....
> 
> The direction through the layers is no different than set_page_dirty()
> calling mark_inode_dirty() from under the page lock.
> 
> This is simply where the state change of the inode originates. What's
> the benefit of having the code pretend otherwise?
> 
> > > index e89df447fae3..c9956fac640e 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping)
> > >  	return xa_empty(&mapping->i_pages);
> > >  }
> > >  
> > > +/*
> > > + * mapping_shrinkable - test if page cache state allows inode reclaim
> > > + * @mapping: the page cache mapping
> > > + *
> > > + * This checks the mapping's cache state for the pupose of inode
> > > + * reclaim and LRU management.
> > > + *
> > > + * The caller is expected to hold the i_lock, but is not required to
> > > + * hold the i_pages lock, which usually protects cache state. That's
> > > + * because the i_lock and the list_lru lock that protect the inode and
> > > + * its LRU state don't nest inside the irq-safe i_pages lock.
> > > + *
> > > + * Cache deletions are performed under the i_lock, which ensures that
> > > + * when an inode goes empty, it will reliably get queued on the LRU.
> > > + *
> > > + * Cache additions do not acquire the i_lock and may race with this
> > > + * check, in which case we'll report the inode as shrinkable when it
> > > + * has cache pages. This is okay: the shrinker also checks the
> > > + * refcount and the referenced bit, which will be elevated or set in
> > > + * the process of adding new cache pages to an inode.
> > > + */
> > 
> > .... because you're expanding the inode->i_lock to now cover the
> > page cache additions and removal and that massively increases the
> > scope of the i_lock. The ilock is for internal inode serialisation
> > purposes, not serialisation with external subsystems that inodes
> > interact with like mappings.
> 
> I'm expanding it to cover another state change in the inode that
> happens to originate in the mapping.
> 
> Yes, it would have been slightly easier if the inode locks nested
> under the page cache locks, like they do for dirty state. This way we
> could have updated the inode from the innermost context where the
> mapping state changes, which would have been a bit more compact.
> 
> Unfortunately, the i_pages lock is irq-safe - to deal with writeback
> completion changing cache state from IRQ context. We'd have to make
> the i_lock and the list_lru lock irq-safe as well, and I'm not sure
> that is more desirable than ordering the locks the other way.
> 
> Doing it this way means annotating a few more deletion entry points.
> But there are a limited number of ways for pages to leave the page
> cache - reclaim and truncation - so it's within reason.
> 
> The i_lock isn't expanded to cover additions, if you take a look at
> the code. That would have been a much larger surface area indeed.
> 
> > > +static inline bool mapping_shrinkable(struct address_space *mapping)
> > > +{
> > > +	void *head;
> > > +
> > > +	/*
> > > +	 * On highmem systems, there could be lowmem pressure from the
> > > +	 * inodes before there is highmem pressure from the page
> > > +	 * cache. Make inodes shrinkable regardless of cache state.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_HIGHMEM))
> > > +		return true;
> > > +
> > > +	/* Cache completely empty? Shrink away. */
> > > +	head = rcu_access_pointer(mapping->i_pages.xa_head);
> > > +	if (!head)
> > > +		return true;
> > > +
> > > +	/*
> > > +	 * The xarray stores single offset-0 entries directly in the
> > > +	 * head pointer, which allows non-resident page cache entries
> > > +	 * to escape the shadow shrinker's list of xarray nodes. The
> > > +	 * inode shrinker needs to pick them up under memory pressure.
> > > +	 */
> > > +	if (!xa_is_node(head) && xa_is_value(head))
> > > +		return true;
> > > +
> > > +	return false;
> > > +}
> > 
> > It just occurred to me: mapping_shrinkable() == available for
> > shrinking, not "mapping can be shrunk by having pages freed from
> > it".
> > 
> > If this lives, it really needs a better name and API - this isn't
> > applying a check of whether the mapping can be shrunk, it indicates
> > whether the inode hosting the mapping is considered freeable or not.
> 
> Right, it indicates whether the state of the mapping translates to the
> inode being eligigble for shrinker reclaim or not.
> 
> I've used mapping_populated() in a previous version, but this function
> here captures additional reclaim policy (highmem setups e.g.), so it's
> not a suitable name for it.
> 
> This is the best I could come up with, so I'm open to suggestions.
> 
> > > @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page)
> > >  	struct address_space *mapping = page_mapping(page);
> > >  
> > >  	BUG_ON(!PageLocked(page));
> > > +	spin_lock(&mapping->host->i_lock);
> > >  	xa_lock_irq(&mapping->i_pages);
> > >  	__delete_from_page_cache(page, NULL);
> > >  	xa_unlock_irq(&mapping->i_pages);
> > > +	if (mapping_shrinkable(mapping))
> > > +		inode_add_lru(mapping->host);
> > > +	spin_unlock(&mapping->host->i_lock);
> > >  
> > 
> > This, to me is an example of the layering problesm here. No mapping
> > specific function should be using locks that belong to the mapping
> > host for internal mapping serialisation.
> > 
> > Especially considering that inode_add_lru() is defined in
> > fs/internal.h - nothing outside the core fs code should really be
> > using inode_add_lru(), just lik enothing outside of fs code should
> > be using the inode->i_lock for anything.
> 
> Again, it's no different than dirty state propagation. That's simply
> the direction in which this information flows through the stack.
> 
> We can encapsulate it all into wrapper/API functions if you prefer,
> but it doesn't really change the direction in which the information
> travels, nor the lock scoping.
> 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index cc5d7cd75935..6dd5ef8a11bc 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > >  	BUG_ON(!PageLocked(page));
> > >  	BUG_ON(mapping != page_mapping(page));
> > >  
> > > +	if (!PageSwapCache(page))
> > > +		spin_lock(&mapping->host->i_lock);
> > >  	xa_lock_irq(&mapping->i_pages);
> > >  	/*
> > >  	 * The non racy check for a busy page.
> > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > >  			shadow = workingset_eviction(page, target_memcg);
> > >  		__delete_from_page_cache(page, shadow);
> > >  		xa_unlock_irq(&mapping->i_pages);
> > > +		if (mapping_shrinkable(mapping))
> > > +			inode_add_lru(mapping->host);
> > > +		spin_unlock(&mapping->host->i_lock);
> > >  
> > 
> > No. Inode locks have absolutely no place serialising core vmscan
> > algorithms.
> 
> What if, and hear me out on this one, core vmscan algorithms change
> the state of the inode?

Then the core vmscan algorithm has a layering violation.

vmscan operates on pages and address spaces attached to pages. It
should not be peering any deeper into the mapping than what is
contained in the mapping. If it needs to go any deeper (i.e. needs
host level operations) then that is what the address space
operations are for.  i.e. if emptying a mapping needs action to be
taken at the inode level, then it should call out via operation so
that the filesysetm can take appropriate action.

We have strong abstractions and layering between different
subsystems for good reason. If you can't make what you need work
without violating those abstractions, you need to have a compelling
reason to do so. You haven't given us one yet.

> The alternative to propagating this change from where it occurs is
> simply that the outer layer now has to do stupid things to infer it -
> essentially needing to busy poll. See below.
> 
> > Really, all this complexity because:
> > 
> > | The issue is mostly that shrinker pools attract pressure based on
> > | their size, and when objects get skipped the shrinkers remember this
> > | as deferred reclaim work.
> > 
> > And so you want inodes with reclaimable mappings to avoid being
> > considered deferred work for the inode shrinker. That's what is
> > occuring because we are currently returning LRU_RETRY to them, or
> > would also happen if we returned LRU_SKIP or LRU_ROTATE just to
> > ignore them.
> > 
> > However, it's trivial to change inode_lru_isolate() to do something
> > like:
> > 
> > 	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> > 		if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> > 			spin_unlock(&inode->i_lock);
> > 			return LRU_ROTATE_NODEFER;
> > 		}
> > 		.....
> > 	}
> 
> Yeah, that's busy polling. It's simple and inefficient.

That's how the dentry and inode cache shrinkers work - the LRUs are
lazily maintained to keep LRU list lock contention out of the fast
paths. The trade off for that is some level of inefficiency in
managing inodes and dentries that shouldn't be or aren't ideally
placed on the LRU. IOWs, we push the decision on whether inodes
should be on the LRU or whether they should be reclaimed into the
shrinker scan logic, not the fast path lookup logic where inodes are
referenced or dirtied or cleaned.

IOWs, the biggest problem with increasing the rate at which we move
inodes on and off the LRU list is lock contention. It's already a
major problem and hence the return values from the LRU code to say
whether the inode was added/removed or not to the LRU.

By increasing the number inode LRU addition sites by an order of
magnitude throughout the mm/ subsystem, we greatly increase the
difficulty of managing the inode LRU. We no longer have a nice,
predictable set of locations where the inodes are added to and
removed from the LRUs, and so now we've got a huge increase in the
number of different, unpredictable vectors that can lead to lock
contention on the LRUs...

> We can rotate the busy inodes to the end of the list when we see them,
> but all *new* inodes get placed ahead of them and push them out the
> back.  If you have a large set of populated inodes paired with an
> ongoing stream of metadata operations creating one-off inodes, you end
> up continuously shoveling through hundreds of thousand of the same
> pinned inodes to get to the few reclaimable ones mixed in.

IDGI.

What has a huge stream of metadata dirty inodes going through the
cache have to do with deciding when to reclaim a data inode with a
non-zero page cache residency?

I mean, this whole patchset is intended to prevent the inode from
being reclaimed until the page cache on the inode has been reclaimed
by memory pressure, so what does it matter how often those data inodes
are rotated through the LRU? They can't be reclaimed until the page
cache frees the pages, so the presence or absence of other inodes on
the list is entirely irrelevant.

For any workload that invloves streaming inodes through the inode
cache, the inode shrinker is going to be scanning through the
hundreds of thousands of inodes anyway. Once put in that context,
the cost rotating on data inodes with page cache attached is going
to be noise, regardless of whether it is inefficient or not.

And, really, if the inode cache has so many unreferenced inode with
attached page cache attached to them that the size of the inode
cache becomes an issue, then the page reclaim algorithms are totally
failing to reclaim page cache quickly enough to maintian the desired
system cache balance. i.e. We are supposed to be keeping a balance
between the inode cache size and the page cache size, and if we
can't keep that balance because inode reclaim can't make progress
because page cache pinning inodes, then we have a page reclaim
problem, not an inode reclaim problem.

THere's also the problem of lack of visibility. RIght now, we know
that all the unreferenced VFS inodes in the system are on the LRU
and that they are generally all accounted for either by active
references or walking the LRU. This change now creates a situation
when inodes with page cache attached are not actively references but
now also can't be found by walking the LRU list. i.e. we have inodes
that are unaccounted for by various statistics. How do we know how
many inodes are pinned by the page cache? We can't see them via
inode shrinker count tracing, we can't see them via looking up all
the active references to the inode (because there are none), and so
on.

Essentially we have nothing tracking these inodes at all at this
point, so if we ever miss a call in the mm/ to add the inode to the
LRU that you've sprinked throughout the mm/ code, we essentially
leak the inode. It's hard enough tracking down bugs that result in
"VFS: busy inodes after unmount" errors without actively adding
infrastructure that intentionally makes inodes disappear from
tracking.

> Meanwhile, the MM is standing by, going "You know, I could tell you
> exactly when those actually become reclaimable..."
> 
> If you think this is the more elegant implementation, simply because
> it follows a made-up information hierarchy that doesn't actually map
> to the real world, then I don't know what to tell you.
> 
> We take i_count and dirty inodes off the list because it's silly to
> rotate them over and over, when we could just re-add them the moment
> the pin goes away. It's not a stretch to do the same for populated
> inodes, which usually make up a much bigger share of the pool.

We don't take dirty inodes off the LRU. We only take referenced
inodes off the LRU in the shrinker isolation function because we can
guarantee that when the inode reference is dropped via iput() the
inode will be added back to the LRU if it needs to be placed there.

This is all all internal to the inode cache functionality
(fs/inode.c) and that's the way it should remain because anything
else will rapidly become unmaintainable.

> > And in __list_lru_walk_one() just add:
> > 
> > 		case LRU_ROTATE_NODEFER:
> > 			isolated++;
> > 			/* fallthrough */
> > 		case LRU_ROTATE:
> > 			list_move_tail(item, &l->list);
> > 			break;
> > 
> > And now inodes with active page cache  rotated to the tail of the
> > list and are considered to have had work done on them. Hence they
> > don't add to the work accumulation that the shrinker infrastructure
> > defers, and so will allow the page reclaim to do it's stuff with
> > page reclaim before such inodes will get reclaimed.
> > 
> > That's *much* simpler than your proposed patch and should get you
> > pretty much the same result.
> 
> It solves the deferred work buildup, but it's absurdly inefficient.

So you keep saying. Show us the numbers. Show us that it's so
inefficient that it's completely unworkable. _You_ need to justify
why violating modularity and layering is the only viable solution to
this problem. Given that there is an alternative simple, straight
forward solution to the problem, it's on you to prove it is
insufficient to solve your issues.

I'm sceptical that the complexity is necessary given that in general
workloads, the inode shrinker doesn't even register in kernel
profiles and that the problem being avoided generally isn't even hit
in most workloads. IOWs, I'll take a simple but inefficient solution
for avoiding a corner case behaviour over a solution that is
complex, fragile and full of layering violations any day of the
weeks.

Whether you disagree with me or not is largely irrelevant - nothing
you are saying changes the fact that this patchset makes an
unmanageable mess of internal filesystem locks and APIs and that's
not simply acceptible. There's a simple way to acheive what you
want, so do it that way first, then we can work out if further
improvement is necessary.

Cheers,

Dave.
Johannes Weiner June 16, 2021, 4:54 a.m. UTC | #6
On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > > >  			shadow = workingset_eviction(page, target_memcg);
> > > >  		__delete_from_page_cache(page, shadow);
> > > >  		xa_unlock_irq(&mapping->i_pages);
> > > > +		if (mapping_shrinkable(mapping))
> > > > +			inode_add_lru(mapping->host);
> > > > +		spin_unlock(&mapping->host->i_lock);
> > > >  
> > > 
> > > No. Inode locks have absolutely no place serialising core vmscan
> > > algorithms.
> > 
> > What if, and hear me out on this one, core vmscan algorithms change
> > the state of the inode?
> 
> Then the core vmscan algorithm has a layering violation.

You're just playing a word game here.

If we want the inode reclaimability to be dependent on the state of
the page cache, then reclaim and truncation change the state of the
inode. It's really that simple.

We can recognize that fact in the code, or we can be obtuse about it.

> > The alternative to propagating this change from where it occurs is
> > simply that the outer layer now has to do stupid things to infer it -
> > essentially needing to busy poll. See below.
> > 
> > > Really, all this complexity because:
> > > 
> > > | The issue is mostly that shrinker pools attract pressure based on
> > > | their size, and when objects get skipped the shrinkers remember this
> > > | as deferred reclaim work.
> > > 
> > > And so you want inodes with reclaimable mappings to avoid being
> > > considered deferred work for the inode shrinker. That's what is
> > > occuring because we are currently returning LRU_RETRY to them, or
> > > would also happen if we returned LRU_SKIP or LRU_ROTATE just to
> > > ignore them.
> > > 
> > > However, it's trivial to change inode_lru_isolate() to do something
> > > like:
> > > 
> > > 	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> > > 		if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> > > 			spin_unlock(&inode->i_lock);
> > > 			return LRU_ROTATE_NODEFER;
> > > 		}
> > > 		.....
> > > 	}
> > 
> > Yeah, that's busy polling. It's simple and inefficient.
> 
> That's how the dentry and inode cache shrinkers work - the LRUs are
> lazily maintained to keep LRU list lock contention out of the fast
> paths. The trade off for that is some level of inefficiency in
> managing inodes and dentries that shouldn't be or aren't ideally
> placed on the LRU. IOWs, we push the decision on whether inodes
> should be on the LRU or whether they should be reclaimed into the
> shrinker scan logic, not the fast path lookup logic where inodes are
> referenced or dirtied or cleaned.

That's just not true.

The way the inode shrinker works is that all indefinite pins that make
an inode unreclaimable for reasons other than its age are kept off the
LRU. Yes, we're lazy about removing them if pins are established after
the object is already queued. But we don't queue objects with that
state, and sites that clear this pin update the LRU.

That's true when the inode is pinned by a refcount or by the dirty
state. I'm proposing to handle the page cache state in exactly the
same fashion.

It's you who is arguing we should deal with those particular pins
differently than how we deal with all others, and I find it difficult
to follow your reasoning for that.

> IOWs, the biggest problem with increasing the rate at which we move
> inodes on and off the LRU list is lock contention. It's already a
> major problem and hence the return values from the LRU code to say
> whether the inode was added/removed or not to the LRU.
> 
> By increasing the number inode LRU addition sites by an order of
> magnitude throughout the mm/ subsystem, we greatly increase the
> difficulty of managing the inode LRU. We no longer have a nice,
> predictable set of locations where the inodes are added to and
> removed from the LRUs, and so now we've got a huge increase in the
> number of different, unpredictable vectors that can lead to lock
> contention on the LRUs...

Rotating the same unreclaimable inodes over and over is BETTER for
lock/LRU contention than adding them once at the end of their life?

> > We can rotate the busy inodes to the end of the list when we see them,
> > but all *new* inodes get placed ahead of them and push them out the
> > back.  If you have a large set of populated inodes paired with an
> > ongoing stream of metadata operations creating one-off inodes, you end
> > up continuously shoveling through hundreds of thousand of the same
> > pinned inodes to get to the few reclaimable ones mixed in.
> 
> IDGI.
> 
> What has a huge stream of metadata dirty inodes going through the
> cache have to do with deciding when to reclaim a data inode with a
> non-zero page cache residency?

It's simply about the work it needs to perform in order to find
reclaimable nodes in the presence of a large number of permanently
unreclaimable ones.

> I mean, this whole patchset is intended to prevent the inode from
> being reclaimed until the page cache on the inode has been reclaimed
> by memory pressure, so what does it matter how often those data inodes
> are rotated through the LRU? They can't be reclaimed until the page
> cache frees the pages, so the presence or absence of other inodes on
> the list is entirely irrelevant.
> 
> For any workload that invloves streaming inodes through the inode
> cache, the inode shrinker is going to be scanning through the
> hundreds of thousands of inodes anyway. Once put in that context,
> the cost rotating on data inodes with page cache attached is going
> to be noise, regardless of whether it is inefficient or not.

How do you figure that?

If you have 1,000 inodes pinned by page cache, and then start
streaming, you have to rotate 1,000 inodes before you get to the first
reclaimable one. It depends on memory pressure how many streaming
inodes you were able to add before the inode pool is capped and
reclaim runs again. At least in theory, the worst case is having to
rotate every unreclaimable inode for every one you can reclaim.

Whether this is likely to happen in practice or not, I don't
understand why we should special case page cache pins from existing
pins and build a pathological cornercase into the shrinker behavior.

> And, really, if the inode cache has so many unreferenced inode with
> attached page cache attached to them that the size of the inode
> cache becomes an issue, then the page reclaim algorithms are totally
> failing to reclaim page cache quickly enough to maintian the desired
> system cache balance. i.e. We are supposed to be keeping a balance
> between the inode cache size and the page cache size, and if we
> can't keep that balance because inode reclaim can't make progress
> because page cache pinning inodes, then we have a page reclaim
> problem, not an inode reclaim problem.

We may never reclaim that page cache if we have no reason to do so.

Say you have a git tree fully cached, and now somebody runs grep on a
large subdirectory with lots of files. We don't kick out the git tree
cache for stuff that is only touched once - reclaim goes directly for
the cache coming off the grep. But the shrinker needs to first rotate
through the entire git tree to get to the grep inodes. Hopefully by
the time it gets to them, their page cache will have gone. If not,
we'll rotate them and go through the entire git tree inodes again.

What's the point in doing this when we know the exact point in time
where it makes sense to start scanning the git tree inodes?

Numbers talk, I get it, and I'll try to get some benchmarking done
next week. But do you really need numbers to understand how poorly
this algorithm performs?

The strongest argument you can make then is that the algorithm simply
never really matters. And that seems like a stretch.

> THere's also the problem of lack of visibility. RIght now, we know
> that all the unreferenced VFS inodes in the system are on the LRU
> and that they are generally all accounted for either by active
> references or walking the LRU. This change now creates a situation
> when inodes with page cache attached are not actively references but
> now also can't be found by walking the LRU list. i.e. we have inodes
> that are unaccounted for by various statistics. How do we know how
> many inodes are pinned by the page cache? We can't see them via
> inode shrinker count tracing, we can't see them via looking up all
> the active references to the inode (because there are none), and so
> on.

Can you elaborate on which statistics you are referring to?

I can see nr_inodes and nr_unused, but I don't see the delta being
broken down further into open and dirty - and if dirty whether it's
the inode that's dirty, or the page cache. So it appears we already
don't know how many inodes can be pinned today by the page cache.

From what I can see there seems to be a general lack of insight into
the state of the icache that may need a more general solution that is
outside the scope of my patch here.

> Essentially we have nothing tracking these inodes at all at this
> point, so if we ever miss a call in the mm/ to add the inode to the
> LRU that you've sprinked throughout the mm/ code, we essentially
> leak the inode. It's hard enough tracking down bugs that result in
> "VFS: busy inodes after unmount" errors without actively adding
> infrastructure that intentionally makes inodes disappear from
> tracking.

Well yes, but that's no different than missing an iput(). And there is
a lot more of those than there are page cache deletion sides.

> > Meanwhile, the MM is standing by, going "You know, I could tell you
> > exactly when those actually become reclaimable..."
> > 
> > If you think this is the more elegant implementation, simply because
> > it follows a made-up information hierarchy that doesn't actually map
> > to the real world, then I don't know what to tell you.
> > 
> > We take i_count and dirty inodes off the list because it's silly to
> > rotate them over and over, when we could just re-add them the moment
> > the pin goes away. It's not a stretch to do the same for populated
> > inodes, which usually make up a much bigger share of the pool.
> 
> We don't take dirty inodes off the LRU. We only take referenced
> inodes off the LRU in the shrinker isolation function because we can
> guarantee that when the inode reference is dropped via iput() the
> inode will be added back to the LRU if it needs to be placed there.
> 
> This is all all internal to the inode cache functionality
> (fs/inode.c) and that's the way it should remain because anything
> else will rapidly become unmaintainable.

You don't seem to understand how this code actually works - this is
patently false. I think this conversation could be more productive if
you spent some time catching up on that first.
Dave Chinner June 17, 2021, 12:49 a.m. UTC | #7
On Wed, Jun 16, 2021 at 12:54:15AM -0400, Johannes Weiner wrote:
> On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > > > >  			shadow = workingset_eviction(page, target_memcg);
> > > > >  		__delete_from_page_cache(page, shadow);
> > > > >  		xa_unlock_irq(&mapping->i_pages);
> > > > > +		if (mapping_shrinkable(mapping))
> > > > > +			inode_add_lru(mapping->host);
> > > > > +		spin_unlock(&mapping->host->i_lock);
> > > > >  
> > > > 
> > > > No. Inode locks have absolutely no place serialising core vmscan
> > > > algorithms.
> > > 
> > > What if, and hear me out on this one, core vmscan algorithms change
> > > the state of the inode?
> > 
> > Then the core vmscan algorithm has a layering violation.
> 
> You're just playing a word game here.

No, I've given you plenty of constructive justification and ways to
restructure your patches to acheive what you say needs to be done.

You're the one that is rejecting any proposal I make outright and
making unjustified claims that "I don't understand this code".

Serious, Johannes, I understand how the inode cache shrinkers work
better than most people out there, and for you to just reject my
comments and suggestions outright with "you don't understand this
stuff so go learn about it' is pretty obnoxious and toxic behaviour.

I haven't disagreed at all with what you are trying to do, nor do I
think that being more selective about how we track inodes on the
LRUs is a bad thing. What I'm commenting on is that the proposed
changes are *really bad code*.

If you can work out a *clean* way to move inodes onto the LRU when
they are dirty then I'm all for it. But sprinkling inode->i_lock all
over the mm/ subsystem and then adding seemling randomly placed
inode lru manipulations isn't the way to do it.

You should consider centralising all the work involved marking a
mapping clean somewhere inside the mm/ code. Then add a single
callout that does the inode LRU work, similar to how the
fs-writeback.c code does it when the inode is marked clean by
inode_sync_complete().

IOWs, if you can't accept that there are problems with your approach
nor accept that it needs to be improved because *I* said there are
problems (and it seems that you mostly only react this way when *I*
happen to say "needs improvement"), then you need to go take a good
hard look at yourself.

Fix up your code so that it is acceptible to reviewers - telling
reviewers they don't know shit is a sure way to piss people off
unnecesarily. It doesn't make you look very good at all, and it
makes other people think twice about wanting to review your code or
work with you in future.

So, again, until you see fit to be stop being obnoxious and toxic
and that your patches need work before they are acceptible for
merging, I maintain my NACK on this patch as it standsi and I am
complete done with this discusion.

-Dave.
Andrew Morton June 17, 2021, 1:30 a.m. UTC | #8
On Wed, 16 Jun 2021 00:54:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > > > >  			shadow = workingset_eviction(page, target_memcg);
> > > > >  		__delete_from_page_cache(page, shadow);
> > > > >  		xa_unlock_irq(&mapping->i_pages);
> > > > > +		if (mapping_shrinkable(mapping))
> > > > > +			inode_add_lru(mapping->host);
> > > > > +		spin_unlock(&mapping->host->i_lock);
> > > > >  
> > > > 
> > > > No. Inode locks have absolutely no place serialising core vmscan
> > > > algorithms.
> > > 
> > > What if, and hear me out on this one, core vmscan algorithms change
> > > the state of the inode?
> > 
> > Then the core vmscan algorithm has a layering violation.
> 
> You're just playing a word game here.

Don't think so.  David is quite correct in saying that vmscan shouldn't
mess with inode state unless it's via address_space_operations?
Johannes Weiner June 18, 2021, 4:45 p.m. UTC | #9
On Thu, Jun 17, 2021 at 10:49:42AM +1000, Dave Chinner wrote:
> On Wed, Jun 16, 2021 at 12:54:15AM -0400, Johannes Weiner wrote:
> > On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > > > > >  			shadow = workingset_eviction(page, target_memcg);
> > > > > >  		__delete_from_page_cache(page, shadow);
> > > > > >  		xa_unlock_irq(&mapping->i_pages);
> > > > > > +		if (mapping_shrinkable(mapping))
> > > > > > +			inode_add_lru(mapping->host);
> > > > > > +		spin_unlock(&mapping->host->i_lock);
> > > > > >  
> > > > > 
> > > > > No. Inode locks have absolutely no place serialising core vmscan
> > > > > algorithms.
> > > > 
> > > > What if, and hear me out on this one, core vmscan algorithms change
> > > > the state of the inode?
> > > 
> > > Then the core vmscan algorithm has a layering violation.
> > 
> > You're just playing a word game here.
> 
> No, I've given you plenty of constructive justification and ways to
> restructure your patches to acheive what you say needs to be done.
> 
> You're the one that is rejecting any proposal I make outright and
> making unjustified claims that "I don't understand this code".

Hey, come on now. The argument I was making is that page cache state
is already used to update the inode LRU, and you incorrectly claimed
that this wasn't the case. My statement was a direct response to this
impasse, not a way to weasel out of your feedback.

> I haven't disagreed at all with what you are trying to do, nor do I
> think that being more selective about how we track inodes on the
> LRUs is a bad thing.

That's what it sounded like to me, but I'll chalk that up as a
misunderstanding then.

> What I'm commenting on is that the proposed changes are *really bad
> code*.

I'm not in love with it either, I can tell you that.

But it also depends on the alternatives. I don't want to fix one bug
and introduce a scalability issue. Or reintroduce subtle unforeseen
shrinker issues that had you revert the previous fix.

A revert, I might add, that could have been the incremental fix you
proposed here. Instead you glossed over Roman's rootcausing and
eintroduced the original bug. Now we're here, almost three years
later, still on square one.

So yeah, my priority is to get the behavior right first, and then
worry about architectural beauty within those constraints.

> If you can work out a *clean* way to move inodes onto the LRU when
> they are dirty then I'm all for it. But sprinkling inode->i_lock all
> over the mm/ subsystem and then adding seemling randomly placed
> inode lru manipulations isn't the way to do it.
>
> You should consider centralising all the work involved marking a
> mapping clean somewhere inside the mm/ code. Then add a single
> callout that does the inode LRU work, similar to how the
> fs-writeback.c code does it when the inode is marked clean by
> inode_sync_complete().

Yes, I'd prefer that as well. Let's look at the options.

The main source of complication is that the page cache cannot hold a
direct reference on the inode; holding the xa_lock or the i_lock is
the only thing that keeps the inode alive after we removed the page.

So our options are either overlapping the lock sections, or taking the
rcu readlock on the page cache side to bridge the deletion and the
inode callback - which then has to deal with the possibility that the
inode may have already been destroyed by the time it's called.

I would put the RCU thing aside for now as it sounds just a bit too
hairy, and too low-level an insight into the inode lifetime from the
mapping side. The xa_lock is also dropped in several outer entry
functions, so while it would clean up the fs side a bit, we wouldn't
reduce the blast radius on the MM side.

When we overlap lock sections, there are two options:

a) This patch, with the page cache lock nesting inside the i_lock. Or,

b) the way we handle dirty state: When we call set_page_dirty() ->
   mark_inode_dirty(), we hold the lock that serializes the page cache
   state when locking and updating the inode state. The hierarchy is:

       lock_page(page)                 # MM
         spin_lock(&inode->i_lock)     # FS

   The equivalent for this patch would be to have page_cache_delete()
   call mark_inode_empty() (or whatever name works for everybody),
   again under the lock that serializes the page cache state:

       xa_lock_irq(&mapping->i_pages)  # MM
         spin_lock(&inode->i_lock)     # FS

   There would be one central place calling into the fs with an API
   function, encapsulating i_lock handling in fs/inode.c.

   Great.

   The major caveat here is that i_lock would need to become IRQ safe
   in order to nest inside the xa_lock. It's not that the semantical
   layering of code here is new in any way, it's simply the lock type.

   As far as I can see, i_lock hold times are quite short - it's a
   spinlock after all. But I haven't reviewed all the sites yet, and
   there are a lot of them. They would all need to be updated.

   Likewise, list_lru locking needs to be made irq-safe. However,
   irqsafe spinlock is sort of the inevitable fate of any lock
   embedded in a data structure API. So I'm less concerned about that.

   AFAICS nothing else nests under i_lock.

If FS folks are fine with that, I would give that conversion a
shot. Lock type dependency aside, this would retain full modularity
and a clear delineation between mapping and inode property. It would
also be a fully locked scheme, so none of the subtleties of the
current patch. The end result seems clean and maintanable.
Johannes Weiner June 18, 2021, 5:09 p.m. UTC | #10
On Wed, Jun 16, 2021 at 06:30:43PM -0700, Andrew Morton wrote:
> On Wed, 16 Jun 2021 00:54:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > > > > >  			shadow = workingset_eviction(page, target_memcg);
> > > > > >  		__delete_from_page_cache(page, shadow);
> > > > > >  		xa_unlock_irq(&mapping->i_pages);
> > > > > > +		if (mapping_shrinkable(mapping))
> > > > > > +			inode_add_lru(mapping->host);
> > > > > > +		spin_unlock(&mapping->host->i_lock);
> > > > > >  
> > > > > 
> > > > > No. Inode locks have absolutely no place serialising core vmscan
> > > > > algorithms.
> > > > 
> > > > What if, and hear me out on this one, core vmscan algorithms change
> > > > the state of the inode?
> > > 
> > > Then the core vmscan algorithm has a layering violation.
> > 
> > You're just playing a word game here.
> 
> Don't think so.  David is quite correct in saying that vmscan shouldn't
> mess with inode state unless it's via address_space_operations?

It seemed to me the complaint was more about vmscan propagating this
state into the inode in general - effecting fs inode acquisitions and
LRU manipulations from the page reclaim callstack - regardless of
whether they are open-coded or indirect through API functions?

Since I mentioned better encapsulation but received no response...
Johannes Weiner June 28, 2021, 5:12 p.m. UTC | #11
On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > And in __list_lru_walk_one() just add:
> > > 
> > > 		case LRU_ROTATE_NODEFER:
> > > 			isolated++;
> > > 			/* fallthrough */
> > > 		case LRU_ROTATE:
> > > 			list_move_tail(item, &l->list);
> > > 			break;
> > > 
> > > And now inodes with active page cache  rotated to the tail of the
> > > list and are considered to have had work done on them. Hence they
> > > don't add to the work accumulation that the shrinker infrastructure
> > > defers, and so will allow the page reclaim to do it's stuff with
> > > page reclaim before such inodes will get reclaimed.
> > > 
> > > That's *much* simpler than your proposed patch and should get you
> > > pretty much the same result.
> > 
> > It solves the deferred work buildup, but it's absurdly inefficient.
> 
> So you keep saying. Show us the numbers. Show us that it's so
> inefficient that it's completely unworkable. _You_ need to justify
> why violating modularity and layering is the only viable solution to
> this problem. Given that there is an alternative simple, straight
> forward solution to the problem, it's on you to prove it is
> insufficient to solve your issues.
> 
> I'm sceptical that the complexity is necessary given that in general
> workloads, the inode shrinker doesn't even register in kernel
> profiles and that the problem being avoided generally isn't even hit
> in most workloads. IOWs, I'll take a simple but inefficient solution
> for avoiding a corner case behaviour over a solution that is
> complex, fragile and full of layering violations any day of the
> weeks.

I spent time last week benchmarking both implementations with various
combinations of icache and page cache size proportions.

You're right that most workloads don't care. But there are workloads
that do, and for them the behavior can become pathological during
drop-behind reclaim.

Page cache reclaim has two modes: 1. Workingset transitions where we
flush out the old as quickly as possible and 2. Streaming buffered IO
that doesn't benefit from caching, and so gets confined to the
smallest possible amount of memory without touching active pages.

During 1. we may rotate busy inodes a few times until their page cache
disappears. This isn't great, but at least temporary.

The issue is 2. We may do drop-behind reclaim for extended periods of
time, during which the cache workingset remains completely untouched
and the corresponding inodes never become eligible for freeing.
Rotating them over and over represents a continuous parasitic drag on
reclaim. Depending on the proportions between the icache and the
inactive cache list, this drag can make up a sizable portion or even
the majority of overall CPU consumed by reclaim. (And if you recall
the discussion around RWF_UNCACHED, dropbehind reclaim is already
bottlenecked on CPU with decent IO devices.)

My test is doing drop-behind reclaim while most memory is filled with
a cache workingset that is held by an increasing number of inodes. The
first number here is the inodes, the second is the active pages held
by each:

 1,000 * 3072 pages:  0.39%     0.05%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
 10,000 * 307 pages:  0.39%     0.04%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
 100,000 * 32 pages:  1.29%     0.05%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
 500,000 *  6 pages: 11.36%     0.08%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
1,000,000 * 3 pages: 26.40%     0.04%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
1,500,000 * 2 pages: 42.97%     0.00%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
3,000,000 * 1  page: 45.22%     0.00%  kswapd0          [kernel.kallsyms]   [k] shrink_slab

As we get into higher inode counts, the shrinkers end up burning most
of the reclaim cycles to rotate workingset inodes. For perspective,
with 3 million inodes, when the shrinkers eat 45% of the cycles to
busypoll the workingset inodes, page reclaim only consumes about 10%
to actually make forward progress.

IMO it goes from suboptimal to being a problem somewhere between 100k
and 500k in this table. That's not *that* many inodes - I'm counting
~74k files in my linux git tree alone.

North of 500k, it becomes pathological. That's probably less common,
but it happens in the real world. I checked the file servers that host
our internal source code trees. They have 16 times the memory of my
test box, but they routinely deal with 50 million+ inodes.

I think the additional complexity of updating the inode LRU according
to cache population state is justified in order to avoid these
pathological cornercases.
Johannes Weiner June 28, 2021, 6:58 p.m. UTC | #12
On Fri, Jun 18, 2021 at 12:45:18PM -0400, Johannes Weiner wrote:
> On Thu, Jun 17, 2021 at 10:49:42AM +1000, Dave Chinner wrote:
> > If you can work out a *clean* way to move inodes onto the LRU when
> > they are dirty then I'm all for it. But sprinkling inode->i_lock all
> > over the mm/ subsystem and then adding seemling randomly placed
> > inode lru manipulations isn't the way to do it.
> >
> > You should consider centralising all the work involved marking a
> > mapping clean somewhere inside the mm/ code. Then add a single
> > callout that does the inode LRU work, similar to how the
> > fs-writeback.c code does it when the inode is marked clean by
> > inode_sync_complete().
> 
> Yes, I'd prefer that as well. Let's look at the options.
> 
> The main source of complication is that the page cache cannot hold a
> direct reference on the inode; holding the xa_lock or the i_lock is
> the only thing that keeps the inode alive after we removed the page.
> 
> So our options are either overlapping the lock sections, or taking the
> rcu readlock on the page cache side to bridge the deletion and the
> inode callback - which then has to deal with the possibility that the
> inode may have already been destroyed by the time it's called.
> 
> I would put the RCU thing aside for now as it sounds just a bit too
> hairy, and too low-level an insight into the inode lifetime from the
> mapping side. The xa_lock is also dropped in several outer entry
> functions, so while it would clean up the fs side a bit, we wouldn't
> reduce the blast radius on the MM side.
> 
> When we overlap lock sections, there are two options:
> 
> a) This patch, with the page cache lock nesting inside the i_lock. Or,
> 
> b) the way we handle dirty state: When we call set_page_dirty() ->
>    mark_inode_dirty(), we hold the lock that serializes the page cache
>    state when locking and updating the inode state. The hierarchy is:
> 
>        lock_page(page)                 # MM
>          spin_lock(&inode->i_lock)     # FS
> 
>    The equivalent for this patch would be to have page_cache_delete()
>    call mark_inode_empty() (or whatever name works for everybody),
>    again under the lock that serializes the page cache state:
> 
>        xa_lock_irq(&mapping->i_pages)  # MM
>          spin_lock(&inode->i_lock)     # FS
> 
>    There would be one central place calling into the fs with an API
>    function, encapsulating i_lock handling in fs/inode.c.
> 
>    Great.
> 
>    The major caveat here is that i_lock would need to become IRQ safe
>    in order to nest inside the xa_lock. It's not that the semantical
>    layering of code here is new in any way, it's simply the lock type.
> 
>    As far as I can see, i_lock hold times are quite short - it's a
>    spinlock after all. But I haven't reviewed all the sites yet, and
>    there are a lot of them. They would all need to be updated.
> 
>    Likewise, list_lru locking needs to be made irq-safe. However,
>    irqsafe spinlock is sort of the inevitable fate of any lock
>    embedded in a data structure API. So I'm less concerned about that.
> 
>    AFAICS nothing else nests under i_lock.
> 
> If FS folks are fine with that, I would give that conversion a
> shot. Lock type dependency aside, this would retain full modularity
> and a clear delineation between mapping and inode property. It would
> also be a fully locked scheme, so none of the subtleties of the
> current patch. The end result seems clean and maintanable.

I spent some time auditing i_lock and I don't think this is workable
in its current scope. The irq disabling would be sprawling, and also
require things like d_lock to become irq safe.

That said, it might be possible to split the i_lock itself. AFAIU it
is used to serialize several independent operations. For example, it
seems i_size, i_blocks, i_nlink, i_dentry, i_private, i_*time updates
don't need to serialize against changes to i_state or inode
destruction beyond just holding a reference, and could presumably use
separate locks. Splitting that out could be a good thing in itself.

If the lifetime and LRU management of the inode can be moved to a new
lock, we could make those operations irq safe with little collateral
damage. And then the below cleanup to the current patch would be
possible, which would consolidate the MM side somewhat.

I don't mind pursuing this if people agree it's a good idea.

That said, this is a fairly invasive, non-trivial architectural
overhaul, which I don't think is a reasonable prerequisites for
addressing a real world problem we have today. I hope we can go ahead
with this fix for now and tag on the architectural rework - instead of
the other way around, and leaving memory management broken until I get
more familiar with vfs locking, let alone stabilizing those changes...

Here is the follow-up cleanup this would enable (would obviously
deadlock in the current version of i_lock). Honestly a little
underwhelming if you ask me, given the size of the prep work...

---
 fs/inode.c              |  7 +++++++
 fs/internal.h           |  1 +
 include/linux/fs.h      |  3 ++-
 include/linux/pagemap.h | 13 +------------
 mm/filemap.c            | 14 ++++++--------
 mm/truncate.c           | 22 +++++-----------------
 mm/vmscan.c             |  7 -------
 mm/workingset.c         | 12 ++----------
 8 files changed, 24 insertions(+), 55 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6b74701c1954..18c0554148a3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -457,6 +457,13 @@ static void inode_lru_list_del(struct inode *inode)
 		this_cpu_dec(nr_unused);
 }
 
+void mark_inode_shrinkable(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	inode_add_lru(inode);
+	spin_unlock(&inode->i_lock);
+}
+
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
  * @inode: inode to add
diff --git a/fs/internal.h b/fs/internal.h
index 3eb90dde62bd..6aeae7ef3380 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -146,6 +146,7 @@ extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
+extern void inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 301cd0195036..a5fb073f54c2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2409,6 +2409,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
 	__mark_inode_dirty(inode, I_DIRTY_SYNC);
 }
 
+extern void mark_inode_shrinkable(struct inode *inode);
+
 /*
  * Returns true if the given inode itself only has dirty timestamps (its pages
  * may still be dirty) and isn't currently being allocated or freed.
@@ -3214,7 +3216,6 @@ static inline void remove_inode_hash(struct inode *inode)
 }
 
 extern void inode_sb_list_add(struct inode *inode);
-extern void inode_add_lru(struct inode *inode);
 
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c9956fac640e..2e7436eba882 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -31,18 +31,7 @@ static inline bool mapping_empty(struct address_space *mapping)
  * reclaim and LRU management.
  *
  * The caller is expected to hold the i_lock, but is not required to
- * hold the i_pages lock, which usually protects cache state. That's
- * because the i_lock and the list_lru lock that protect the inode and
- * its LRU state don't nest inside the irq-safe i_pages lock.
- *
- * Cache deletions are performed under the i_lock, which ensures that
- * when an inode goes empty, it will reliably get queued on the LRU.
- *
- * Cache additions do not acquire the i_lock and may race with this
- * check, in which case we'll report the inode as shrinkable when it
- * has cache pages. This is okay: the shrinker also checks the
- * refcount and the referenced bit, which will be elevated or set in
- * the process of adding new cache pages to an inode.
+ * hold the i_pages lock, which usually protects cache state.
  */
 static inline bool mapping_shrinkable(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 0d0d72ced961..3c10830cd396 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -143,6 +143,9 @@ static void page_cache_delete(struct address_space *mapping,
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages -= nr;
+
+	if (mapping_shrinkable(mapping))
+		mark_inode_shrinkable(mapping->host);
 }
 
 static void unaccount_page_cache_page(struct address_space *mapping,
@@ -260,13 +263,9 @@ void delete_from_page_cache(struct page *page)
 	struct address_space *mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
-	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	__delete_from_page_cache(page, NULL);
 	xa_unlock_irq(&mapping->i_pages);
-	if (mapping_shrinkable(mapping))
-		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
 
 	page_cache_free_page(mapping, page);
 }
@@ -332,6 +331,9 @@ static void page_cache_delete_batch(struct address_space *mapping,
 		total_pages++;
 	}
 	mapping->nrpages -= total_pages;
+
+	if (mapping_shrinkable(mapping))
+		mark_inode_shrinkable(mapping->host);
 }
 
 void delete_from_page_cache_batch(struct address_space *mapping,
@@ -342,7 +344,6 @@ void delete_from_page_cache_batch(struct address_space *mapping,
 	if (!pagevec_count(pvec))
 		return;
 
-	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		trace_mm_filemap_delete_from_page_cache(pvec->pages[i]);
@@ -351,9 +352,6 @@ void delete_from_page_cache_batch(struct address_space *mapping,
 	}
 	page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irq(&mapping->i_pages);
-	if (mapping_shrinkable(mapping))
-		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
 
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
diff --git a/mm/truncate.c b/mm/truncate.c
index 950d73fa995d..0adce9df9100 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -40,18 +40,17 @@ static inline void __clear_shadow_entry(struct address_space *mapping,
 	if (xas_load(&xas) != entry)
 		return;
 	xas_store(&xas, NULL);
+
+	if (mapping_shrinkable(mapping))
+		mark_inode_shrinkable(mapping->host);
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
-	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	__clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
-	if (mapping_shrinkable(mapping))
-		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
 }
 
 /*
@@ -77,10 +76,8 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 		return;
 
 	dax = dax_mapping(mapping);
-	if (!dax) {
-		spin_lock(&mapping->host->i_lock);
+	if (!dax)
 		xa_lock_irq(&mapping->i_pages);
-	}
 
 	for (i = j; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
@@ -99,12 +96,8 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 		__clear_shadow_entry(mapping, index, page);
 	}
 
-	if (!dax) {
+	if (!dax)
 		xa_unlock_irq(&mapping->i_pages);
-		if (mapping_shrinkable(mapping))
-			inode_add_lru(mapping->host);
-		spin_unlock(&mapping->host->i_lock);
-	}
 	pvec->nr = j;
 }
 
@@ -579,7 +572,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	if (PageDirty(page))
 		goto failed;
@@ -587,9 +579,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	xa_unlock_irq(&mapping->i_pages);
-	if (mapping_shrinkable(mapping))
-		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -598,7 +587,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	return 1;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
-	spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6dd5ef8a11bc..cc5d7cd75935 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1055,8 +1055,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	if (!PageSwapCache(page))
-		spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	/*
 	 * The non racy check for a busy page.
@@ -1125,9 +1123,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(page, target_memcg);
 		__delete_from_page_cache(page, shadow);
 		xa_unlock_irq(&mapping->i_pages);
-		if (mapping_shrinkable(mapping))
-			inode_add_lru(mapping->host);
-		spin_unlock(&mapping->host->i_lock);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1137,8 +1132,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 cannot_free:
 	xa_unlock_irq(&mapping->i_pages);
-	if (!PageSwapCache(page))
-		spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 9d3d2b4ce44d..1576b7ca3fbb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -540,13 +540,6 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 		goto out;
 	}
 
-	if (!spin_trylock(&mapping->host->i_lock)) {
-		xa_unlock(&mapping->i_pages);
-		spin_unlock_irq(lru_lock);
-		ret = LRU_RETRY;
-		goto out;
-	}
-
 	list_lru_isolate(lru, item);
 	__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
 
@@ -564,11 +557,10 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	xa_delete_node(node, workingset_update_node);
 	__inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
 
+	if (mapping_shrinkable(mapping))
+		mark_inode_shrinkable(mapping->host);
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
-	if (mapping_shrinkable(mapping))
-		inode_add_lru(mapping->host);
-	spin_unlock(&mapping->host->i_lock);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 8830a727b0af..6b74701c1954 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -424,11 +424,20 @@  void ihold(struct inode *inode)
 }
 EXPORT_SYMBOL(ihold);
 
-static void inode_lru_list_add(struct inode *inode)
+static void __inode_add_lru(struct inode *inode, bool rotate)
 {
+	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
+		return;
+	if (atomic_read(&inode->i_count))
+		return;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return;
+	if (!mapping_shrinkable(&inode->i_data))
+		return;
+
 	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
-	else
+	else if (rotate)
 		inode->i_state |= I_REFERENCED;
 }
 
@@ -439,16 +448,11 @@  static void inode_lru_list_add(struct inode *inode)
  */
 void inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	__inode_add_lru(inode, false);
 }
 
-
 static void inode_lru_list_del(struct inode *inode)
 {
-
 	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
 }
@@ -724,10 +728,6 @@  int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 /*
  * Isolate the inode from the LRU in preparation for freeing it.
  *
- * Any inodes which are pinned purely because of attached pagecache have their
- * pagecache removed.  If the inode has metadata buffers attached to
- * mapping->private_list then try to remove them.
- *
  * If the inode has the I_REFERENCED flag set, then it means that it has been
  * used recently - the flag is set in iput_final(). When we encounter such an
  * inode, clear the flag and move it to the back of the LRU so it gets another
@@ -743,31 +743,39 @@  static enum lru_status inode_lru_isolate(struct list_head *item,
 	struct inode	*inode = container_of(item, struct inode, i_lru);
 
 	/*
-	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
-	 * If we fail to get the lock, just skip it.
+	 * We are inverting the lru lock/inode->i_lock here, so use a
+	 * trylock. If we fail to get the lock, just skip it.
 	 */
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
 	/*
-	 * Referenced or dirty inodes are still in use. Give them another pass
-	 * through the LRU as we canot reclaim them now.
+	 * Inodes can get referenced, redirtied, or repopulated while
+	 * they're already on the LRU, and this can make them
+	 * unreclaimable for a while. Remove them lazily here; iput,
+	 * sync, or the last page cache deletion will requeue them.
 	 */
 	if (atomic_read(&inode->i_count) ||
-	    (inode->i_state & ~I_REFERENCED)) {
+	    (inode->i_state & ~I_REFERENCED) ||
+	    !mapping_shrinkable(&inode->i_data)) {
 		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
 		this_cpu_dec(nr_unused);
 		return LRU_REMOVED;
 	}
 
-	/* recently referenced inodes get one more pass */
+	/* Recently referenced inodes get one more pass */
 	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
 	}
 
+	/*
+	 * On highmem systems, mapping_shrinkable() permits dropping
+	 * page cache in order to free up struct inodes: lowmem might
+	 * be under pressure before the cache inside the highmem zone.
+	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -1634,7 +1642,7 @@  static void iput_final(struct inode *inode)
 	if (!drop &&
 	    !(inode->i_state & I_DONTCACHE) &&
 	    (sb->s_flags & SB_ACTIVE)) {
-		inode_add_lru(inode);
+		__inode_add_lru(inode, true);
 		spin_unlock(&inode->i_lock);
 		return;
 	}
diff --git a/fs/internal.h b/fs/internal.h
index 6aeae7ef3380..3eb90dde62bd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -146,7 +146,6 @@  extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8652ed7cdce8..301cd0195036 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3214,6 +3214,7 @@  static inline void remove_inode_hash(struct inode *inode)
 }
 
 extern void inode_sb_list_add(struct inode *inode);
+extern void inode_add_lru(struct inode *inode);
 
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e89df447fae3..c9956fac640e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -23,6 +23,56 @@  static inline bool mapping_empty(struct address_space *mapping)
 	return xa_empty(&mapping->i_pages);
 }
 
+/*
+ * mapping_shrinkable - test if page cache state allows inode reclaim
+ * @mapping: the page cache mapping
+ *
+ * This checks the mapping's cache state for the pupose of inode
+ * reclaim and LRU management.
+ *
+ * The caller is expected to hold the i_lock, but is not required to
+ * hold the i_pages lock, which usually protects cache state. That's
+ * because the i_lock and the list_lru lock that protect the inode and
+ * its LRU state don't nest inside the irq-safe i_pages lock.
+ *
+ * Cache deletions are performed under the i_lock, which ensures that
+ * when an inode goes empty, it will reliably get queued on the LRU.
+ *
+ * Cache additions do not acquire the i_lock and may race with this
+ * check, in which case we'll report the inode as shrinkable when it
+ * has cache pages. This is okay: the shrinker also checks the
+ * refcount and the referenced bit, which will be elevated or set in
+ * the process of adding new cache pages to an inode.
+ */
+static inline bool mapping_shrinkable(struct address_space *mapping)
+{
+	void *head;
+
+	/*
+	 * On highmem systems, there could be lowmem pressure from the
+	 * inodes before there is highmem pressure from the page
+	 * cache. Make inodes shrinkable regardless of cache state.
+	 */
+	if (IS_ENABLED(CONFIG_HIGHMEM))
+		return true;
+
+	/* Cache completely empty? Shrink away. */
+	head = rcu_access_pointer(mapping->i_pages.xa_head);
+	if (!head)
+		return true;
+
+	/*
+	 * The xarray stores single offset-0 entries directly in the
+	 * head pointer, which allows non-resident page cache entries
+	 * to escape the shadow shrinker's list of xarray nodes. The
+	 * inode shrinker needs to pick them up under memory pressure.
+	 */
+	if (!xa_is_node(head) && xa_is_value(head))
+		return true;
+
+	return false;
+}
+
 /*
  * Bits in mapping->flags.
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 819d2589abef..0d0d72ced961 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -260,9 +260,13 @@  void delete_from_page_cache(struct page *page)
 	struct address_space *mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	__delete_from_page_cache(page, NULL);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 
 	page_cache_free_page(mapping, page);
 }
@@ -338,6 +342,7 @@  void delete_from_page_cache_batch(struct address_space *mapping,
 	if (!pagevec_count(pvec))
 		return;
 
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		trace_mm_filemap_delete_from_page_cache(pvec->pages[i]);
@@ -346,6 +351,9 @@  void delete_from_page_cache_batch(struct address_space *mapping,
 	}
 	page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
diff --git a/mm/truncate.c b/mm/truncate.c
index 95934c98259a..950d73fa995d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -45,9 +45,13 @@  static inline void __clear_shadow_entry(struct address_space *mapping,
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	__clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 }
 
 /*
@@ -73,8 +77,10 @@  static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 		return;
 
 	dax = dax_mapping(mapping);
-	if (!dax)
+	if (!dax) {
+		spin_lock(&mapping->host->i_lock);
 		xa_lock_irq(&mapping->i_pages);
+	}
 
 	for (i = j; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
@@ -93,8 +99,12 @@  static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 		__clear_shadow_entry(mapping, index, page);
 	}
 
-	if (!dax)
+	if (!dax) {
 		xa_unlock_irq(&mapping->i_pages);
+		if (mapping_shrinkable(mapping))
+			inode_add_lru(mapping->host);
+		spin_unlock(&mapping->host->i_lock);
+	}
 	pvec->nr = j;
 }
 
@@ -569,6 +579,7 @@  invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	if (PageDirty(page))
 		goto failed;
@@ -576,6 +587,9 @@  invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -584,6 +598,7 @@  invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	return 1;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
+	spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc5d7cd75935..6dd5ef8a11bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1055,6 +1055,8 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	if (!PageSwapCache(page))
+		spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
 	/*
 	 * The non racy check for a busy page.
@@ -1123,6 +1125,9 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(page, target_memcg);
 		__delete_from_page_cache(page, shadow);
 		xa_unlock_irq(&mapping->i_pages);
+		if (mapping_shrinkable(mapping))
+			inode_add_lru(mapping->host);
+		spin_unlock(&mapping->host->i_lock);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1132,6 +1137,8 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 cannot_free:
 	xa_unlock_irq(&mapping->i_pages);
+	if (!PageSwapCache(page))
+		spin_unlock(&mapping->host->i_lock);
 	return 0;
 }
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 4f7a306ce75a..9d3d2b4ce44d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -540,6 +540,13 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 		goto out;
 	}
 
+	if (!spin_trylock(&mapping->host->i_lock)) {
+		xa_unlock(&mapping->i_pages);
+		spin_unlock_irq(lru_lock);
+		ret = LRU_RETRY;
+		goto out;
+	}
+
 	list_lru_isolate(lru, item);
 	__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
 
@@ -559,6 +566,9 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();