Patchwork [0/5] mm: workingset: radix tree subtleties & single-page file refaults

Submitter Johannes Weiner
Date Oct. 24, 2016, 6:47 p.m.
Message ID <20161024184739.GB2125@cmpxchg.org>
Permalink /patch/9392991/
State New

Comments

Johannes Weiner - Oct. 24, 2016, 6:47 p.m.
On Wed, Oct 19, 2016 at 11:16:30AM -0700, Linus Torvalds wrote:
> On Wed, Oct 19, 2016 at 10:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > These patches make the radix tree code explicitly support and track
> > such special entries, to eliminate the subtleties and to restore the
> > thrash detection for single-page files.
> 
> Ugh. I'm not a huge fan. The patches may be good and be a cleanup in
> one respect, but they make one of my least favorite parts of the radix
> tree code worse.
> 
> The radix tree "tag" thing is really horribly confusing, and part of
> it is that there are two totally different "tags": the externally
> visible tags (separate array), and the magical internal tags (low bits
> of the node pointers that tag the pointers as internal or exceptional
> entries).
>
> And I think this series actually makes things even *more* complicated,
> because now the radix tree itself uses one magical entry in the
> externally visible tags for its own internal logic. So now it's
> *really* messed up - the external tags aren't entirely external any
> more.
> 
> Maybe I'm mis-reading it, and I'm just confused by the radix tree
> implementation? But if so, it's just another sign of just how
> confusing things are.

No, I think you're right. This is no good.

As I see it, the main distinction between the "tags" tags and the
lower bits in the entry pointers is that the former propagate up to
the root, which makes lookup by tag very efficient (e.g. find dirty
pages to sync), whereas the pointer bits are cheaper when we
specifically operate on entries anyway and branch out to different
behavior depending on the type of entry (truncate a cache range,
extend the tree, descend the tree).
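
For the record, the pointer-bit side amounts to this (paraphrasing
this era's radix-tree.h; a sketch, not the verbatim header):

/* The bottom two bits of a slot identify the entry type: */
#define RADIX_TREE_ENTRY_MASK		3UL
#define RADIX_TREE_INTERNAL_NODE	1UL	/* points to a child node */
#define RADIX_TREE_EXCEPTIONAL_ENTRY	2	/* shadow/swap/DAX entry */

static inline bool radix_tree_is_internal_node(void *ptr)
{
	return ((unsigned long)ptr & RADIX_TREE_ENTRY_MASK) ==
				RADIX_TREE_INTERNAL_NODE;
}

static inline int radix_tree_exceptional_entry(void *arg)
{
	return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
}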

My patch violated that by adding a recursively-set "lookup" tag for
the single purpose of distinguishing entry types in root->rnode.

How about this instead: given that we already mark the shadow entries
exceptional, and the exceptional bit is part of the radix tree API,
can we just introduce a node->exceptional counter for those entries
and have the radix tree code assist us with that instead? It adds the
counting for non-shadow exceptional entries as well (shmem swap slots,
and DAX non-page entries), unfortunately, but this is way cleaner. It
also makes mapping->nrexceptional and node->exceptional consistent in
DAX (Jan, could you please double check the accounting there?)
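
Spelled out, the invariant the patch below establishes is:

	node->count       - all populated slots, shadow entries included
	node->exceptional - the exceptional (shadow, shmem swap slot,
			    DAX) subset of those

so "this node holds nothing but shadow entries" becomes the plain
test node->count == node->exceptional, instead of a magic decoding
of the upper bits of count.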

What do you think? Lightly tested patch below.

> The external tag array itself is also somewhat nasty, in that it
> spreads out the tag bits for one entry maximally (ie different bits
> are in different words) so you can't even clear them together. I know
> why - it makes both iterating over a specific tag and any_tag_set()
> simpler, but it does seem confusing to me how we spread out the data
> almost maximally.
> 
> I really would love to see somebody take a big look at the two
> different tagging methods. If nothing else, explain it to me.
> 
> Because maybe this series is all great, and my objection is just that
> it makes it even harder for me to understand the code.
> 
> For example, could we do this simplification:
> 
>  - get rid of RADIX_TREE_TAG_LONGS entirely
>  - get rid of CONFIG_BASE_SMALL entirely
>  - just say that the tag bitmap is one unsigned long
>  - so RADIX_TREE_MAP_SIZE is just BITS_PER_LONG
> 
> and then at least we'd get rid of the double array and the confusion
> about loops that are actually almost never loops. Because right now
> RADIX_TREE_TAG_LONGS is usually 1, but is 2 if you're a 32-bit
> platform with !CONFIG_BASE_SMALL. So you need those loops, but it all
> looks almost entirely pointless.

AFAICS, !BASE_SMALL is the default unless you de-select BASE_FULL in
EXPERT, so that change would halve the fanout of the radix tree node
on most 32-bit setups and more than double the tree nodes we have to
allocate (two where we had one, plus the additional intermediate
levels).

The extra code sucks, but is that a cost we'd be willing to pay?
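
For reference, the layout and the kind of loop in question, roughly
as they appear in this era's radix tree code (a sketch, modulo
details):

/* In struct radix_tree_node - one bitmap per tag, so the three bits
 * describing a single slot live in three different words: */
unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];

/* One of the loops that are almost never loops; RADIX_TREE_TAG_LONGS
 * is 1 on 64-bit, 2 on 32-bit !BASE_SMALL: */
static inline int any_tag_set(struct radix_tree_node *node, unsigned int tag)
{
	int idx;

	for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
		if (node->tags[tag][idx])
			return 1;
	}
	return 0;
}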

---

From 192c2589a5845d197f758045868844623e06b4db Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 17 Oct 2016 09:00:04 -0400
Subject: [PATCH] mm: workingset: restore single-page file refault tracking

Currently, we account shadow entries in the page cache in the upper
bits of the radix_tree_node->count, behind the back of the radix tree
implementation. Because the radix tree code has no awareness of them,
we have to prevent shadow entries from going through operations where
the tree implementation relies on or modifies node->count: extending
and shrinking the tree from and to a single direct root->rnode entry.
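
For reference, the old packing, per the workingset_node_* helpers
removed below:

	pages   = node->count & RADIX_TREE_COUNT_MASK;   /* lower bits */
	shadows = node->count >> RADIX_TREE_COUNT_SHIFT; /* upper bits */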

As a consequence, we cannot store shadow entries for files that only
have index 0 populated, and thus cannot detect refaults from them,
which in turn degrades the thrashing compensation in LRU reclaim.

Another consequence is that we rely on subtleties throughout the radix
tree code, such as the node->count != 1 check in the shrinking code,
which is meant to exclude multi-entry nodes but also skips nodes with
only one shadow entry since they are accounted in the upper bits. This
is error-prone, and has in fact caused the bug fixed in d3798ae8c6f3
("mm: filemap: don't plant shadow entries without radix tree node").

To fix this, move the shadow counter from the upper bits of
node->count into a new node->exceptional counter, where all
exceptional entries are explicitly tracked by the radix tree.
node->count then counts all tree entries again, including shadows.

Switching from a magic node->count to accounting exceptional entries
natively in the radix tree code removes the fragile subtleties
mentioned above. It also allows us to store shadow entries for
single-page files again, as the radix tree recognizes exceptional
entries when extending the tree from the root->rnode singleton, thus
restoring refault detection and thrashing compensation for them.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h | 11 ++++-------
 include/linux/swap.h       | 32 --------------------------------
 lib/radix-tree.c           | 16 +++++++++++++---
 mm/filemap.c               | 45 ++++++++++++++++++++-------------------------
 mm/truncate.c              |  6 +++---
 mm/workingset.c            | 13 ++++++++-----
 6 files changed, 48 insertions(+), 75 deletions(-)
Jan Kara - Oct. 26, 2016, 9:21 a.m.
On Mon 24-10-16 14:47:39, Johannes Weiner wrote:
> 
> ---
> 
> From 192c2589a5845d197f758045868844623e06b4db Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 17 Oct 2016 09:00:04 -0400
> Subject: [PATCH] mm: workingset: restore single-page file refault tracking
> 
> Currently, we account shadow entries in the page cache in the upper
> bits of the radix_tree_node->count, behind the back of the radix tree
> implementation. Because the radix tree code has no awareness of them,
> we have to prevent shadow entries from going through operations where
> the tree implementation relies on or modifies node->count: extending
> and shrinking the tree from and to a single direct root->rnode entry.
> 
> As a consequence, we cannot store shadow entries for files that only
> have index 0 populated, and thus cannot detect refaults from them,
> which in turn degrades the thrashing compensation in LRU reclaim.
> 
> Another consequence is that we rely on subtleties throughout the radix
> tree code, such as the node->count != 1 check in the shrinking code,
> which is meant to exclude multi-entry nodes but also skips nodes with
> only one shadow entry since they are accounted in the upper bits. This
> is error-prone, and has in fact caused the bug fixed in d3798ae8c6f3
> ("mm: filemap: don't plant shadow entries without radix tree node").
> 
> To fix this, move the shadow counter from the upper bits of
> node->count into a new node->exceptional counter, where all
> exceptional entries are explicitly tracked by the radix tree.
> node->count then counts all tree entries again, including shadows.
> 
> Switching from a magic node->count to accounting exceptional entries
> natively in the radix tree code removes the fragile subtleties
> mentioned above. It also allows us to store shadow entries for
> single-page files again, as the radix tree recognizes exceptional
> entries when extending the tree from the root->rnode singleton, thus
> restoring refault detection and thrashing compensation for them.

I like this solution. Just one suggestion: I think
radix_tree_replace_slot() can now do the node counter update on its own and
that would save us having to do quite a bit of accounting outside of the
radix tree code itself and it would be less prone to bugs (forgotten
updates of a counter). What do you think?

								Honza

Johannes Weiner - Oct. 26, 2016, 6:15 p.m.
On Wed, Oct 26, 2016 at 11:21:07AM +0200, Jan Kara wrote:
> On Mon 24-10-16 14:47:39, Johannes Weiner wrote:
> > [...]
> 
> I like this solution.

Thanks Jan.

> Just one suggestion: I think radix_tree_replace_slot() can now do
> the node counter update on its own and that would save us having to
> do quite a bit of accounting outside of the radix tree code itself
> and it would be less prone to bugs (forgotten updates of a
> counter). What do you think?

This would be nice indeed, but it's bigger surgery. We'd need the
node in the context of all existing users that do slot lookup and
replacement, which is easy enough for individual lookups, but harder
for gang lookups (e.g. drivers/sh/intc/virq.c::intc_subgroup_map).
And they'd all get more complicated, AFAICS, without even using
exceptional entries.
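
The gang lookup API illustrates the problem: the caller gets an array
of slot pointers back, but no nodes to account against. The signature
as I read it in the current tree (a sketch):

	unsigned int
	radix_tree_gang_lookup_slot(struct radix_tree_root *root,
				    void ***results, unsigned long **indices,
				    unsigned long first_index,
				    unsigned int max_items);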

I'll see if I can find a nice way to do it, but any ideas are welcome.

Thanks
Linus Torvalds - Oct. 26, 2016, 6:18 p.m.
On Mon, Oct 24, 2016 at 11:47 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> How about this instead: given that we already mark the shadow entries
> exceptional, and the exceptional bit is part of the radix tree API,
> can we just introduce a node->exceptional counter for those entries
> and have the radix tree code assist us with that instead? It adds the
> counting for non-shadow exceptional entries as well (shmem swap slots,
> and DAX non-page entries), unfortunately, but this is way cleaner. It
> also makes mapping->nrexceptional and node->exceptional consistent in
> DAX (Jan, could you please double check the accounting there?)
>
> What do you think? Lightly tested patch below.

This certainly looks way better to me. I didn't *test* it, but it
doesn't make me scratch my head the way your previous patch did.

               Linus
Johannes Weiner - Oct. 26, 2016, 6:29 p.m.
On Wed, Oct 26, 2016 at 11:18:52AM -0700, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 11:47 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
> 
> This certainly looks way better to me. I didn't *test* it, but it
> doesn't make me scratch my head the way your previous patch did.

Awesome, thanks. I'll continue to beat on this for a while and then
send it on to Andrew.
Jan Kara - Oct. 27, 2016, 8:48 a.m.
On Wed 26-10-16 14:15:09, Johannes Weiner wrote:
> On Wed, Oct 26, 2016 at 11:21:07AM +0200, Jan Kara wrote:
> > On Mon 24-10-16 14:47:39, Johannes Weiner wrote:
> > > [...]
> > 
> > I like this solution.
> 
> Thanks Jan.
> 
> > Just one suggestion: I think radix_tree_replace_slot() can now do
> > the node counter update on its own and that would save us having to
> > do quite a bit of accounting outside of the radix tree code itself
> > and it would be less prone to bugs (forgotten updates of a
> > counter). What do you think?
> 
> This would be nice indeed, but it's bigger surgery. We need the node
> in the context of existing users that do slot lookup and replacement,
> which is easier for individual lookups, and harder for gang lookups
> (e.g. drivers/sh/intc/virq.c::intc_subgroup_map). And they'd all get
> more complicated, AFAICS, without even using exceptional entries.

Hum, I agree. But actually, looking at e.g. the usage of
radix_tree_replace_slot() in mm/khugepaged.c, I think the current
code is even buggy: it replaces a slot referencing a page with NULL
without updating node->count. It is in the error bail-out path, so
I'm not surprised we did not stumble over it.

So a relatively easy solution looks like this: create a new function,
say radix_tree_replace_node_slot(), taking both the node and the slot
and updating the accounting information as needed. Make this function
WARN if node is NULL and the accounting information would need to be
updated. Make the original radix_tree_replace_slot() a wrapper around
radix_tree_replace_node_slot() passing NULL as the node.

This should provide safe accounting info updates without forcing all
users to work with node pointers...
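
Roughly like this - a quick sketch to illustrate the idea only
(untested, the WARN condition is illustrative, and nothing like this
exists in the tree yet):

void radix_tree_replace_node_slot(struct radix_tree_node *node,
				  void **slot, void *item)
{
	void *old = rcu_dereference_raw(*slot);
	int count = !!item - !!old;
	int exceptional = !!radix_tree_exceptional_entry(item) -
			  !!radix_tree_exceptional_entry(old);

	if (node) {
		node->count += count;
		node->exceptional += exceptional;
	} else {
		/*
		 * Caller did not pass a node, so the counters cannot
		 * be updated here - complain if they would change.
		 */
		WARN_ON_ONCE(count || exceptional);
	}
	rcu_assign_pointer(*slot, item);
}

/* The original keeps its signature: */
static inline void radix_tree_replace_slot(void **slot, void *item)
{
	radix_tree_replace_node_slot(NULL, slot, item);
}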

								Honza

Patch

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index af3581b8a451..2674ed31fa84 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -80,14 +80,11 @@  static inline bool radix_tree_is_internal_node(void *ptr)
 #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
 					  RADIX_TREE_MAP_SHIFT))
 
-/* Internally used bits of node->count */
-#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
-#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
-
 struct radix_tree_node {
-	unsigned char	shift;	/* Bits remaining in each slot */
-	unsigned char	offset;	/* Slot offset in parent */
-	unsigned int	count;
+	unsigned char	shift;		/* Bits remaining in each slot */
+	unsigned char	offset;		/* Slot offset in parent */
+	unsigned char	count;		/* Total entry count */
+	unsigned char	exceptional;	/* Exceptional entry count */
 	union {
 		struct {
 			/* Used when ascending tree */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523cefb9b..660a11de0186 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -248,38 +248,6 @@  bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
 extern struct list_lru workingset_shadow_nodes;
 
-static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
-{
-	return node->count & RADIX_TREE_COUNT_MASK;
-}
-
-static inline void workingset_node_pages_inc(struct radix_tree_node *node)
-{
-	node->count++;
-}
-
-static inline void workingset_node_pages_dec(struct radix_tree_node *node)
-{
-	VM_WARN_ON_ONCE(!workingset_node_pages(node));
-	node->count--;
-}
-
-static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
-{
-	return node->count >> RADIX_TREE_COUNT_SHIFT;
-}
-
-static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
-{
-	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
-}
-
-static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
-{
-	VM_WARN_ON_ONCE(!workingset_node_shadows(node));
-	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-}
-
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e6d552c40dd..c7d8452d755e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -220,10 +220,10 @@  static void dump_node(struct radix_tree_node *node, unsigned long index)
 {
 	unsigned long i;
 
-	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
+	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d exceptional %d parent %p\n",
 		node, node->offset,
 		node->tags[0][0], node->tags[1][0], node->tags[2][0],
-		node->shift, node->count, node->parent);
+		node->shift, node->count, node->exceptional, node->parent);
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		unsigned long first = index | (i << node->shift);
@@ -522,8 +522,14 @@  static int radix_tree_extend(struct radix_tree_root *root,
 		node->offset = 0;
 		node->count = 1;
 		node->parent = NULL;
-		if (radix_tree_is_internal_node(slot))
+		/* Extending an existing node or root->rnode */
+		if (radix_tree_is_internal_node(slot)) {
 			entry_to_node(slot)->parent = node;
+		} else {
+			/* Moving an exceptional root->rnode to a node */
+			if (radix_tree_exceptional_entry(slot))
+				node->exceptional = 1;
+		}
 		node->slots[0] = slot;
 		slot = node_to_entry(node);
 		rcu_assign_pointer(root->rnode, slot);
@@ -649,6 +655,8 @@  int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
 	if (node) {
 		unsigned offset = get_slot_offset(node, slot);
 		node->count++;
+		if (radix_tree_exceptional_entry(item))
+			node->exceptional++;
 		BUG_ON(tag_get(node, 0, offset));
 		BUG_ON(tag_get(node, 1, offset));
 		BUG_ON(tag_get(node, 2, offset));
@@ -1561,6 +1569,8 @@  void *radix_tree_delete_item(struct radix_tree_root *root,
 	delete_sibling_entries(node, node_to_entry(slot), offset);
 	node->slots[offset] = NULL;
 	node->count--;
+	if (radix_tree_exceptional_entry(entry))
+		node->exceptional--;
 
 	__radix_tree_delete_node(root, node);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 849f459ad078..bf7d88b18374 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -128,20 +128,20 @@  static int page_cache_tree_insert(struct address_space *mapping,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 
+		if (node) {
+			node->exceptional--;
+			node->count--;
+		}
 		mapping->nrexceptional--;
+
 		if (!dax_mapping(mapping)) {
 			if (shadowp)
 				*shadowp = p;
-			if (node)
-				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
 			WARN_ON_ONCE(p !=
 				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 					 RADIX_DAX_ENTRY_LOCK));
-			/* DAX accounts exceptional entries as normal pages */
-			if (node)
-				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
 			dax_wake_mapping_entry_waiter(mapping, page->index,
 						      false);
@@ -150,7 +150,7 @@  static int page_cache_tree_insert(struct address_space *mapping,
 	radix_tree_replace_slot(slot, page);
 	mapping->nrpages++;
 	if (node) {
-		workingset_node_pages_inc(node);
+		node->count++;
 		/*
 		 * Don't track node that contains actual pages.
 		 *
@@ -184,38 +184,33 @@  static void page_cache_tree_delete(struct address_space *mapping,
 
 		radix_tree_clear_tags(&mapping->page_tree, node, slot);
 
-		if (!node) {
-			VM_BUG_ON_PAGE(nr != 1, page);
-			/*
-			 * We need a node to properly account shadow
-			 * entries. Don't plant any without. XXX
-			 */
-			shadow = NULL;
-		}
-
 		radix_tree_replace_slot(slot, shadow);
 
-		if (!node)
+		if (!node) {
+			VM_BUG_ON_PAGE(nr != 1, page);
 			break;
+		}
 
-		workingset_node_pages_dec(node);
-		if (shadow)
-			workingset_node_shadows_inc(node);
-		else
+		if (shadow) {
+			node->exceptional++;
+		} else {
+			node->count--;
 			if (__radix_tree_delete_node(&mapping->page_tree, node))
 				continue;
+		}
 
 		/*
-		 * Track node that only contains shadow entries. DAX mappings
-		 * contain no shadow entries and may contain other exceptional
-		 * entries so skip those.
+		 * Track node that only contains shadow entries. DAX and SHMEM
+		 * mappings contain no shadow entries and may contain other
+		 * exceptional entries so skip those.
 		 *
 		 * Avoid acquiring the list_lru lock if already tracked.
 		 * The list_empty() test is safe as node->private_list is
 		 * protected by mapping->tree_lock.
 		 */
-		if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
-				list_empty(&node->private_list)) {
+		if (!dax_mapping(mapping) && !shmem_mapping(mapping) &&
+		    node->count == node->exceptional &&
+		    list_empty(&node->private_list)) {
 			node->private_data = mapping;
 			list_lru_add(&workingset_shadow_nodes,
 					&node->private_list);
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..b9b2a1b42822 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -53,7 +53,8 @@  static void clear_exceptional_entry(struct address_space *mapping,
 	mapping->nrexceptional--;
 	if (!node)
 		goto unlock;
-	workingset_node_shadows_dec(node);
+	node->exceptional--;
+	node->count--;
 	/*
 	 * Don't track node without shadow entries.
 	 *
@@ -61,8 +62,7 @@  static void clear_exceptional_entry(struct address_space *mapping,
 	 * The list_empty() test is safe as node->private_list is
 	 * protected by mapping->tree_lock.
 	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
+	if (!node->exceptional && !list_empty(&node->private_list))
 		list_lru_del(&workingset_shadow_nodes,
 				&node->private_list);
 	__radix_tree_delete_node(&mapping->page_tree, node);
diff --git a/mm/workingset.c b/mm/workingset.c
index 5f07db171c03..dfaec27aedd3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,23 +418,26 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 	 * no pages, so we expect to be able to remove them all and
 	 * delete and free the empty node afterwards.
 	 */
-	if (WARN_ON_ONCE(!workingset_node_shadows(node)))
+	if (WARN_ON_ONCE(!node->exceptional))
 		goto out_invalid;
-	if (WARN_ON_ONCE(workingset_node_pages(node)))
+	if (WARN_ON_ONCE(node->count != node->exceptional))
 		goto out_invalid;
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		if (node->slots[i]) {
 			if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
 				goto out_invalid;
-			node->slots[i] = NULL;
-			workingset_node_shadows_dec(node);
+			if (WARN_ON_ONCE(!node->exceptional))
+				goto out_invalid;
 			if (WARN_ON_ONCE(!mapping->nrexceptional))
 				goto out_invalid;
+			node->slots[i] = NULL;
+			node->exceptional--;
+			node->count--;
 			mapping->nrexceptional--;
 		}
 	}
-	if (WARN_ON_ONCE(workingset_node_shadows(node)))
+	if (WARN_ON_ONCE(node->exceptional))
 		goto out_invalid;
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	__radix_tree_delete_node(&mapping->page_tree, node);