
[02/10] mm: workingset: tell cache transitions from workingset thrashing

Message ID 20180712172942.10094-3-hannes@cmpxchg.org (mailing list archive)
State New, archived

Commit Message

Johannes Weiner July 12, 2018, 5:29 p.m. UTC
Refaults happen during transitions between workingsets as well as
during in-place thrashing. Knowing the difference between the two has
a range of applications, including measuring the impact of memory
shortage on system performance, as well as the ability to balance
pressure more intelligently between the filesystem cache and the
swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem and
re-gain the 6 or 7 sparsemem section bits.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 +-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/memcontrol.c                |  2 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 95 ++++++++++++++++++++++------------
 12 files changed, 79 insertions(+), 42 deletions(-)
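As a quick illustration of the mechanism described in the commit
message, here is a simplified userspace sketch of how an eviction
counter and the new "page was active" bit can be packed into and
recovered from a shadow entry. The field widths and function names are
made up for the example; the actual packing is in the mm/workingset.c
hunk of the patch below, which also folds in the radix tree exception
bits and the bucket order.

#include <stdbool.h>

#define WS_MEMCG_SHIFT	16	/* illustrative width */
#define WS_NODES_SHIFT	10	/* illustrative width */

static unsigned long ws_pack_shadow(int memcgid, int nid,
				    unsigned long eviction, bool workingset)
{
	eviction = (eviction << WS_MEMCG_SHIFT) | memcgid;
	eviction = (eviction << WS_NODES_SHIFT) | nid;
	eviction = (eviction << 1) | workingset;	/* the new bit */
	return eviction;
}

static void ws_unpack_shadow(unsigned long entry, int *memcgid, int *nid,
			     unsigned long *eviction, bool *workingset)
{
	*workingset = entry & 1;
	entry >>= 1;
	*nid = entry & ((1UL << WS_NODES_SHIFT) - 1);
	entry >>= WS_NODES_SHIFT;
	*memcgid = entry & ((1UL << WS_MEMCG_SHIFT) - 1);
	entry >>= WS_MEMCG_SHIFT;
	*eviction = entry;
}

On refault, a set bit means the page had made it onto the active list
before it was evicted, so the refault is counted under
WORKINGSET_RESTORE (thrashing) rather than as a workingset transition.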

Comments

Arnd Bergmann July 23, 2018, 1:36 p.m. UTC | #1
On Thu, Jul 12, 2018 at 7:29 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> How many page->flags does this leave us with on 32-bit?
>
>         20 bits are always page flags
>
>         21 if you have an MMU
>
>         23 with the zone bits for DMA, Normal, HighMem, Movable
>
>         29 with the sparsemem section bits
>
>         30 if PAE is enabled
>
>         31 with this patch.
>
> So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
> nodes. If that's not enough, the system can switch to discontigmem and
> re-gain the 6 or 7 sparsemem section bits.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

It seems we ran out of bits on arm64 in randconfig builds:

In file included from /git/arm-soc/include/linux/kernel.h:10,
                 from /git/arm-soc/arch/arm64/mm/init.c:20:
/git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
/git/arm-soc/include/linux/compiler.h:357:38: error: call to
'__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                      ^
/git/arm-soc/include/linux/compiler.h:337:4: note: in definition of
macro '__compiletime_assert'
    prefix ## suffix();    \
    ^~~~~~
/git/arm-soc/include/linux/compiler.h:357:2: note: in expansion of
macro '_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^~~~~~~~~~~~~~~~~~~
/git/arm-soc/include/linux/build_bug.h:45:37: note: in expansion of
macro 'compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^~~~~~~~~~~~~~~~~~
/git/arm-soc/include/linux/build_bug.h:69:2: note: in expansion of
macro 'BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
  ^~~~~~~~~~~~~~~~
/git/arm-soc/arch/arm64/mm/init.c:618:2: note: in expansion of macro
'BUILD_BUG_ON'
  BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
  ^~~~~~~~~~~~
/git/arm-soc/scripts/Makefile.build:317: recipe for target
'arch/arm64/mm/init.o' failed

Apparently this triggered

#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
BITS_PER_LONG - NR_PAGEFLAGS
#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
#define LAST_CPUPID_WIDTH 0
#endif

and in turn

#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif

and that _last_cpupid in struct page made sizeof(struct page) larger than 64.
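The effect on the structure is roughly the following (a hedged sketch,
not the verbatim mm_types.h definition):

struct page {
	unsigned long flags;	/* [section] [node] [zone] [cpupid?] ... flag bits */
	/* ... the remaining members, normally filling out 64 bytes ... */
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;	/* spills over once cpupid no longer fits in flags */
#endif
};

With the extra page flag, LAST_CPUPID_WIDTH collapses to 0, the #ifdef
above kicks in, and sizeof(struct page) grows past the 64 bytes that
1 << STRUCT_PAGE_MAX_SHIFT allows.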

This is for a randconfig build, see https://pastebin.com/YuwSTah3
for the configuration file, some of the relevant options are

CONFIG_64BIT=y
CONFIG_MEMCG=y
CONFIG_SPARSEMEM=y
CONFIG_ARM64_PA_BITS=52
CONFIG_ARM64_64K_PAGES=y
CONFIG_NR_CPUS=64
CONFIG_NUMA_BALANCING=y
# CONFIG_SPARSEMEM_VMEMMAP is not set
CONFIG_NODES_SHIFT=2
# CONFIG_ARCH_USES_PG_UNCACHED is not set
CONFIG_MEMORY_FAILURE=y
CONFIG_IDLE_PAGE_TRACKING=y

#define MAX_NR_ZONES 3
#define ZONES_SHIFT 2
#define MAX_PHYSMEM_BITS 52
#define SECTION_SIZE_BITS 30
#define SECTIONS_WIDTH 22
#define ZONES_WIDTH 2
#define NODES_SHIFT 2
#define LAST__PID_SHIFT 8
#define NR_CPUS_BITS 6
#define LAST_CPUPID_SHIFT 14
#define NR_PAGEFLAGS 25

With the extra page flag, the sum of SECTIONS_WIDTH, NODES_SHIFT,  ZONES_WIDTH,
LAST_CPUPID_SHIFT, and NR_PAGEFLAGS is now 65. Before this change, I could
not trigger that error in randconfig builds. However, setting CONFIG_NR_CPUS or
CONFIG_NODES_SHIFT higher than the defaults would trigger it as well (randconfig
does not randomize those options).
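Spelled out as a compile-time check (just restating the arithmetic
from the defines above, not kernel code):

_Static_assert(25 /* NR_PAGEFLAGS */ + 22 /* SECTIONS_WIDTH */ +
	       2 /* NODES_SHIFT */ + 2 /* ZONES_WIDTH */ +
	       14 /* LAST_CPUPID_SHIFT */ == 65,
	       "65 bits needed, only 64 available in page->flags");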

       Arnd
Johannes Weiner July 23, 2018, 3:23 p.m. UTC | #2
Hi Arnd,

On Mon, Jul 23, 2018 at 03:36:09PM +0200, Arnd Bergmann wrote:
> On Thu, Jul 12, 2018 at 7:29 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > How many page->flags does this leave us with on 32-bit?
> >
> >         20 bits are always page flags
> >
> >         21 if you have an MMU
> >
> >         23 with the zone bits for DMA, Normal, HighMem, Movable
> >
> >         29 with the sparsemem section bits
> >
> >         30 if PAE is enabled
> >
> >         31 with this patch.
> >
> > So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
> > nodes. If that's not enough, the system can switch to discontigmem and
> > re-gain the 6 or 7 sparsemem section bits.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> It seems we ran out of bits on arm64 in randconfig builds:
> 
> In file included from /git/arm-soc/include/linux/kernel.h:10,
>                  from /git/arm-soc/arch/arm64/mm/init.c:20:
> /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
> /git/arm-soc/include/linux/compiler.h:357:38: error: call to
> '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
> failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)

This BUILD_BUG_ON() is to make sure we're sizing the VMEMMAP struct
page array properly (address space divided by struct page size).

From the code:

/*
 * Log2 of the upper bound of the size of a struct page. Used for sizing
 * the vmemmap region only, does not affect actual memory footprint.
 * We don't use sizeof(struct page) directly since taking its size here
 * requires its definition to be available at this point in the inclusion
 * chain, and it may not be a power of 2 in the first place.
 */
#define STRUCT_PAGE_MAX_SHIFT	6
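For context, the constant bounds the vmemmap region roughly along
these lines on arm64 (an illustrative sketch with made-up values, not
the exact memory.h expression):

/*
 * One struct page is needed per PAGE_SIZE bytes of linear address
 * space, so the virtual window reserved for the page array is bounded
 * by (addressable bytes >> PAGE_SHIFT) << STRUCT_PAGE_MAX_SHIFT.
 */
#define VA_BITS			48	/* illustrative */
#define PAGE_SHIFT		16	/* 64K pages, illustrative */
#define STRUCT_PAGE_MAX_SHIFT	6

#define VMEMMAP_SIZE_SKETCH \
	(1UL << (VA_BITS - PAGE_SHIFT + STRUCT_PAGE_MAX_SHIFT))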

> Apparently this triggered
> 
> #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
> BITS_PER_LONG - NR_PAGEFLAGS
> #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
> #else
> #define LAST_CPUPID_WIDTH 0
> #endif
> 
> and in turn
> 
> #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
> #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
> #endif
> 
> and that _last_cpupid in struct page made sizeof(struct page) larger than 64.
> 
> This is for a randconfig build, see https://pastebin.com/YuwSTah3
> for the configuration file, some of the relevant options are
> 
> CONFIG_64BIT=y
> CONFIG_MEMCG=y
> CONFIG_SPARSEMEM=y
> CONFIG_ARM64_PA_BITS=52
> CONFIG_ARM64_64K_PAGES=y
> CONFIG_NR_CPUS=64
> CONFIG_NUMA_BALANCING=y
> # CONFIG_SPARSEMEM_VMEMMAP is not set

However, the check isn't conditional on that config option. And when
VMEMMAP is disabled, we need 22 additional bits to identify the sparse
memory sections in page->flags as well:

> CONFIG_NODES_SHIFT=2
> # CONFIG_ARCH_USES_PG_UNCACHED is not set
> CONFIG_MEMORY_FAILURE=y
> CONFIG_IDLE_PAGE_TRACKING=y
> 
> #define MAX_NR_ZONES 3
> #define ZONES_SHIFT 2
> #define MAX_PHYSMEM_BITS 52
> #define SECTION_SIZE_BITS 30
> #define SECTIONS_WIDTH 22

^^^ Those we get back with VMEMMAP enabled.

So for configs for which the check is intended, it passes. We just
need to make it conditional on those.
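For reference, the section bits only occupy page->flags when sparsemem
is used without vmemmap; paraphrasing the logic in
include/linux/page-flags-layout.h (a sketch, not a verbatim copy):

#ifdef CONFIG_SPARSEMEM
#define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)	/* 52 - 30 = 22 here */
#else
#define SECTIONS_SHIFT	0
#endif

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTIONS_WIDTH	SECTIONS_SHIFT	/* section id stored in page->flags */
#else
#define SECTIONS_WIDTH	0		/* vmemmap: no section id needed in flags */
#endif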

---

From 1d24635a6c7cd395bad5c29a3b9e5d2e98d9ab84 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 23 Jul 2018 10:18:23 -0400
Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
 setups

Arnd reports the following arm64 randconfig build error with the PSI
patches that add another page flag:

  /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
  /git/arm-soc/include/linux/compiler.h:357:38: error: call to
  '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
  failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)

The additional page flag causes other information stored in
page->flags to get bumped into their own struct page member:

  #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
  BITS_PER_LONG - NR_PAGEFLAGS
  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
  #else
  #define LAST_CPUPID_WIDTH 0
  #endif

  #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
  #endif

which in turn causes the struct page size to exceed the size set in
STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
VMEMMAP page array according to address space and struct page size.

However, the check is performed - and triggers here - on a !VMEMMAP
config, which consumes an additional 22 page bits for the sparse
section id. When VMEMMAP is enabled, those bits are returned, cpupid
doesn't need its own member, and the page passes the VMEMMAP check.

Restrict that check to the situation it was meant to check: that we
are sizing the VMEMMAP page array correctly.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/arm64/mm/init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1b18b4722420..72c9b6778b0a 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -611,11 +611,13 @@ void __init mem_init(void)
 	BUILD_BUG_ON(TASK_SIZE_32			> TASK_SIZE_64);
 #endif
 
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
 	/*
 	 * Make sure we chose the upper bound of sizeof(struct page)
-	 * correctly.
+	 * correctly when sizing the VMEMMAP array.
 	 */
 	BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
+#endif
 
 	if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
 		extern int sysctl_overcommit_memory;
Arnd Bergmann July 23, 2018, 3:35 p.m. UTC | #3
On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Mon, Jul 23, 2018 at 03:36:09PM +0200, Arnd Bergmann wrote:
>> On Thu, Jul 12, 2018 at 7:29 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> In file included from /git/arm-soc/include/linux/kernel.h:10,
>>                  from /git/arm-soc/arch/arm64/mm/init.c:20:
>> /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
>> /git/arm-soc/include/linux/compiler.h:357:38: error: call to
>> '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
>> failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
>
> This BUILD_BUG_ON() is to make sure we're sizing the VMEMMAP struct
> page array properly (address space divided by struct page size).
>
> From the code:
>
> /*
>  * Log2 of the upper bound of the size of a struct page. Used for sizing
>  * the vmemmap region only, does not affect actual memory footprint.
>  * We don't use sizeof(struct page) directly since taking its size here
>  * requires its definition to be available at this point in the inclusion
>  * chain, and it may not be a power of 2 in the first place.
>  */
> #define STRUCT_PAGE_MAX_SHIFT   6
>
...
> However, the check isn't conditional on that config option. And when
> VMEMMAP is disabled, we need 22 additional bits to identify the sparse
> memory sections in page->flags as well:
>
>> CONFIG_NODES_SHIFT=2
>> # CONFIG_ARCH_USES_PG_UNCACHED is not set
>> CONFIG_MEMORY_FAILURE=y
>> CONFIG_IDLE_PAGE_TRACKING=y
>>
>> #define MAX_NR_ZONES 3
>> #define ZONES_SHIFT 2
>> #define MAX_PHYSMEM_BITS 52
>> #define SECTION_SIZE_BITS 30
>> #define SECTIONS_WIDTH 22
>
> ^^^ Those we get back with VMEMMAP enabled.
>
> So for configs for which the check is intended, it passes. We just
> need to make it conditional on those.

Ok, thanks for the analysis. I had missed that and was about to
send a different patch to increase STRUCT_PAGE_MAX_SHIFT
in some configurations, which would not have been as good.

> From 1d24635a6c7cd395bad5c29a3b9e5d2e98d9ab84 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 23 Jul 2018 10:18:23 -0400
> Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
>  setups
>
> Arnd reports the following arm64 randconfig build error with the PSI
> patches that add another page flag:
>

You could add further text here that I had just added to my
patch description (not sent):

    Further experiments show that the build error already existed before,
    but was only triggered with larger values of CONFIG_NR_CPUS and/or
    CONFIG_NODES_SHIFT that might be used in actual configurations but
    not in randconfig builds.

    With longer CPU and node masks, I could recreate the problem with
    kernels as old as linux-4.7 when arm64 NUMA support got added.

    Cc: stable@vger.kernel.org
    Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
    Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")

>  arch/arm64/mm/init.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1b18b4722420..72c9b6778b0a 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -611,11 +611,13 @@ void __init mem_init(void)
>         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
>  #endif
>
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
>         /*

I tested it on two broken configurations, and found that you have
a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
seems to build fine.

Tested-by: Arnd Bergmann <arnd@arndb.de>

      Arnd
Johannes Weiner July 23, 2018, 4:27 p.m. UTC | #4
On Mon, Jul 23, 2018 at 05:35:35PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > From 1d24635a6c7cd395bad5c29a3b9e5d2e98d9ab84 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 23 Jul 2018 10:18:23 -0400
> > Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
> >  setups
> >
> > Arnd reports the following arm64 randconfig build error with the PSI
> > patches that add another page flag:
> >
> 
> You could add further text here that I had just added to my
> patch description (not sent):
> 
>     Further experiments show that the build error already existed before,
>     but was only triggered with larger values of CONFIG_NR_CPUS and/or
>     CONFIG_NODES_SHIFT that might be used in actual configurations but
>     not in randconfig builds.
> 
>     With longer CPU and node masks, I could recreate the problem with
>     kernels as old as linux-4.7 when arm64 NUMA support got added.
> 
>     Cc: stable@vger.kernel.org
>     Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
>     Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")

Sure thing.

> >  arch/arm64/mm/init.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1b18b4722420..72c9b6778b0a 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -611,11 +611,13 @@ void __init mem_init(void)
> >         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
> >  #endif
> >
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> >         /*
> 
> I tested it on two broken configurations, and found that you have
> a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
> seems to build fine.
> 
> Tested-by: Arnd Bergmann <arnd@arndb.de>

Thanks for testing it, I don't have a cross-compile toolchain set up.

---

From 34c4c4549f09f971d2d391a8d652d56cb9b05475 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 23 Jul 2018 10:18:23 -0400
Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
 setups

Arnd reports the following arm64 randconfig build error with the PSI
patches that add another page flag:

  /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
  /git/arm-soc/include/linux/compiler.h:357:38: error: call to
  '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
  failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)

The additional page flag causes other information stored in
page->flags to get bumped into their own struct page member:

  #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
  BITS_PER_LONG - NR_PAGEFLAGS
  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
  #else
  #define LAST_CPUPID_WIDTH 0
  #endif

  #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
  #endif

which in turn causes the struct page size to exceed the size set in
STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
VMEMMAP page array according to address space and struct page size.

However, the check is performed - and triggers here - on a !VMEMMAP
config, which consumes an additional 22 page bits for the sparse
section id. When VMEMMAP is enabled, those bits are returned, cpupid
doesn't need its own member, and the page passes the VMEMMAP check.

Restrict that check to the situation it was meant to check: that we
are sizing the VMEMMAP page array correctly.

Says Arnd:

    Further experiments show that the build error already existed before,
    but was only triggered with larger values of CONFIG_NR_CPUS and/or
    CONFIG_NODES_SHIFT that might be used in actual configurations but
    not in randconfig builds.

    With longer CPU and node masks, I could recreate the problem with
    kernels as old as linux-4.7 when arm64 NUMA support got added.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org
Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/arm64/mm/init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1b18b4722420..86d9f9d303b0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -611,11 +611,13 @@ void __init mem_init(void)
 	BUILD_BUG_ON(TASK_SIZE_32			> TASK_SIZE_64);
 #endif
 
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
 	/*
 	 * Make sure we chose the upper bound of sizeof(struct page)
-	 * correctly.
+	 * correctly when sizing the VMEMMAP array.
 	 */
 	BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
+#endif
 
 	if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
 		extern int sysctl_overcommit_memory;
Will Deacon July 24, 2018, 3:04 p.m. UTC | #5
On Mon, Jul 23, 2018 at 12:27:35PM -0400, Johannes Weiner wrote:
> On Mon, Jul 23, 2018 at 05:35:35PM +0200, Arnd Bergmann wrote:
> > On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > index 1b18b4722420..72c9b6778b0a 100644
> > > --- a/arch/arm64/mm/init.c
> > > +++ b/arch/arm64/mm/init.c
> > > @@ -611,11 +611,13 @@ void __init mem_init(void)
> > >         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
> > >  #endif
> > >
> > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > >         /*
> > 
> > I tested it on two broken configurations, and found that you have
> > a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
> > seems to build fine.
> > 
> > Tested-by: Arnd Bergmann <arnd@arndb.de>
> 
> Thanks for testing it, I don't have a cross-compile toolchain set up.
> 
> ---

Thanks Arnd, Johannes. I can pick this up for -rc7 via the arm64 tree,
unless it's already queued elsewhere?

Will

> From 34c4c4549f09f971d2d391a8d652d56cb9b05475 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 23 Jul 2018 10:18:23 -0400
> Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
>  setups
> 
> Arnd reports the following arm64 randconfig build error with the PSI
> patches that add another page flag:
> 
>   /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
>   /git/arm-soc/include/linux/compiler.h:357:38: error: call to
>   '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
>   failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
> 
> The additional page flag causes other information stored in
> page->flags to get bumped into their own struct page member:
> 
>   #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
>   BITS_PER_LONG - NR_PAGEFLAGS
>   #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
>   #else
>   #define LAST_CPUPID_WIDTH 0
>   #endif
> 
>   #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
>   #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
>   #endif
> 
> which in turn causes the struct page size to exceed the size set in
> STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
> VMEMMAP page array according to address space and struct page size.
> 
> However, the check is performed - and triggers here - on a !VMEMMAP
> config, which consumes an additional 22 page bits for the sparse
> section id. When VMEMMAP is enabled, those bits are returned, cpupid
> doesn't need its own member, and the page passes the VMEMMAP check.
> 
> Restrict that check to the situation it was meant to check: that we
> are sizing the VMEMMAP page array correctly.
> 
> Says Arnd:
> 
>     Further experiments show that the build error already existed before,
>     but was only triggered with larger values of CONFIG_NR_CPUS and/or
>     CONFIG_NODES_SHIFT that might be used in actual configurations but
>     not in randconfig builds.
> 
>     With longer CPU and node masks, I could recreate the problem with
>     kernels as old as linux-4.7 when arm64 NUMA support got added.
> 
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Tested-by: Arnd Bergmann <arnd@arndb.de>
> Cc: stable@vger.kernel.org
> Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
> Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  arch/arm64/mm/init.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1b18b4722420..86d9f9d303b0 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -611,11 +611,13 @@ void __init mem_init(void)
>  	BUILD_BUG_ON(TASK_SIZE_32			> TASK_SIZE_64);
>  #endif
>  
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
>  	/*
>  	 * Make sure we chose the upper bound of sizeof(struct page)
> -	 * correctly.
> +	 * correctly when sizing the VMEMMAP array.
>  	 */
>  	BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
> +#endif
>  
>  	if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
>  		extern int sysctl_overcommit_memory;
> -- 
> 2.18.0
>
Will Deacon July 25, 2018, 4:06 p.m. UTC | #6
On Tue, Jul 24, 2018 at 04:04:48PM +0100, Will Deacon wrote:
> On Mon, Jul 23, 2018 at 12:27:35PM -0400, Johannes Weiner wrote:
> > On Mon, Jul 23, 2018 at 05:35:35PM +0200, Arnd Bergmann wrote:
> > > On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > index 1b18b4722420..72c9b6778b0a 100644
> > > > --- a/arch/arm64/mm/init.c
> > > > +++ b/arch/arm64/mm/init.c
> > > > @@ -611,11 +611,13 @@ void __init mem_init(void)
> > > >         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
> > > >  #endif
> > > >
> > > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > > >         /*
> > > 
> > > I tested it on two broken configurations, and found that you have
> > > a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
> > > seems to build fine.
> > > 
> > > Tested-by: Arnd Bergmann <arnd@arndb.de>
> > 
> > Thanks for testing it, I don't have a cross-compile toolchain set up.
> > 
> > ---
> 
> Thanks Arnd, Johannes. I can pick this up for -rc7 via the arm64 tree,
> unless it's already queued elsewhere?

I've pushed this to the arm64 for-next/fixes branch heading for -rc7.

Will

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@  enum node_stat_item {
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..7af1c3c15d8e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@ 
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -280,6 +281,8 @@  PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..d8c47dcdec6f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@  struct vma_swap_readahead {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
 /* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@ 
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 0604cb02e6f3..bd36b7226cf4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -915,12 +915,9 @@  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b9f3dbd885bd..c67ecf77ea8b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2370,6 +2370,7 @@  static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..c59519d600ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5283,6 +5283,8 @@  static int memory_stat_show(struct seq_file *m, void *v)
 		   stat[WORKINGSET_REFAULT]);
 	seq_printf(m, "workingset_activate %lu\n",
 		   stat[WORKINGSET_ACTIVATE]);
+	seq_printf(m, "workingset_restore %lu\n",
+		   stat[WORKINGSET_RESTORE]);
 	seq_printf(m, "workingset_nodereclaim %lu\n",
 		   stat[WORKINGSET_NODERECLAIM]);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..a6a9114e62dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -682,6 +682,8 @@  void migrate_page_states(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 07f9aa2340c3..2721ef8862d1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -451,6 +451,7 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9270a4370d54..8d1ad48ffbcd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,6 +1976,7 @@  static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a2b9518980ce..507dc9c01b88 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1145,6 +1145,7 @@  const char * const vmstat_text[] = {
 	"nr_isolated_file",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 53759a3cf99a..ef6be3d92116 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -121,7 +121,7 @@ 
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -134,6 +134,10 @@ 
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -141,6 +145,14 @@ 
  * But if this is right, the stale pages will be pushed out of memory
  * and the used pages get to stay in cache.
  *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
  *
  *		Implementation
  *
@@ -156,8 +168,7 @@ 
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -170,23 +181,28 @@ 
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +211,7 @@  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -207,8 +224,8 @@  static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -220,30 +237,30 @@  void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -263,41 +280,51 @@  bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then can result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
 
-	if (refault_distance <= active_file) {
-		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
-		rcu_read_unlock();
-		return true;
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
+out:
 	rcu_read_unlock();
-	return false;
 }
 
 /**