Message ID | 1566410125-66011-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State      | New, archived
Series     | [v2,-mm] mm: account deferred split THPs into MemAvailable
On Thu 22-08-19 01:55:25, Yang Shi wrote:
> Available memory is one of the most important metrics for memory
> pressure.

I would disagree with this statement. It is a rough estimate that tells
you how much memory you can allocate before going into a more expensive
reclaim (mostly swapping). Allocating that amount might still result in
direct-reclaim-induced stalls. I do realize that this is a simple metric
that is attractive to use and works in many cases, though.

> Currently, the deferred split THPs are not accounted into available
> memory, but they are actually reclaimable, like reclaimable slabs.
>
> And they seem very common with common workloads when THP is enabled. A
> simple run of the MariaDB test from mmtests with THP set to "always"
> shows it could generate over fifteen thousand deferred split THPs
> (accumulating around 30G in a one-hour run, 75% of the 40G of memory in
> my VM). It looks worth accounting for in MemAvailable.

OK, this makes sense. But your numbers above are really worrying.
Accumulating such a large amount of pages that are likely not going to
be used is really bad. They essentially block any higher-order
allocations and also push the system towards more memory pressure.

IIUC, deferred splitting is mostly a workaround for nasty locking issues
during splitting, right? It is not really an optimization to cache THPs
for reuse or anything like that. What is the reason this is not done
from a worker context? At least THPs which would be freed completely
sound like good candidates for kworker tear-down, no?

> Record the number of freeable normal pages of deferred split THPs in
> the second tail page, and account it in KReclaimable. Although THP
> allocations are not exactly "kernel allocations", once they are
> unmapped they are in fact kernel-only. KReclaimable is already
> accounted into MemAvailable.

This sounds reasonable to me.

> When the deferred split THPs get split due to memory pressure, or
> freed, just decrease the counter by the recorded number.
>
> With this change, when running a program which populates a 1G address
> space and then calls madvise(MADV_DONTNEED) on 511 pages of every THP,
> /proc/meminfo shows the deferred split THPs accounted properly.
>
> Populated, before calling madvise(MADV_DONTNEED):
> MemAvailable:   43531960 kB
> AnonPages:       1096660 kB
> KReclaimable:      26156 kB
> AnonHugePages:   1056768 kB
>
> After calling madvise(MADV_DONTNEED):
> MemAvailable:   44411164 kB
> AnonPages:         50140 kB
> KReclaimable:    1070640 kB
> AnonHugePages:     10240 kB
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

Other than the above concern, which is a little bit orthogonal, the
patch looks reasonable to me. I might be missing subtle THPisms, so I am
not going to ack, though.

> ---
>  Documentation/filesystems/proc.txt |  4 ++--
>  include/linux/huge_mm.h            |  7 +++++--
>  include/linux/mm_types.h           |  3 ++-
>  mm/huge_memory.c                   | 13 ++++++++++++-
>  mm/rmap.c                          |  4 ++--
>  5 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 99ca040..93fc183 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -968,8 +968,8 @@ ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
>                 with huge pages
> ShmemPmdMapped: Shared memory mapped into userspace with huge pages
>   KReclaimable: Kernel allocations that the kernel will attempt to reclaim
> -              under memory pressure. Includes SReclaimable (below), and other
> -              direct allocations with a shrinker.
> +              under memory pressure. Includes SReclaimable (below), deferred
> +              split THPs, and other direct allocations with a shrinker.
>           Slab: in-kernel data structures cache
>  SReclaimable: Part of Slab, that might be reclaimed, such as caches
>    SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 61c9ffd..c194630 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -162,7 +162,7 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return split_huge_page_to_list(page, NULL);
>  }
> -void deferred_split_huge_page(struct page *page);
> +void deferred_split_huge_page(struct page *page, unsigned int nr);
>
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze, struct page *page);
> @@ -324,7 +324,10 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> -static inline void deferred_split_huge_page(struct page *page) {}
> +static inline void deferred_split_huge_page(struct page *page, unsigned int nr)
> +{
> +}
> +
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 156640c..17e0fc5 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -138,7 +138,8 @@ struct page {
>  		};
>  		struct {	/* Second tail page of compound page */
>  			unsigned long _compound_pad_1;	/* compound_head */
> -			unsigned long _compound_pad_2;
> +			/* Freeable normal pages for deferred split shrinker */
> +			unsigned long nr_freeable;
>  			/* For both global and memcg */
>  			struct list_head deferred_list;
>  		};
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c9a596e..e04ac4d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>
>  	INIT_LIST_HEAD(page_deferred_list(page));
>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
> +	page[2].nr_freeable = 0;
>  }
>
>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  		if (!list_empty(page_deferred_list(head))) {
>  			ds_queue->split_queue_len--;
>  			list_del(page_deferred_list(head));
> +			__mod_node_page_state(page_pgdat(page),
> +					      NR_KERNEL_MISC_RECLAIMABLE,
> +					      -head[2].nr_freeable);
> +			head[2].nr_freeable = 0;
>  		}
>  		if (mapping)
>  			__dec_node_page_state(page, NR_SHMEM_THPS);
> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>  		ds_queue->split_queue_len--;
>  		list_del(page_deferred_list(page));
>  	}
> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> +			      -page[2].nr_freeable);
> +	page[2].nr_freeable = 0;
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  	free_compound_page(page);
>  }
>
> -void deferred_split_huge_page(struct page *page)
> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>  #ifdef CONFIG_MEMCG
> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>  		return;
>
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> +	page[2].nr_freeable += nr;
> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> +			      nr);
>  	if (list_empty(page_deferred_list(page))) {
>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>  		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e5dfe2a..6008fab 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>
>  	if (nr) {
>  		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
> -		deferred_split_huge_page(page);
> +		deferred_split_huge_page(page, nr);
>  	}
>  }
>
> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>  		clear_page_mlock(page);
>
>  	if (PageTransCompound(page))
> -		deferred_split_huge_page(compound_head(page));
> +		deferred_split_huge_page(compound_head(page), 1);
>
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> --
> 1.8.3.1
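The accounting discipline the patch establishes can be modeled outside the kernel. The sketch below is a simplified, hypothetical userspace model, not kernel code: a plain struct stands in for the `page[2]` tail-page fields, a file-scope counter stands in for the NR_KERNEL_MISC_RECLAIMABLE node stat, and locking is omitted. It shows the invariant the patch maintains: every `deferred_split_huge_page(page, nr)` call charges `nr`, and both split and free uncharge the full recorded amount exactly once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the struct page tail fields and the node stat. */
struct thp {
	unsigned long nr_freeable;	/* models page[2].nr_freeable */
	bool on_deferred_list;		/* models !list_empty(page_deferred_list(page)) */
};

static long nr_kernel_misc_reclaimable;	/* models the NR_KERNEL_MISC_RECLAIMABLE counter */

/* models prep_transhuge_page(): a fresh THP has nothing freeable recorded */
static void thp_prep(struct thp *t)
{
	t->nr_freeable = 0;
	t->on_deferred_list = false;
}

/* models deferred_split_huge_page(page, nr): charge on every call, enqueue once */
static void thp_deferred_split(struct thp *t, unsigned int nr)
{
	t->nr_freeable += nr;
	nr_kernel_misc_reclaimable += nr;
	if (!t->on_deferred_list)
		t->on_deferred_list = true;
}

/* models the uncharge shared by split_huge_page_to_list() and free_transhuge_page() */
static void thp_uncharge(struct thp *t)
{
	nr_kernel_misc_reclaimable -= t->nr_freeable;
	t->nr_freeable = 0;
	t->on_deferred_list = false;
}
```

Charging 511 base pages across several calls and then freeing the THP brings the counter back to zero, mirroring how the KReclaimable figure in the quoted /proc/meminfo output grows with unmapped tail pages and drops once the THP is split or freed.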
On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
>
> I would disagree with this statement. It is a rough estimate that tells
> you how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount might still result in
> direct-reclaim-induced stalls. I do realize that this is a simple metric
> that is attractive to use and works in many cases, though.

...

> IIUC, deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? It is not really an optimization to cache THPs
> for reuse or anything like that. What is the reason this is not done
> from a worker context? At least THPs which would be freed completely
> sound like good candidates for kworker tear-down, no?

Agreed that it's a good question. For Kirill :) Maybe with the kworker
approach we also wouldn't need the cgroup awareness?

>> Record the number of freeable normal pages of deferred split THPs in
>> the second tail page, and account it in KReclaimable. Although THP
>> allocations are not exactly "kernel allocations", once they are
>> unmapped they are in fact kernel-only. KReclaimable is already
>> accounted into MemAvailable.
>
> This sounds reasonable to me.

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...

>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>  		ds_queue->split_queue_len--;
>>  		list_del(page_deferred_list(page));
>>  	}
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      -page[2].nr_freeable);
>> +	page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page
to the deferred list? So here, the code would be in the
if (!list_empty()) { } part above.

>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	free_compound_page(page);
>>  }
>>
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>  		return;
>>
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +	page[2].nr_freeable += nr;
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      nr);

Same here, only do this when adding to the list, below? Or we might
perhaps account base pages multiple times?
On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> On 8/22/19 10:04 AM, Michal Hocko wrote:
> > IIUC, deferred splitting is mostly a workaround for nasty locking
> > issues during splitting, right? It is not really an optimization to
> > cache THPs for reuse or anything like that. What is the reason this is
> > not done from a worker context? At least THPs which would be freed
> > completely sound like good candidates for kworker tear-down, no?
>
> Agreed that it's a good question. For Kirill :) Maybe with the kworker
> approach we also wouldn't need the cgroup awareness?

I don't remember a particular locking issue, but I cannot say there's
none :P

It's an artifact of decoupling PMD split from compound page split: the
same page can be mapped multiple times with a combination of PMDs and
PTEs. A split of one PMD doesn't need to trigger a split of all the
other PMDs and of the underlying compound page.

Another consideration is the fact that a page split can fail, and we
need to have a fallback for that case.

Also, in most cases a THP split would just be a waste of time if we did
it on the spot. If you don't have memory pressure, it's better to wait
until process termination: fewer pages on the LRU is still beneficial.

The main source of partly mapped THPs is the exit path. When the PMD
mapping of a THP got split across multiple VMAs (for instance due to
mprotect()), in the exit path we unmap the PTEs belonging to one VMA
just before unmapping the rest of the page. It would be a total waste of
time to split the page in this scenario.

The whole deferred split thing still looks like a reasonable compromise
to me. We could have some kind of watermark and try to keep the number
of deferred split THPs under it. But that comes with its own set of
problems: what if all these pages are pinned for a really long time and
effectively not available for split?
On 8/22/19 1:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
>
> I would disagree with this statement. It is a rough estimate that tells
> you how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount might still result in
> direct-reclaim-induced stalls. I do realize that this is a simple metric
> that is attractive to use and works in many cases, though.

OK, I would rephrase this a little bit, say "useful metric".

>> And they seem very common with common workloads when THP is enabled. A
>> simple run of the MariaDB test from mmtests with THP set to "always"
>> shows it could generate over fifteen thousand deferred split THPs
>> (accumulating around 30G in a one-hour run, 75% of the 40G of memory in
>> my VM). It looks worth accounting for in MemAvailable.
>
> OK, this makes sense. But your numbers above are really worrying.
> Accumulating such a large amount of pages that are likely not going to
> be used is really bad. They essentially block any higher-order
> allocations and also push the system towards more memory pressure.

That is an accumulated number; during the run of the test, some of them
were freed by the shrinker already. IOW, it should not reach that much
at any given time.

> IIUC, deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? It is not really an optimization to cache THPs
> for reuse or anything like that. What is the reason this is not done
> from a worker context? At least THPs which would be freed completely
> sound like good candidates for kworker tear-down, no?

Yes, deferred split THP was introduced to avoid locking issues,
according to the documentation. Memcg awareness would help to trigger
the shrinker more often.

I think it could be done in a worker context, but when to trigger the
worker is a subtle problem.

> Other than the above concern, which is a little bit orthogonal, the
> patch looks reasonable to me. I might be missing subtle THPisms, so I am
> not going to ack, though.

...
On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote:
> >> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
> >>  		ds_queue->split_queue_len--;
> >>  		list_del(page_deferred_list(page));
> >>  	}
> >> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> >> +			      -page[2].nr_freeable);
> >> +	page[2].nr_freeable = 0;
>
> Wouldn't it be safer to fully tie the nr_freeable use to adding the page
> to the deferred list? So here, the code would be in the
> if (!list_empty()) { } part above.
>
> >> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
> >>  		return;
> >>
> >>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >> +	page[2].nr_freeable += nr;
> >> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> >> +			      nr);
>
> Same here, only do this when adding to the list, below? Or we might
> perhaps account base pages multiple times?

No, it cannot be under the list_empty() check. Consider the case when a
THP gets unmapped 4k at a time. You need to put it on the list once, but
account it every time.
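Kirill's point can be checked with a tiny simulation. The sketch below is a hypothetical userspace model, not kernel code: unmapping a 2M THP one 4k base page at a time makes 512 calls into the deferred-split path, but only the first call finds the list empty. Gating the accounting under the list_empty() check, as Vlastimil suggested, would therefore record one base page instead of 512.

```c
#include <stdbool.h>

/* Illustrative model of deferred_split_huge_page() call patterns:
 * compare accounting on every call (the patch's placement) with
 * accounting only when the THP is first enqueued (under list_empty()).
 */
struct sim_result {
	int enqueues;			/* how many times the THP was queued */
	long charged_every_call;	/* accounting done before the list_empty() check */
	long charged_on_enqueue;	/* accounting gated by the list_empty() check */
};

static struct sim_result simulate_partial_unmap(int base_pages)
{
	struct sim_result r = { 0, 0, 0 };
	bool on_list = false;
	int i;

	/* one deferred_split call per unmapped 4k base page */
	for (i = 0; i < base_pages; i++) {
		r.charged_every_call += 1;	/* charge on every call */
		if (!on_list) {			/* but enqueue only once */
			r.charged_on_enqueue += 1;
			r.enqueues++;
			on_list = true;
		}
	}
	return r;
}
```

For a fully unmapped 2M THP (512 base pages), the model queues the page exactly once while the per-call placement records all 512 freeable pages; the gated placement would record just 1, which is the undercount Kirill describes.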
On 8/22/19 5:56 AM, Vlastimil Babka wrote:
> On 8/22/19 10:04 AM, Michal Hocko wrote:
>> IIUC, deferred splitting is mostly a workaround for nasty locking
>> issues during splitting, right? It is not really an optimization to
>> cache THPs for reuse or anything like that. What is the reason this is
>> not done from a worker context? At least THPs which would be freed
>> completely sound like good candidates for kworker tear-down, no?
>
> Agreed that it's a good question. For Kirill :) Maybe with the kworker
> approach we also wouldn't need the cgroup awareness?

...

>>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>>  		ds_queue->split_queue_len--;
>>>  		list_del(page_deferred_list(page));
>>>  	}
>>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>>> +			      -page[2].nr_freeable);
>>> +	page[2].nr_freeable = 0;
>
> Wouldn't it be safer to fully tie the nr_freeable use to adding the page
> to the deferred list? So here, the code would be in the
> if (!list_empty()) { } part above.

IMHO, having it outside (!list_empty()) sounds safer actually, since
we'd better decrement nr_freeable unconditionally in the free path.

>>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>>  		return;
>>>
>>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> +	page[2].nr_freeable += nr;
>>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>>> +			      nr);
>
> Same here, only do this when adding to the list, below? Or we might
> perhaps account base pages multiple times?

page_remove_rmap() might be called for a whole THP (i.e. from
do_huge_pmd_wp_page()) or for sub-pages of a THP (i.e. from
madvise(MADV_DONTNEED) -> zap_pte_range()). When it is called for
sub-pages, the THP is added to the deferred split queue the first time;
the next time (when the other sub-pages are unmapped), since the THP is
already on the deferred split queue, list_empty() returns false, so
gating the accounting there would account just one sub-page even though
most of the sub-pages are unmapped.

I don't think they would be accounted multiple times, since
deferred_split_huge_page() is only called when there is no mapping left
at all. Unmapping an unmapped page would be a bug.
On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote: > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote: > > On 8/22/19 10:04 AM, Michal Hocko wrote: > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > >> Available memory is one of the most important metrics for memory > > >> pressure. > > > > > > I would disagree with this statement. It is a rough estimate that tells > > > how much memory you can allocate before going into a more expensive > > > reclaim (mostly swapping). Allocating that amount still might result in > > > direct reclaim induced stalls. I do realize that this is simple metric > > > that is attractive to use and works in many cases though. > > > > > >> Currently, the deferred split THPs are not accounted into > > >> available memory, but they are reclaimable actually, like reclaimable > > >> slabs. > > >> > > >> And, they seems very common with the common workloads when THP is > > >> enabled. A simple run with MariaDB test of mmtest with THP enabled as > > >> always shows it could generate over fifteen thousand deferred split THPs > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > >> It looks worth accounting in MemAvailable. > > > > > > OK, this makes sense. But your above numbers are really worrying. > > > Accumulating such a large amount of pages that are likely not going to > > > be used is really bad. They are essentially blocking any higher order > > > allocations and also push the system towards more memory pressure. > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > during splitting, right? This is not really an optimization to cache > > > THPs for reuse or something like that. What is the reason this is not > > > done from a worker context? At least THPs which would be freed > > > completely sound like a good candidate for kworker tear down, no? > > > > Agreed that it's a good question. 
For Kirill :) Maybe with kworker approach we > > also wouldn't need the cgroup awareness? > > I don't remember a particular locking issue, but I cannot say there's > none :P > > It's artifact from decoupling PMD split from compound page split: the same > page can be mapped multiple times with combination of PMDs and PTEs. Split > of one PMD doesn't need to trigger split of all PMDs and underlying > compound page. > > Other consideration is the fact that page split can fail and we need to > have fallback for this case. > > Also in most cases THP split would be just waste of time if we would do > them at the spot. If you don't have memory pressure it's better to wait > until process termination: less pages on LRU is still beneficial. This might be true but the reality shows that a lot of THPs that are essentially freeable on the spot might be waiting for the memory pressure. So I am not really convinced that "less pages on LRUs" is really a plausible justification. Can we free at least those THPs which are unmapped completely without any pte mappings? > Main source of partly mapped THPs comes from exit path. When PMD mapping > of THP got split across multiple VMAs (for instance due to mprotect()), > in exit path we unmap PTEs belonging to one VMA just before unmapping the > rest of the page. It would be total waste of time to split the page in > this scenario. > > The whole deferred split thing still looks as a reasonable compromise > to me. Even when it leads to all the other problems mentioned in this and the memcg deferred reclaim series? > We may have some kind of watermark and try to keep the number of deferred > split THP under it. But it comes with own set of problems: what if all > these pages are pinned for really long time and effectively not available > for split. Again, why cannot we simply push the freeing where there are no other mappings? This should be a pretty common case, right? I am still not sure that waiting for the memory reclaim is a general win.
Do you have any examples of workloads that measurably benefit from this lazy approach without any other downsides? In other words, how exactly do we measure the cost/benefit model of this heuristic?
On Thu 22-08-19 08:33:40, Yang Shi wrote: > > > On 8/22/19 1:04 AM, Michal Hocko wrote: > > On Thu 22-08-19 01:55:25, Yang Shi wrote: [...] > > > And, they seems very common with the common workloads when THP is > > > enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > always shows it could generate over fifteen thousand deferred split THPs > > > (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > It looks worth accounting in MemAvailable. > > OK, this makes sense. But your above numbers are really worrying. > > Accumulating such a large amount of pages that are likely not going to > > be used is really bad. They are essentially blocking any higher order > > allocations and also push the system towards more memory pressure. > > That is accumulated number, during the running of the test, some of them > were freed by shrinker already. IOW, it should not reach that much at any > given time. Then the above description is highly misleading. What is the actual number of lingering THPs that wait for the memory pressure in the peak? > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > during splitting, right? This is not really an optimization to cache > > THPs for reuse or something like that. What is the reason this is not > > done from a worker context? At least THPs which would be freed > > completely sound like a good candidate for kworker tear down, no? > > Yes, deferred split THP was introduced to avoid locking issues according to > the document. Memcg awareness would help to trigger the shrinker more often. > > I think it could be done in a worker context, but when to trigger to worker > is a subtle problem. Why? What is the problem to trigger it after unmap of a batch worth of THPs?
On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote: > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote: > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote: > > > On 8/22/19 10:04 AM, Michal Hocko wrote: > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > > >> Available memory is one of the most important metrics for memory > > > >> pressure. > > > > > > > > I would disagree with this statement. It is a rough estimate that tells > > > > how much memory you can allocate before going into a more expensive > > > > reclaim (mostly swapping). Allocating that amount still might result in > > > > direct reclaim induced stalls. I do realize that this is simple metric > > > > that is attractive to use and works in many cases though. > > > > > > > >> Currently, the deferred split THPs are not accounted into > > > >> available memory, but they are reclaimable actually, like reclaimable > > > >> slabs. > > > >> > > > >> And, they seems very common with the common workloads when THP is > > > >> enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > >> always shows it could generate over fifteen thousand deferred split THPs > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > >> It looks worth accounting in MemAvailable. > > > > > > > > OK, this makes sense. But your above numbers are really worrying. > > > > Accumulating such a large amount of pages that are likely not going to > > > > be used is really bad. They are essentially blocking any higher order > > > > allocations and also push the system towards more memory pressure. > > > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > during splitting, right? This is not really an optimization to cache > > > > THPs for reuse or something like that. What is the reason this is not > > > > done from a worker context? 
At least THPs which would be freed > > > > completely sound like a good candidate for kworker tear down, no? > > > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker approach we > > > also wouldn't need the cgroup awareness? > > > > I don't remember a particular locking issue, but I cannot say there's > > none :P > > > > It's artifact from decoupling PMD split from compound page split: the same > > page can be mapped multiple times with combination of PMDs and PTEs. Split > > of one PMD doesn't need to trigger split of all PMDs and underlying > > compound page. > > > > Other consideration is the fact that page split can fail and we need to > > have fallback for this case. > > > > Also in most cases THP split would be just waste of time if we would do > > them at the spot. If you don't have memory pressure it's better to wait > > until process termination: less pages on LRU is still beneficial. > > This might be true but the reality shows that a lot of THPs that are > essentially freeable on the spot might be waiting for the memory pressure. > So I am not really convinced that "less pages on LRUs" is really a > plausible justification. Can we free at least those THPs which are > unmapped completely without any pte mappings? Unmapped completely pages will be freed with current code. Deferred split only applies to partly mapped THPs: at least one 4k page of the THP is still mapped somewhere. > > Main source of partly mapped THPs comes from exit path.
You would still need deferred split even if you *try* to split the page on the spot. split_huge_page() can fail (due to a pin on the page) and you will need to have a way to try again later. You'll not win anything in complexity by trying split_huge_page() immediately. I would argue you'll create much more complexity. > > We may have some kind of watermark and try to keep the number of deferred > > split THP under it. But it comes with own set of problems: what if all > > these pages are pinned for really long time and effectively not available > > for split. > > Again, why cannot we simply push the freeing where there are no other > mappings? This should be a pretty common case, right? Partly mapped THP is not a common case at all. To get to this point you will need to create a mapping, fault in THP and then unmap part of it. It requires very active memory management on application side. This kind of application usually knows if THP is a fit for it. > I am still not sure that waiting for the memory reclaim is a general > win. It wins CPU cycles by not doing the work that is likely unneeded. split_huge_page() is not a particularly lightweight operation from locking and atomic ops POV. > Do you have any examples of workloads that measurably benefit from > this lazy approach without any other downsides? In other words, how > exactly do we measure the cost/benefit model of this heuristic? Example? Sure. Compiling mm/memory.c in my setup generates 8 deferred splits. 4 of them are triggered from the exit path. The remaining 4 come from MADV_DONTNEED. It doesn't make sense to convert any of them to in-place split: for a short-lived process any split is a waste of time without any benefit.
On 8/26/19 12:43 AM, Michal Hocko wrote: > On Thu 22-08-19 08:33:40, Yang Shi wrote: >> >> On 8/22/19 1:04 AM, Michal Hocko wrote: >>> On Thu 22-08-19 01:55:25, Yang Shi wrote: > [...] >>>> And, they seems very common with the common workloads when THP is >>>> enabled. A simple run with MariaDB test of mmtest with THP enabled as >>>> always shows it could generate over fifteen thousand deferred split THPs >>>> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). >>>> It looks worth accounting in MemAvailable. >>> OK, this makes sense. But your above numbers are really worrying. >>> Accumulating such a large amount of pages that are likely not going to >>> be used is really bad. They are essentially blocking any higher order >>> allocations and also push the system towards more memory pressure. >> That is accumulated number, during the running of the test, some of them >> were freed by shrinker already. IOW, it should not reach that much at any >> given time. > Then the above description is highly misleading. What is the actual > number of lingering THPs that wait for the memory pressure in the peak? By rerunning the sysbench mariadb test of mmtest, I didn't see too many THPs in the peak. I saw around 2K THPs sometimes on my VM with 40G memory, but they were short-lived (they should be freed when the test exits). And the number of accumulated THPs is variable. This reminded me to go back and double check our internal bug report which led to the "make deferred split shrinker memcg aware" patchset. In that case, a mysql instance with real production load was running in a memcg with an ~86G limit; the number of deferred split THPs could reach ~68G (~34K deferred split THPs) in a few hours. The deferred split THP shrinker was not invoked since global memory pressure was still fine (the host has 256G memory), but memcg limit reclaim was triggered.
And, I can't tell if all those deferred split THPs came from mysql or not since there were some other processes running in that container too according to the oom log. I will update the commit log with more solid data from the production environment. > >>> IIUC deferred splitting is mostly a workaround for nasty locking issues >>> during splitting, right? This is not really an optimization to cache >>> THPs for reuse or something like that. What is the reason this is not >>> done from a worker context? At least THPs which would be freed >>> completely sound like a good candidate for kworker tear down, no? >> Yes, deferred split THP was introduced to avoid locking issues according to >> the document. Memcg awareness would help to trigger the shrinker more often. >> >> I think it could be done in a worker context, but when to trigger to worker >> is a subtle problem. > Why? What is the problem to trigger it after unmap of a batch worth of > THPs? This leads to another question, how many THPs are "a batch worth"? And, they may be short-lived as shown by Kirill's example; we can't tell in advance how long the THPs' lifetime is. We may waste CPU cycles doing something unneeded.
On Mon 26-08-19 21:27:38, Yang Shi wrote: > > > On 8/26/19 12:43 AM, Michal Hocko wrote: > > On Thu 22-08-19 08:33:40, Yang Shi wrote: > > > > > > On 8/22/19 1:04 AM, Michal Hocko wrote: > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > [...] > > > > > And, they seems very common with the common workloads when THP is > > > > > enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > > > always shows it could generate over fifteen thousand deferred split THPs > > > > > (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > > > It looks worth accounting in MemAvailable. > > > > OK, this makes sense. But your above numbers are really worrying. > > > > Accumulating such a large amount of pages that are likely not going to > > > > be used is really bad. They are essentially blocking any higher order > > > > allocations and also push the system towards more memory pressure. > > > That is accumulated number, during the running of the test, some of them > > > were freed by shrinker already. IOW, it should not reach that much at any > > > given time. > > Then the above description is highly misleading. What is the actual > > number of lingering THPs that wait for the memory pressure in the peak? > > By rerunning sysbench mariadb test of mmtest, I didn't see too many THPs in > the peak. I saw around 2K THPs sometimes on my VM with 40G memory. But they > were short-lived (should be freed when the test exit). And, the number of > accumulated THPs are variable. > > And, this reminded me to go back double check our internal bug report which > lead to the "make deferred split shrinker memcg aware" patchset. > > In that case, a mysql instance with real production load was running in a > memcg with ~86G limit, the number of deferred split THPs may reach to ~68G > (~34K deferred split THPs) in a few hours. 
The deferred split THP shrinker > was not invoked since global memory pressure was still fine (the host > has 256G memory), but memcg limit reclaim was triggered. > > And, I can't tell if all those deferred split THPs came from mysql or not > since there were some other processes running in that container too according to > the oom log. > > I will update the commit log with more solid data from the production > environment. This is very useful information. Thanks! > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > during splitting, right? This is not really an optimization to cache > > > > THPs for reuse or something like that. What is the reason this is not > > > > done from a worker context? At least THPs which would be freed > > > > completely sound like a good candidate for kworker tear down, no? > > > Yes, deferred split THP was introduced to avoid locking issues according to > > > the document. Memcg awareness would help to trigger the shrinker more often. > > > > > > I think it could be done in a worker context, but when to trigger to worker > > > is a subtle problem. > > Why? What is the problem to trigger it after unmap of a batch worth of > > THPs? > > This leads to another question, how many THPs are "a batch worth"? Some arbitrary reasonable number. A few dozen THPs waiting for split are no big deal. Going into GBs as you pointed out above is definitely a problem.
On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote: > > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote: > > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote: > > > > On 8/22/19 10:04 AM, Michal Hocko wrote: > > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > > > >> Available memory is one of the most important metrics for memory > > > > >> pressure. > > > > > > > > > > I would disagree with this statement. It is a rough estimate that tells > > > > > how much memory you can allocate before going into a more expensive > > > > > reclaim (mostly swapping). Allocating that amount still might result in > > > > > direct reclaim induced stalls. I do realize that this is simple metric > > > > > that is attractive to use and works in many cases though. > > > > > > > > > >> Currently, the deferred split THPs are not accounted into > > > > >> available memory, but they are reclaimable actually, like reclaimable > > > > >> slabs. > > > > >> > > > > >> And, they seems very common with the common workloads when THP is > > > > >> enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > > >> always shows it could generate over fifteen thousand deferred split THPs > > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > > >> It looks worth accounting in MemAvailable. > > > > > > > > > > OK, this makes sense. But your above numbers are really worrying. > > > > > Accumulating such a large amount of pages that are likely not going to > > > > > be used is really bad. They are essentially blocking any higher order > > > > > allocations and also push the system towards more memory pressure. > > > > > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > > during splitting, right? This is not really an optimization to cache > > > > > THPs for reuse or something like that. 
What is the reason this is not > > > > > done from a worker context? At least THPs which would be freed > > > > > completely sound like a good candidate for kworker tear down, no? > > > > > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker approach we > > > > also wouldn't need the cgroup awareness? > > > > > > > > I don't remember a particular locking issue, but I cannot say there's > > > > none :P > > > > > > > > It's artifact from decoupling PMD split from compound page split: the same > > > > page can be mapped multiple times with combination of PMDs and PTEs. Split > > > > of one PMD doesn't need to trigger split of all PMDs and underlying > > > > compound page. > > > > > > > > Other consideration is the fact that page split can fail and we need to > > > > have fallback for this case. > > > > > > > > Also in most cases THP split would be just waste of time if we would do > > > > them at the spot. If you don't have memory pressure it's better to wait > > > > until process termination: less pages on LRU is still beneficial. > > > > > > This might be true but the reality shows that a lot of THPs that are > > > essentially freeable on the spot might be waiting for the memory > > > pressure. So I am not really convinced that "less pages on LRUs" is really a > > > plausible justification. Can we free at least those THPs which are > > > unmapped completely without any pte mappings? > > > > Unmapped completely pages will be freed with current code. Deferred split > > only applies to partly mapped THPs: at least one 4k page of the THP is still > > mapped somewhere. Hmm, I am probably misreading the code, but at least the current Linus' tree reads page_remove_rmap -> page_remove_anon_compound_rmap -> deferred_split_huge_page even for fully mapped THPs. > > > Main source of partly mapped THPs comes from exit path.
When PMD mapping > > > of THP got split across multiple VMAs (for instance due to mprotect()), > > > in exit path we unmap PTEs belonging to one VMA just before unmapping the > > > rest of the page. It would be total waste of time to split the page in > > > this scenario. > > > > > > The whole deferred split thing still looks as a reasonable compromise > > > to me. > > > > Even when it leads to all the other problems mentioned in this and the memcg > > deferred reclaim series? > > Yes. > > You would still need deferred split even if you *try* to split the page on > the spot. split_huge_page() can fail (due to a pin on the page) and you will > need to have a way to try again later. > > You'll not win anything in complexity by trying split_huge_page() > immediately. I would argue you'll create much more complexity. I am not arguing for an in-place split. I am arguing to do it ASAP rather than to wait for memory pressure, which might come after an unbounded amount of time. So let me ask again. Why cannot we do that in the worker context? Essentially schedule the work item right away? > > > We may have some kind of watermark and try to keep the number of deferred > > > split THP under it. But it comes with own set of problems: what if all > > > these pages are pinned for really long time and effectively not available > > > for split. > > > > Again, why cannot we simply push the freeing where there are no other > > mappings? This should be a pretty common case, right? > > Partly mapped THP is not a common case at all. > > To get to this point you will need to create a mapping, fault in THP and > then unmap part of it. It requires very active memory management on > application side. This kind of application usually knows if THP is a fit > for it. See the other email by Yang Shi for practical examples. > > I am still not sure that waiting for the memory reclaim is a general > > win. > > It wins CPU cycles by not doing the work that is likely unneeded.
> split_huge_page() is not a particularly lightweight operation from locking > and atomic ops POV. > > > Do you have any examples of workloads that measurably benefit from > > this lazy approach without any other downsides? In other words, how > > exactly do we measure the cost/benefit model of this heuristic? > > Example? Sure. > > Compiling mm/memory.c in my setup generates 8 deferred splits. 4 of them > are triggered from the exit path. The remaining 4 come from MADV_DONTNEED. It doesn't > make sense to convert any of them to in-place split: for a short-lived > process any split is a waste of time without any benefit. Right, I understand that part. And again, I am not arguing for an in-place split. All I care about is _when_ to trigger the "cleanup", aka when we free the memory or split the THP depending on its state. I argue that waiting for memory pressure is too late, and the examples mentioned elsewhere in the thread confirm that.
On Tue, Aug 27, 2019 at 07:59:41AM +0200, Michal Hocko wrote: > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > > during splitting, right? This is not really an optimization to cache > > > > > THPs for reuse or something like that. What is the reason this is not > > > > > done from a worker context? At least THPs which would be freed > > > > > completely sound like a good candidate for kworker tear down, no? > > > > Yes, deferred split THP was introduced to avoid locking issues according to > > > > the document. Memcg awareness would help to trigger the shrinker more often. > > > > > > > > I think it could be done in a worker context, but when to trigger to worker > > > > is a subtle problem. > > > Why? What is the problem to trigger it after unmap of a batch worth of > > > THPs? > > > > This leads to another question, how many THPs are "a batch worth"? > > Some arbitrary reasonable number. A few dozen THPs waiting for split > are no big deal. Going into GBs as you pointed out above is definitely a > problem. This will not work if these GBs worth of THPs are pinned (like with RDMA). We can kick the deferred split every N calls of deferred_split_huge_page() if more than M pages are queued or something. Do we want to kick it again after some time if a split from the deferred queue has failed? The check whether the page is splittable is not exactly free, so everything has trade-offs.
On Tue 27-08-19 11:32:15, Kirill A. Shutemov wrote: > On Tue, Aug 27, 2019 at 07:59:41AM +0200, Michal Hocko wrote: > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > > > during splitting, right? This is not really an optimization to cache > > > > > > THPs for reuse or something like that. What is the reason this is not > > > > > > done from a worker context? At least THPs which would be freed > > > > > > completely sound like a good candidate for kworker tear down, no? > > > > > Yes, deferred split THP was introduced to avoid locking issues according to > > > > > the document. Memcg awareness would help to trigger the shrinker more often. > > > > > > > > > > I think it could be done in a worker context, but when to trigger to worker > > > > > is a subtle problem. > > > > Why? What is the problem to trigger it after unmap of a batch worth of > > > > THPs? > > > > > > This leads to another question, how many THPs are "a batch worth"? > > > > Some arbitrary reasonable number. A few dozen THPs waiting for split > > are no big deal. Going into GBs as you pointed out above is definitely a > > problem. > > This will not work if these GBs worth of THPs are pinned (like with > RDMA). Yes, but this is the case we cannot do anything about in any deferred scheme unless we hook into the unpinning call path. We might get there eventually with the newly forming API. > We can kick the deferred split every N calls of deferred_split_huge_page() > if more than M pages are queued or something. Yes, that sounds reasonable to me. N can be a few dozen THPs. An explicit flush API after unmap is done would be helpful as well. > Do we want to kick it again after some time if a split from the deferred queue > has failed? I wouldn't mind having the reclaim path do the fallback and see how that
On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote: > > > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote: > > > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote: > > > > > On 8/22/19 10:04 AM, Michal Hocko wrote: > > > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > > > > >> Available memory is one of the most important metrics for memory > > > > > >> pressure. > > > > > > > > > > > > I would disagree with this statement. It is a rough estimate that tells > > > > > > how much memory you can allocate before going into a more expensive > > > > > > reclaim (mostly swapping). Allocating that amount still might result in > > > > > > direct reclaim induced stalls. I do realize that this is simple metric > > > > > > that is attractive to use and works in many cases though. > > > > > > > > > > > >> Currently, the deferred split THPs are not accounted into > > > > > >> available memory, but they are reclaimable actually, like reclaimable > > > > > >> slabs. > > > > > >> > > > > > >> And, they seems very common with the common workloads when THP is > > > > > >> enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > > > >> always shows it could generate over fifteen thousand deferred split THPs > > > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > > > >> It looks worth accounting in MemAvailable. > > > > > > > > > > > > OK, this makes sense. But your above numbers are really worrying. > > > > > > Accumulating such a large amount of pages that are likely not going to > > > > > > be used is really bad. They are essentially blocking any higher order > > > > > > allocations and also push the system towards more memory pressure. > > > > > > > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > > > during splitting, right? 
This is not really an optimization to cache > > > > > > THPs for reuse or something like that. What is the reason this is not > > > > > > done from a worker context? At least THPs which would be freed > > > > > > completely sound like a good candidate for kworker tear down, no? > > > > > > > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker approach we > > > > > also wouldn't need the cgroup awareness? > > > > > > > > I don't remember a particular locking issue, but I cannot say there's > > > > none :P > > > > > > > > It's artifact from decoupling PMD split from compound page split: the same > > > > page can be mapped multiple times with combination of PMDs and PTEs. Split > > > > of one PMD doesn't need to trigger split of all PMDs and underlying > > > > compound page. > > > > > > > > Other consideration is the fact that page split can fail and we need to > > > > have fallback for this case. > > > > > > > > Also in most cases THP split would be just waste of time if we would do > > > > them at the spot. If you don't have memory pressure it's better to wait > > > > until process termination: less pages on LRU is still beneficial. > > > > > > This might be true but the reality shows that a lot of THPs might be > > > waiting for the memory pressure that is essentially freeable on the > > > spot. So I am not really convinced that "less pages on LRUs" is really a > > > plausible justification. Can we free at least those THPs which are > > > unmapped completely without any pte mappings? > > > > Unmapped completely pages will be freed with current code. Deferred split > > only applies to partly mapped THPs: at least on 4k of the THP is still > > mapped somewhere. > > Hmm, I am probably misreading the code but at least current Linus' tree > reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even > for fully mapped THP. Well, you read correctly, but it was not intended. I screwed it up at some point. 
See the patch below. It should make it work as intended. It's not a bug as such, but an inefficiency. We add the page to the queue where it's not needed. > > > > Main source of partly mapped THPs comes from exit path. When PMD mapping > > > > of THP got split across multiple VMAs (for instance due to mprotect()), > > > > in exit path we unmap PTEs belonging to one VMA just before unmapping the > > > > rest of the page. It would be total waste of time to split the page in > > > > this scenario. > > > > The whole deferred split thing still looks as a reasonable compromise > > > > to me. > > > Even when it leads to all other problems mentioned in this and memcg > > > deferred reclaim series? > > Yes. > > You would still need deferred split even if you *try* to split the page on > > the spot. split_huge_page() can fail (due to pin on the page) and you will > > need to have a way to try again later. > > You'll not win anything in complexity by trying split_huge_page() > > immediately. I would argue you'll create much more complexity. > I am not arguing for in place split. I am arguing to do it ASAP rather > than to wait for memory pressure which might be in an unbound amount of > time. So let me ask again. Why cannot we do that in the worker context? > Essentially schedule the work item right away? Let me look into it.

diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..45388f1bf317 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1271,12 +1271,20 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (TestClearPageDoubleMap(page)) {
 		/*
 		 * Subpages can be mapped with PTEs too. Check how many of
-		 * themi are still mapped.
+		 * them are still mapped.
 		 */
 		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
+
+		/*
+		 * Queue the page for deferred split if at least one small
+		 * page of the compound page is unmapped, but at least one
+		 * small page is still mapped.
+		 */
+		if (nr && nr < HPAGE_PMD_NR)
+			deferred_split_huge_page(page);
 	} else {
 		nr = HPAGE_PMD_NR;
 	}
@@ -1284,10 +1292,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
-	if (nr) {
+	if (nr)
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
-		deferred_split_huge_page(page);
-	}
 }
 
 /**
On Tue 27-08-19 14:02:10, Kirill A. Shutemov wrote: > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > > On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote: > > > > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote: > > > > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote: > > > > > > On 8/22/19 10:04 AM, Michal Hocko wrote: > > > > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > > > > > >> Available memory is one of the most important metrics for memory > > > > > > >> pressure. > > > > > > > > > > > > > > I would disagree with this statement. It is a rough estimate that tells > > > > > > > how much memory you can allocate before going into a more expensive > > > > > > > reclaim (mostly swapping). Allocating that amount still might result in > > > > > > > direct reclaim induced stalls. I do realize that this is simple metric > > > > > > > that is attractive to use and works in many cases though. > > > > > > > > > > > > > >> Currently, the deferred split THPs are not accounted into > > > > > > >> available memory, but they are reclaimable actually, like reclaimable > > > > > > >> slabs. > > > > > > >> > > > > > > >> And, they seems very common with the common workloads when THP is > > > > > > >> enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > > > > >> always shows it could generate over fifteen thousand deferred split THPs > > > > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > > > > >> It looks worth accounting in MemAvailable. > > > > > > > > > > > > > > OK, this makes sense. But your above numbers are really worrying. > > > > > > > Accumulating such a large amount of pages that are likely not going to > > > > > > > be used is really bad. They are essentially blocking any higher order > > > > > > > allocations and also push the system towards more memory pressure. 
> > > > > > > > > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > > > > during splitting, right? This is not really an optimization to cache > > > > > > > THPs for reuse or something like that. What is the reason this is not > > > > > > > done from a worker context? At least THPs which would be freed > > > > > > > completely sound like a good candidate for kworker tear down, no? > > > > > > > > > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker approach we > > > > > > also wouldn't need the cgroup awareness? > > > > > > > > > > I don't remember a particular locking issue, but I cannot say there's > > > > > none :P > > > > > > > > > > It's artifact from decoupling PMD split from compound page split: the same > > > > > page can be mapped multiple times with combination of PMDs and PTEs. Split > > > > > of one PMD doesn't need to trigger split of all PMDs and underlying > > > > > compound page. > > > > > > > > > > Other consideration is the fact that page split can fail and we need to > > > > > have fallback for this case. > > > > > > > > > > Also in most cases THP split would be just waste of time if we would do > > > > > them at the spot. If you don't have memory pressure it's better to wait > > > > > until process termination: less pages on LRU is still beneficial. > > > > > > > > This might be true but the reality shows that a lot of THPs might be > > > > waiting for the memory pressure that is essentially freeable on the > > > > spot. So I am not really convinced that "less pages on LRUs" is really a > > > > plausible justification. Can we free at least those THPs which are > > > > unmapped completely without any pte mappings? > > > > > > Unmapped completely pages will be freed with current code. Deferred split > > > only applies to partly mapped THPs: at least on 4k of the THP is still > > > mapped somewhere. 
> > > > Hmm, I am probably misreading the code but at least current Linus' tree > > reads page_remove_rmap -> page_remove_anon_compound_rmap -> deferred_split_huge_page even > > for fully mapped THP. > > Well, you read correctly, but it was not intended. I screwed it up at some > point. > > See the patch below. It should make it work as intended. OK, this would indeed be much better. I was really under the impression that the deferred splitting is required due to locking. Anyway, this should take care of the most common usecase. If we can make the odd cases of partially mapped THPs be handled deferred & earlier, then maybe we do not really need the whole memcg deferred shrinker and other complications. So let's see.
On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: >>> >>> Unmapped completely pages will be freed with current code. Deferred split >>> only applies to partly mapped THPs: at least one 4k page of the THP is still >>> mapped somewhere. >> >> Hmm, I am probably misreading the code but at least current Linus' tree >> reads page_remove_rmap -> page_remove_anon_compound_rmap -> deferred_split_huge_page even >> for fully mapped THP. > > Well, you read correctly, but it was not intended. I screwed it up at some > point. > > See the patch below. It should make it work as intended. > > It's not a bug as such, but an inefficiency. We add the page to the queue where > it's not needed. But adding to the queue doesn't affect whether the page will be freed immediately if there are no more partial mappings, right? I don't see deferred_split_huge_page() pinning the page. So your patch wouldn't make THPs freed immediately in cases where they weren't freed immediately before; it just fixes a minor inefficiency with queue manipulation?
On Tue 27-08-19 14:01:56, Vlastimil Babka wrote: > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > >>> > >>> Unmapped completely pages will be freed with current code. Deferred split > >>> only applies to partly mapped THPs: at least on 4k of the THP is still > >>> mapped somewhere. > >> > >> Hmm, I am probably misreading the code but at least current Linus' tree > >> reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even > >> for fully mapped THP. > > > > Well, you read correctly, but it was not intended. I screwed it up at some > > point. > > > > See the patch below. It should make it work as intened. > > > > It's not bug as such, but inefficientcy. We add page to the queue where > > it's not needed. > > But that adding to queue doesn't affect whether the page will be freed > immediately if there are no more partial mappings, right? I don't see > deferred_split_huge_page() pinning the page. > So your patch wouldn't make THPs freed immediately in cases where they > haven't been freed before immediately, it just fixes a minor > inefficiency with queue manipulation? Ohh, right. I can see that in free_transhuge_page now. So fully mapped THPs really do not matter and what I have considered an odd case is really happening more often. That being said this will not help at all for what Yang Shi is seeing and we need a more proactive deferred splitting as I've mentioned earlier.
On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote: > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote: > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > > >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > >>> > > >>> Unmapped completely pages will be freed with current code. Deferred split > > >>> only applies to partly mapped THPs: at least on 4k of the THP is still > > >>> mapped somewhere. > > >> > > >> Hmm, I am probably misreading the code but at least current Linus' tree > > >> reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even > > >> for fully mapped THP. > > > > > > Well, you read correctly, but it was not intended. I screwed it up at some > > > point. > > > > > > See the patch below. It should make it work as intened. > > > > > > It's not bug as such, but inefficientcy. We add page to the queue where > > > it's not needed. > > > > But that adding to queue doesn't affect whether the page will be freed > > immediately if there are no more partial mappings, right? I don't see > > deferred_split_huge_page() pinning the page. > > So your patch wouldn't make THPs freed immediately in cases where they > > haven't been freed before immediately, it just fixes a minor > > inefficiency with queue manipulation? > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped > THPs really do not matter and what I have considered an odd case is > really happening more often. > > That being said this will not help at all for what Yang Shi is seeing > and we need a more proactive deferred splitting as I've mentioned > earlier. It was not intended to fix the issue. It's fix for current logic. I'm playing with the work approach now.
On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote: > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote: > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote: > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > > > >> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > > >>> > > > >>> Unmapped completely pages will be freed with current code. Deferred split > > > >>> only applies to partly mapped THPs: at least on 4k of the THP is still > > > >>> mapped somewhere. > > > >> > > > >> Hmm, I am probably misreading the code but at least current Linus' tree > > > >> reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even > > > >> for fully mapped THP. > > > > > > > > Well, you read correctly, but it was not intended. I screwed it up at some > > > > point. > > > > > > > > See the patch below. It should make it work as intened. > > > > > > > > It's not bug as such, but inefficientcy. We add page to the queue where > > > > it's not needed. > > > > > > But that adding to queue doesn't affect whether the page will be freed > > > immediately if there are no more partial mappings, right? I don't see > > > deferred_split_huge_page() pinning the page. > > > So your patch wouldn't make THPs freed immediately in cases where they > > > haven't been freed before immediately, it just fixes a minor > > > inefficiency with queue manipulation? > > > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped > > THPs really do not matter and what I have considered an odd case is > > really happening more often. > > > > That being said this will not help at all for what Yang Shi is seeing > > and we need a more proactive deferred splitting as I've mentioned > > earlier. > > It was not intended to fix the issue. It's fix for current logic. I'm > playing with the work approach now. Below is what I've come up with. 
It appears to be functional.

Any comments?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..c576e9d772b7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,6 +758,8 @@ typedef struct pglist_data {
 	spinlock_t split_queue_lock;
 	struct list_head split_queue;
 	unsigned long split_queue_len;
+	unsigned int deferred_split_calls;
+	struct work_struct deferred_split_work;
 #endif
 
 	/* Fields commonly accessed by the page reclaim scanner */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..12d109bbe8ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2820,22 +2820,6 @@ void free_transhuge_page(struct page *page)
 	free_compound_page(page);
 }
 
-void deferred_split_huge_page(struct page *page)
-{
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
-	unsigned long flags;
-
-	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-
-	spin_lock_irqsave(&pgdata->split_queue_lock, flags);
-	if (list_empty(page_deferred_list(page))) {
-		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
-		list_add_tail(page_deferred_list(page), &pgdata->split_queue);
-		pgdata->split_queue_len++;
-	}
-	spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
-}
-
 static unsigned long deferred_split_count(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -2901,6 +2885,44 @@ static struct shrinker deferred_split_shrinker = {
 	.flags = SHRINKER_NUMA_AWARE,
 };
 
+void flush_deferred_split_queue(struct work_struct *work)
+{
+	struct pglist_data *pgdata;
+	struct shrink_control sc;
+
+	pgdata = container_of(work, struct pglist_data, deferred_split_work);
+	sc.nid = pgdata->node_id;
+	sc.nr_to_scan = 0; /* Unlimited */
+
+	deferred_split_scan(NULL, &sc);
+}
+
+#define NR_CALLS_TO_SPLIT 32
+#define NR_PAGES_ON_QUEUE_TO_SPLIT 16
+
+void deferred_split_huge_page(struct page *page)
+{
+	struct pglist_data *pgdata = NODE_DATA(page_to_nid(page));
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+	spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+	if (list_empty(page_deferred_list(page))) {
+		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+		list_add_tail(page_deferred_list(page), &pgdata->split_queue);
+		pgdata->split_queue_len++;
+		pgdata->deferred_split_calls++;
+	}
+
+	if (pgdata->deferred_split_calls > NR_CALLS_TO_SPLIT &&
+	    pgdata->split_queue_len > NR_PAGES_ON_QUEUE_TO_SPLIT) {
+		pgdata->deferred_split_calls = 0;
+		schedule_work(&pgdata->deferred_split_work);
+	}
+	spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int split_huge_pages_set(void *data, u64 val)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c9194959271..86af66d463e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6636,11 +6636,14 @@ static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void flush_deferred_split_queue(struct work_struct *work);
 static void pgdat_init_split_queue(struct pglist_data *pgdat)
 {
 	spin_lock_init(&pgdat->split_queue_lock);
 	INIT_LIST_HEAD(&pgdat->split_queue);
 	pgdat->split_queue_len = 0;
+	pgdat->deferred_split_calls = 0;
+	INIT_WORK(&pgdat->deferred_split_work, flush_deferred_split_queue);
}
 #else
 static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
On 8/27/19 5:59 AM, Kirill A. Shutemov wrote: > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote: >> On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote: >>> On Tue 27-08-19 14:01:56, Vlastimil Babka wrote: >>>> On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: >>>>> On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: >>>>>> On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: >>>>>>> Unmapped completely pages will be freed with current code. Deferred split >>>>>>> only applies to partly mapped THPs: at least on 4k of the THP is still >>>>>>> mapped somewhere. >>>>>> Hmm, I am probably misreading the code but at least current Linus' tree >>>>>> reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even >>>>>> for fully mapped THP. >>>>> Well, you read correctly, but it was not intended. I screwed it up at some >>>>> point. >>>>> >>>>> See the patch below. It should make it work as intened. >>>>> >>>>> It's not bug as such, but inefficientcy. We add page to the queue where >>>>> it's not needed. >>>> But that adding to queue doesn't affect whether the page will be freed >>>> immediately if there are no more partial mappings, right? I don't see >>>> deferred_split_huge_page() pinning the page. >>>> So your patch wouldn't make THPs freed immediately in cases where they >>>> haven't been freed before immediately, it just fixes a minor >>>> inefficiency with queue manipulation? >>> Ohh, right. I can see that in free_transhuge_page now. So fully mapped >>> THPs really do not matter and what I have considered an odd case is >>> really happening more often. >>> >>> That being said this will not help at all for what Yang Shi is seeing >>> and we need a more proactive deferred splitting as I've mentioned >>> earlier. >> It was not intended to fix the issue. It's fix for current logic. I'm >> playing with the work approach now. > Below is what I've come up with. It appears to be functional. > > Any comments? 
Thanks, Kirill and Michal. Splitting more proactively is definitely one way to eliminate the huge accumulation of deferred split THPs; I did consider this approach before I came up with the memcg-aware approach. But I thought it has some problems:

First of all, we can't prove whether this is a universal win for most workloads. For some workloads (as I mentioned about our usecase) we do see a lot of THPs accumulate for a while, but for other workloads, e.g. a kernel build, they are very short-lived.

Secondly, it may not be fair to workloads which don't generate many deferred split THPs, or whose THPs are short-lived. The cpu time is effectively abused by the workloads that generate excessive deferred split THPs, isn't it?

With memcg awareness, the deferred split THPs are isolated and capped per memcg. Long-lived deferred split THPs can't accumulate beyond the memcg limit. And the cpu time spent splitting them is accounted to the memcgs that generated them: whoever generates them pays for it. This sounds fairer, and we could achieve much better isolation.

Also, I think the discussion has been diverted and misled by the number of excessive deferred split THPs. To be clear, I didn't mean that the excessive deferred split THPs themselves are the problem for us (I agree it may waste memory to keep that many deferred split THPs unusable); the problem is the OOM, since they couldn't be split by memcg limit reclaim because the shrinker was not memcg aware.
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index d77d717c620c..c576e9d772b7 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -758,6 +758,8 @@ typedef struct pglist_data { > spinlock_t split_queue_lock; > struct list_head split_queue; > unsigned long split_queue_len; > + unsigned int deferred_split_calls; > + struct work_struct deferred_split_work; > #endif > > /* Fields commonly accessed by the page reclaim scanner */ > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index de1f15969e27..12d109bbe8ac 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2820,22 +2820,6 @@ void free_transhuge_page(struct page *page) > free_compound_page(page); > } > > -void deferred_split_huge_page(struct page *page) > -{ > - struct pglist_data *pgdata = NODE_DATA(page_to_nid(page)); > - unsigned long flags; > - > - VM_BUG_ON_PAGE(!PageTransHuge(page), page); > - > - spin_lock_irqsave(&pgdata->split_queue_lock, flags); > - if (list_empty(page_deferred_list(page))) { > - count_vm_event(THP_DEFERRED_SPLIT_PAGE); > - list_add_tail(page_deferred_list(page), &pgdata->split_queue); > - pgdata->split_queue_len++; > - } > - spin_unlock_irqrestore(&pgdata->split_queue_lock, flags); > -} > - > static unsigned long deferred_split_count(struct shrinker *shrink, > struct shrink_control *sc) > { > @@ -2901,6 +2885,44 @@ static struct shrinker deferred_split_shrinker = { > .flags = SHRINKER_NUMA_AWARE, > }; > > +void flush_deferred_split_queue(struct work_struct *work) > +{ > + struct pglist_data *pgdata; > + struct shrink_control sc; > + > + pgdata = container_of(work, struct pglist_data, deferred_split_work); > + sc.nid = pgdata->node_id; > + sc.nr_to_scan = 0; /* Unlimited */ > + > + deferred_split_scan(NULL, &sc); > +} > + > +#define NR_CALLS_TO_SPLIT 32 > +#define NR_PAGES_ON_QUEUE_TO_SPLIT 16 > + > +void deferred_split_huge_page(struct page *page) > +{ > + struct pglist_data *pgdata = NODE_DATA(page_to_nid(page)); > + unsigned long 
flags; > + > + VM_BUG_ON_PAGE(!PageTransHuge(page), page); > + > + spin_lock_irqsave(&pgdata->split_queue_lock, flags); > + if (list_empty(page_deferred_list(page))) { > + count_vm_event(THP_DEFERRED_SPLIT_PAGE); > + list_add_tail(page_deferred_list(page), &pgdata->split_queue); > + pgdata->split_queue_len++; > + pgdata->deferred_split_calls++; > + } > + > + if (pgdata->deferred_split_calls > NR_CALLS_TO_SPLIT && > + pgdata->split_queue_len > NR_PAGES_ON_QUEUE_TO_SPLIT) { > + pgdata->deferred_split_calls = 0; > + schedule_work(&pgdata->deferred_split_work); > + } > + spin_unlock_irqrestore(&pgdata->split_queue_lock, flags); > +} > + > #ifdef CONFIG_DEBUG_FS > static int split_huge_pages_set(void *data, u64 val) > { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9c9194959271..86af66d463e9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6636,11 +6636,14 @@ static unsigned long __init calc_memmap_size(unsigned long spanned_pages, > } > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > +void flush_deferred_split_queue(struct work_struct *work); > static void pgdat_init_split_queue(struct pglist_data *pgdat) > { > spin_lock_init(&pgdat->split_queue_lock); > INIT_LIST_HEAD(&pgdat->split_queue); > pgdat->split_queue_len = 0; > + pgdat->deferred_split_calls = 0; > + INIT_WORK(&pgdat->deferred_split_work, flush_deferred_split_queue); > } > #else > static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
On Tue 27-08-19 10:06:20, Yang Shi wrote: > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote: > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote: > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote: > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote: > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > > > > > > > Unmapped completely pages will be freed with current code. Deferred split > > > > > > > > only applies to partly mapped THPs: at least on 4k of the THP is still > > > > > > > > mapped somewhere. > > > > > > > Hmm, I am probably misreading the code but at least current Linus' tree > > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even > > > > > > > for fully mapped THP. > > > > > > Well, you read correctly, but it was not intended. I screwed it up at some > > > > > > point. > > > > > > > > > > > > See the patch below. It should make it work as intened. > > > > > > > > > > > > It's not bug as such, but inefficientcy. We add page to the queue where > > > > > > it's not needed. > > > > > But that adding to queue doesn't affect whether the page will be freed > > > > > immediately if there are no more partial mappings, right? I don't see > > > > > deferred_split_huge_page() pinning the page. > > > > > So your patch wouldn't make THPs freed immediately in cases where they > > > > > haven't been freed before immediately, it just fixes a minor > > > > > inefficiency with queue manipulation? > > > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped > > > > THPs really do not matter and what I have considered an odd case is > > > > really happening more often. 
> > > > That being said this will not help at all for what Yang Shi is seeing > > > > and we need a more proactive deferred splitting as I've mentioned > > > > earlier. > > > It was not intended to fix the issue. It's fix for current logic. I'm > > > playing with the work approach now. > > Below is what I've come up with. It appears to be functional. > > > > Any comments? > > Thanks, Kirill and Michal. Doing split more proactive is definitely a choice > to eliminate huge accumulated deferred split THPs, I did think about this > approach before I came up with memcg aware approach. But, I thought this > approach has some problems: > > First of all, we can't prove if this is a universal win for the most > workloads or not. For some workloads (as I mentioned about our usecase), we > do see a lot THPs accumulated for a while, but they are very short-lived for > other workloads, i.e. kernel build. > > Secondly, it may be not fair for some workloads which don't generate too > many deferred split THPs or those THPs are short-lived. Actually, the cpu > time is abused by the excessive deferred split THPs generators, isn't it? Yes this is indeed true. Do we have any idea on how much time that actually is? > With memcg awareness, the deferred split THPs actually are isolated and > capped by memcg. The long-lived deferred split THPs can't be accumulated too > many due to the limit of memcg. And, cpu time spent in splitting them would > just account to the memcgs who generate that many deferred split THPs, who > generate them who pay for it. This sounds more fair and we could achieve > much better isolation. On the other hand, deferring the split, and the freeing up, of a non-trivial amount > of memory is a problem I consider quite serious because it affects not > only the memcg workload which has to do the reclaim but also other > consumers of memory because large memory blocks could be used for higher > order allocations.
> And, I think the discussion is diverted and mislead by the number of > excessive deferred split THPs. To be clear, I didn't mean the excessive > deferred split THPs are problem for us (I agree it may waste memory to have > that many deferred split THPs not usable), the problem is the oom since they > couldn't be split by memcg limit reclaim since the shrinker was not memcg > aware. Well, I would like to see how much of a problem the memcg OOM really is after deferred splitting is more time constrained. Maybe we will find that no special memcg-aware solution is really needed.
On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote: > On Tue 27-08-19 10:06:20, Yang Shi wrote: > > > > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote: > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote: > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote: > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote: > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote: > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > > > > > > > > Unmapped completely pages will be freed with current code. Deferred split > > > > > > > > > only applies to partly mapped THPs: at least on 4k of the THP is still > > > > > > > > > mapped somewhere. > > > > > > > > Hmm, I am probably misreading the code but at least current Linus' tree > > > > > > > > reads page_remove_rmap -> [page_remove_anon_compound_rmap ->\ deferred_split_huge_page even > > > > > > > > for fully mapped THP. > > > > > > > Well, you read correctly, but it was not intended. I screwed it up at some > > > > > > > point. > > > > > > > > > > > > > > See the patch below. It should make it work as intened. > > > > > > > > > > > > > > It's not bug as such, but inefficientcy. We add page to the queue where > > > > > > > it's not needed. > > > > > > But that adding to queue doesn't affect whether the page will be freed > > > > > > immediately if there are no more partial mappings, right? I don't see > > > > > > deferred_split_huge_page() pinning the page. > > > > > > So your patch wouldn't make THPs freed immediately in cases where they > > > > > > haven't been freed before immediately, it just fixes a minor > > > > > > inefficiency with queue manipulation? > > > > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped > > > > > THPs really do not matter and what I have considered an odd case is > > > > > really happening more often. 
> > > > > That being said this will not help at all for what Yang Shi is seeing > > > > > and we need a more proactive deferred splitting as I've mentioned > > > > > earlier. > > > > It was not intended to fix the issue. It's fix for current logic. I'm > > > > playing with the work approach now. > > > Below is what I've come up with. It appears to be functional. > > > > > > Any comments? > > > > Thanks, Kirill and Michal. Doing split more proactive is definitely a choice > > to eliminate huge accumulated deferred split THPs, I did think about this > > approach before I came up with memcg aware approach. But, I thought this > > approach has some problems: > > > > First of all, we can't prove if this is a universal win for the most > > workloads or not. For some workloads (as I mentioned about our usecase), we > > do see a lot THPs accumulated for a while, but they are very short-lived for > > other workloads, i.e. kernel build. > > > > Secondly, it may be not fair for some workloads which don't generate too > > many deferred split THPs or those THPs are short-lived. Actually, the cpu > > time is abused by the excessive deferred split THPs generators, isn't it? > > Yes this is indeed true. Do we have any idea on how much time that > actually is? For the uncontended case, splitting 1G worth of pages (2MiB x 512) takes a bit more than 50 ms in my setup. But that's a best-case scenario: pages not shared across multiple processes, no contention on ptl, page lock, etc. > > With memcg awareness, the deferred split THPs actually are isolated and > > capped by memcg. The long-lived deferred split THPs can't be accumulated too > > many due to the limit of memcg. And, cpu time spent in splitting them would > > just account to the memcgs who generate that many deferred split THPs, who > > generate them who pay for it. This sounds more fair and we could achieve > > much better isolation.
> > On the other hand, deferring the split, and the freeing up, of a non-trivial amount > of memory is a problem I consider quite serious because it affects not > only the memcg workload which has to do the reclaim but also other > consumers of memory because large memory blocks could be used for higher > order allocations. Maybe instead of driving the split from the number of pages on the queue we can take a hint from compaction when it struggles to get high-order pages? We can also try to use schedule_delayed_work() instead of plain schedule_work() to give short-lived pages a chance to get freed before the splitting attempt. > > And, I think the discussion is diverted and mislead by the number of > > excessive deferred split THPs. To be clear, I didn't mean the excessive > > deferred split THPs are problem for us (I agree it may waste memory to have > > that many deferred split THPs not usable), the problem is the oom since they > > couldn't be split by memcg limit reclaim since the shrinker was not memcg > > aware. > > Well, I would like to see how much of a problem the memcg OOM really is > after deferred splitting is more time constrained. Maybe we will find > that there is no special memcg aware solution really needed. > -- > Michal Hocko > SUSE Labs
On Wed, Aug 28, 2019 at 02:12:53PM +0000, Michal Hocko wrote:
> On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:
> > On Wed, Aug 28, 2019 at 09:57:08AM +0200, Michal Hocko wrote:
> > > On Tue 27-08-19 10:06:20, Yang Shi wrote:
> > > >
> > > > On 8/27/19 5:59 AM, Kirill A. Shutemov wrote:
> > > > > On Tue, Aug 27, 2019 at 03:17:39PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Tue, Aug 27, 2019 at 02:09:23PM +0200, Michal Hocko wrote:
> > > > > > > On Tue 27-08-19 14:01:56, Vlastimil Babka wrote:
> > > > > > > > On 8/27/19 1:02 PM, Kirill A. Shutemov wrote:
> > > > > > > > > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote:
> > > > > > > > > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote:
> > > > > > > > > > > Completely unmapped pages will be freed with the current code. Deferred split
> > > > > > > > > > > only applies to partly mapped THPs: at least one 4k page of the THP is still
> > > > > > > > > > > mapped somewhere.
> > > > > > > > > > Hmm, I am probably misreading the code but at least current Linus' tree
> > > > > > > > > > reads page_remove_rmap -> page_remove_anon_compound_rmap ->
> > > > > > > > > > deferred_split_huge_page even for fully mapped THP.
> > > > > > > > > Well, you read correctly, but it was not intended. I screwed it up at some
> > > > > > > > > point.
> > > > > > > > >
> > > > > > > > > See the patch below. It should make it work as intended.
> > > > > > > > >
> > > > > > > > > It's not a bug as such, but an inefficiency. We add the page to the queue where
> > > > > > > > > it's not needed.
> > > > > > > > But that adding to the queue doesn't affect whether the page will be freed
> > > > > > > > immediately if there are no more partial mappings, right? I don't see
> > > > > > > > deferred_split_huge_page() pinning the page.
> > > > > > > > So your patch wouldn't make THPs freed immediately in cases where they
> > > > > > > > haven't been freed before immediately, it just fixes a minor
> > > > > > > > inefficiency with queue manipulation?
> > > > > > > Ohh, right. I can see that in free_transhuge_page now. So fully mapped
> > > > > > > THPs really do not matter and what I have considered an odd case is
> > > > > > > really happening more often.
> > > > > > >
> > > > > > > That being said this will not help at all for what Yang Shi is seeing
> > > > > > > and we need a more proactive deferred splitting as I've mentioned
> > > > > > > earlier.
> > > > > > It was not intended to fix the issue. It's a fix for the current logic. I'm
> > > > > > playing with the work approach now.
> > > > > Below is what I've come up with. It appears to be functional.
> > > > >
> > > > > Any comments?
> > > >
> > > > Thanks, Kirill and Michal. Doing the split more proactively is definitely a choice
> > > > to eliminate huge accumulated deferred split THPs, I did think about this
> > > > approach before I came up with the memcg aware approach. But, I thought this
> > > > approach has some problems:
> > > >
> > > > First of all, we can't prove if this is a universal win for most
> > > > workloads or not. For some workloads (as I mentioned about our usecase), we
> > > > do see a lot of THPs accumulated for a while, but they are very short-lived for
> > > > other workloads, e.g. kernel build.
> > > >
> > > > Secondly, it may be not fair for some workloads which don't generate too
> > > > many deferred split THPs or whose THPs are short-lived. Actually, the cpu
> > > > time is abused by the excessive deferred split THP generators, isn't it?
> > >
> > > Yes this is indeed true. Do we have any idea on how much time that
> > > actually is?
> >
> > For the uncontended case, splitting 1G worth of pages (2MiB x 512) takes a bit
> > more than 50 ms in my setup.
> > But it's a best-case scenario: pages not shared
> > across multiple processes, no contention on ptl, page lock, etc.
>
> Any idea about a bad case?

Not really.

How bad do you want it to get? How many processes share the page? Access
pattern? Locking situation?

Worst case scenario: no progress on splitting due to pins or locking
conflicts (trylock failure).

> > > > With memcg awareness, the deferred split THPs actually are isolated and
> > > > capped by memcg. The long-lived deferred split THPs can't accumulate too
> > > > much due to the memcg limit. And, the cpu time spent splitting them is
> > > > accounted to the memcgs which generated that many deferred split THPs:
> > > > who generates them pays for it. This sounds more fair and we could achieve
> > > > much better isolation.
> > >
> > > On the other hand, deferring the split and freeing up a non-trivial amount
> > > of memory is a problem I consider quite serious because it affects not
> > > only the memcg workload which has to do the reclaim but also other
> > > consumers of memory, because large memory blocks could be used for higher
> > > order allocations.
> >
> > Maybe instead of driving the split from the number of pages on the queue we
> > can take a hint from compaction when it struggles to get high order pages?
>
> This is still unbounded in time.

I'm not sure we should focus on time.

We need to make sure that we don't make overall system health worse. Who
cares if we have pages on the deferred split list as long as we don't have
another user for the memory?

> > We can also try to use schedule_delayed_work() instead of plain
> > schedule_work() to give short-lived pages a chance to get freed before
> > the splitting attempt.
>
> No problem with that as long as this is well bound in time.
> --
> Michal Hocko
> SUSE Labs
On Wed 28-08-19 17:46:59, Kirill A. Shutemov wrote:
> On Wed, Aug 28, 2019 at 02:12:53PM +0000, Michal Hocko wrote:
> > On Wed 28-08-19 17:03:29, Kirill A. Shutemov wrote:

[...]

> > > > Yes this is indeed true. Do we have any idea on how much time that
> > > > actually is?
> > >
> > > For the uncontended case, splitting 1G worth of pages (2MiB x 512) takes a bit
> > > more than 50 ms in my setup.
> > > But it's a best-case scenario: pages not shared
> > > across multiple processes, no contention on ptl, page lock, etc.
> >
> > Any idea about a bad case?
>
> Not really.
>
> How bad do you want it to get? How many processes share the page? Access
> pattern? Locking situation?

Let's say how hard a regular user can make this?

> Worst case scenario: no progress on splitting due to pins or locking
> conflicts (trylock failure).

[...]

> I'm not sure we should focus on time.
>
> We need to make sure that we don't make overall system health worse. Who
> cares if we have pages on the deferred split list as long as we don't have
> another user for the memory?

We do care for all those users which do not want to get stalled when
requesting that memory. And you cannot really predict that, right? So
the sooner the better. Modulo the time wasted for pointless splitting of
course.
I am afraid defining the best timing here is going to be hard but let's focus on workloads that are known to generate partial THPs and see how that behaves.
On Wed, Aug 28, 2019 at 9:02 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 28-08-19 17:46:59, Kirill A. Shutemov wrote:

[...]

> > I'm not sure we should focus on time.
> >
> > We need to make sure that we don't make overall system health worse. Who
> > cares if we have pages on the deferred split list as long as we don't have
> > another user for the memory?
> We do care for all those users which do not want to get stalled when
> requesting that memory. And you cannot really predict that, right? So
> the sooner the better. Modulo the time wasted for pointless splitting of
> course. I am afraid defining the best timing here is going to be hard
> but let's focus on workloads that are known to generate partial THPs and
> see how that behaves.

I suppose we are just concerned about the global memory pressure
incurred by the excessive deferred split THPs. As long as there are no
other users for that memory we don't have to waste time caring about it.
So, I'm wondering why don't we try harder in kswapd?

Currently, deferred split THPs get shrunk like slab. The number of
objects scanned is determined by some factors, e.g. scan priority,
shrinker->seeks, etc, in order to avoid over-reclaiming filesystem
caches and incurring extra I/O. But, we don't have to worry about over
reclaim for deferred split THPs, right? We definitely could shrink them
more aggressively in kswapd context.

For example, we could simply set shrinker->seeks to 0; now it is
DEFAULT_SEEKS.

And, we also could consider boosting the watermark to wake up kswapd
earlier once we see excessive deferred split THPs accumulated.

> --
> Michal Hocko
> SUSE Labs
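Yang's shrinker suggestion amounts to a one-field change. Sketched below against the shape of the deferred split shrinker in mm/huge_memory.c; this is illustrative only, and whether dropping `seeks` alone makes kswapd aggressive enough is exactly the open question in this subthread.

```c
/* Sketch of the proposal, not a tested patch: with ->seeks set to 0
 * the reclaim code treats the queue as cheap to rebuild, so the scan
 * count is no longer scaled down the way it is with DEFAULT_SEEKS,
 * and kswapd splits queued THPs far more aggressively. */
static struct shrinker deferred_split_shrinker = {
	.count_objects = deferred_split_count,
	.scan_objects = deferred_split_scan,
	.seeks = 0,			/* was DEFAULT_SEEKS */
	.flags = SHRINKER_NUMA_AWARE,
};
```

Unlike page cache or slab, over-reclaiming here costs only the CPU time of the splits, not extra I/O, which is why zero seeks is defensible for this particular shrinker.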
On Thu 29-08-19 10:03:21, Yang Shi wrote:
> On Wed, Aug 28, 2019 at 9:02 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 28-08-19 17:46:59, Kirill A. Shutemov wrote:

[...]
> I suppose we are just concerned about the global memory pressure
> incurred by the excessive deferred split THPs. As long as there are no
> other users for that memory we don't have to waste time caring about it.
> So, I'm wondering why don't we try harder in kswapd?

kswapd is already late. There shouldn't be any need for the reclaim as
long as there is a lot of memory that can be directly freed.

> Currently, deferred split THPs get shrunk like slab. The number of
> objects scanned is determined by some factors, e.g. scan priority,
> shrinker->seeks, etc, in order to avoid over-reclaiming filesystem
> caches and incurring extra I/O. But, we don't have to worry about over
> reclaim for deferred split THPs, right? We definitely could shrink them
> more aggressively in kswapd context.

This is certainly possible. I am just wondering why we should cram this
into the reclaim when we have a reasonable trigger to do that.

> For example, we could simply set shrinker->seeks to 0; now it is
> DEFAULT_SEEKS.
>
> And, we also could consider boosting the watermark to wake up kswapd
> earlier once we see excessive deferred split THPs accumulated.
This has other side effects, right?
On Wed, Aug 28, 2019 at 06:02:24PM +0200, Michal Hocko wrote:
> > > Any idea about a bad case?
> >
> > Not really.
> >
> > How bad do you want it to get? How many processes share the page? Access
> > pattern? Locking situation?
>
> Let's say how hard a regular user can make this?

It bumped to ~170 ms if each THP was mapped 5 times. Adding ptl
contention (a tight loop of MADV_DONTNEED) in 3 processes that map the
page, the time to split bumped to ~740 ms.

Overall, it's reasonable to take ~100 ms per GB of huge pages as a rule
of thumb.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 99ca040..93fc183 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -968,8 +968,8 @@ ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
                 with huge pages
 ShmemPmdMapped: Shared memory mapped into userspace with huge pages
 KReclaimable: Kernel allocations that the kernel will attempt to reclaim
-              under memory pressure. Includes SReclaimable (below), and other
-              direct allocations with a shrinker.
+              under memory pressure. Includes SReclaimable (below), deferred
+              split THPs, and other direct allocations with a shrinker.
 Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
 SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 61c9ffd..c194630 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -162,7 +162,7 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
 }
-void deferred_split_huge_page(struct page *page);
+void deferred_split_huge_page(struct page *page, unsigned int nr);

 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct page *page);
@@ -324,7 +324,10 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-static inline void deferred_split_huge_page(struct page *page) {}
+static inline void deferred_split_huge_page(struct page *page, unsigned int nr)
+{
+}
+
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 156640c..17e0fc5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -138,7 +138,8 @@ struct page {
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
-			unsigned long _compound_pad_2;
+			/* Freeable normal pages for deferred split shrinker */
+			unsigned long nr_freeable; /* For both global and memcg */
 			struct list_head deferred_list;
 		};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c9a596e..e04ac4d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)

 	INIT_LIST_HEAD(page_deferred_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+	page[2].nr_freeable = 0;
 }

 static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
@@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
+			__mod_node_page_state(page_pgdat(page),
+					NR_KERNEL_MISC_RECLAIMABLE,
+					-head[2].nr_freeable);
+			head[2].nr_freeable = 0;
 		}
 		if (mapping)
 			__dec_node_page_state(page, NR_SHMEM_THPS);
@@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
 		ds_queue->split_queue_len--;
 		list_del(page_deferred_list(page));
 	}
+	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+			-page[2].nr_freeable);
+	page[2].nr_freeable = 0;
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 	free_compound_page(page);
 }

-void deferred_split_huge_page(struct page *page)
+void deferred_split_huge_page(struct page *page, unsigned int nr)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
 #ifdef CONFIG_MEMCG
@@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
 		return;

 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	page[2].nr_freeable += nr;
+	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
+			nr);
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2a..6008fab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)

 	if (nr) {
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
-		deferred_split_huge_page(page);
+		deferred_split_huge_page(page, nr);
 	}
 }

@@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
 		clear_page_mlock(page);

 	if (PageTransCompound(page))
-		deferred_split_huge_page(compound_head(page));
+		deferred_split_huge_page(compound_head(page), 1);

 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
Available memory is one of the most important metrics for memory
pressure. Currently, the deferred split THPs are not accounted into
available memory, but they are actually reclaimable, like reclaimable
slabs.

And they seem very common with common workloads when THP is enabled. A
simple run of the MariaDB test from mmtest with THP enabled as always
shows it could generate over fifteen thousand deferred split THPs
(accumulating around 30G in a one hour run, 75% of the 40G memory of my
VM). It looks worth accounting them in MemAvailable.

Record the number of freeable normal pages of deferred split THPs into
the second tail page, and account it into KReclaimable. Although THP
allocations are not exactly "kernel allocations", once they are
unmapped, they are in fact kernel-only. KReclaimable has been accounted
into MemAvailable.

When the deferred split THPs get split due to memory pressure or get
freed, just decrease by the recorded number.

With this change, when running a program which populates 1G of address
space and then calls madvise(MADV_DONTNEED) on 511 pages of every THP,
/proc/meminfo shows the deferred split THPs are accounted properly.

After populating, before calling madvise(MADV_DONTNEED):
MemAvailable: 43531960 kB
AnonPages: 1096660 kB
KReclaimable: 26156 kB
AnonHugePages: 1056768 kB

After calling madvise(MADV_DONTNEED):
MemAvailable: 44411164 kB
AnonPages: 50140 kB
KReclaimable: 1070640 kB
AnonHugePages: 10240 kB

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/filesystems/proc.txt |  4 ++--
 include/linux/huge_mm.h            |  7 +++++--
 include/linux/mm_types.h           |  3 ++-
 mm/huge_memory.c                   | 13 ++++++++++++-
 mm/rmap.c                          |  4 ++--
 5 files changed, 23 insertions(+), 8 deletions(-)