
[next] mm/swap.c: reduce lock contention in lru_cache_add

Message ID 1605860847-47445-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series [next] mm/swap.c: reduce lock contention in lru_cache_add

Commit Message

Alex Shi Nov. 20, 2020, 8:27 a.m. UTC
The current relock logic switches the lru_lock whenever a page from a new
lruvec is found, so if 2 memcgs are reading files or allocating pages at the
same time, they can end up holding the lru_lock alternately and waiting on
each other because of the fairness property of the ticket spinlock.

This patch sorts the pages by lruvec and takes each lru_lock only once in
the above scenario, which avoids the fairness-induced waiting from repeatedly
re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
~5% performance gain on my 2P*20core*HT machine.

Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 8 deletions(-)

Comments

Andrew Morton Nov. 20, 2020, 11:19 p.m. UTC | #1
On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> The current relock logic switches the lru_lock whenever a page from a new
> lruvec is found, so if 2 memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately and waiting on
> each other because of the fairness property of the ticket spinlock.
> 
> This patch sorts the pages by lruvec and takes each lru_lock only once in
> the above scenario, which avoids the fairness-induced waiting from repeatedly
> re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
> ~5% performance gain on my 2P*20core*HT machine.

But what happens when all or most of the pages belong to the same
lruvec?  This sounds like the common case - won't it suffer?
Alex Shi Nov. 23, 2020, 4:46 a.m. UTC | #2
On 2020/11/21 7:19 AM, Andrew Morton wrote:
> On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
>> The current relock logic switches the lru_lock whenever a page from a new
>> lruvec is found, so if 2 memcgs are reading files or allocating pages at the
>> same time, they can end up holding the lru_lock alternately and waiting on
>> each other because of the fairness property of the ticket spinlock.
>>
>> This patch sorts the pages by lruvec and takes each lru_lock only once in
>> the above scenario, which avoids the fairness-induced waiting from repeatedly
>> re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
>> ~5% performance gain on my 2P*20core*HT machine.
> 
> But what happens when all or most of the pages belong to the same
> lruvec?  This sounds like the common case - won't it suffer?
> 
Hi Andrew,

My testing shows no regression in this situation, e.g. with the original CentOS 7 setup;
most of the time in lru-sensitive cases is spent on the lru_lock anyway.

Thanks
Alex
Vlastimil Babka Nov. 25, 2020, 3:38 p.m. UTC | #3
On 11/20/20 9:27 AM, Alex Shi wrote:
> The current relock logic switches the lru_lock whenever a page from a new
> lruvec is found, so if 2 memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately and waiting on
> each other because of the fairness property of the ticket spinlock.
> 
> This patch sorts the pages by lruvec and takes each lru_lock only once in
> the above scenario, which avoids the fairness-induced waiting from repeatedly
> re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
> ~5% performance gain on my 2P*20core*HT machine.

Hm, once you sort the pages like this, it's a shame not to splice them
instead of more list_del() + list_add() iterations. update_lru_size()
could also be called once?

> Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..c787b38bf9c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   	trace_mm_lru_insertion(page, lru);
>   }
>   
> +struct lruvecs {
> +	struct list_head lists[PAGEVEC_SIZE];
> +	struct lruvec *vecs[PAGEVEC_SIZE];
> +};
> +
> +/* Sort pvec pages on their lruvec */
> +int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
> +{
> +	int i, j, nr_lruvec;
> +	struct page *page;
> +	struct lruvec *lruvec = NULL;
> +
> +	lruvecs->vecs[0] = NULL;
> +	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
> +		page = pvec->pages[i];
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
> +		/* Try to find a same lruvec */
> +		for (j = 0; j <= nr_lruvec; j++)
> +			if (lruvec == lruvecs->vecs[j])
> +				break;
> +
> +		/* A new lruvec */
> +		if (j > nr_lruvec) {
> +			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
> +			lruvecs->vecs[nr_lruvec] = lruvec;
> +			j = nr_lruvec++;
> +			lruvecs->vecs[nr_lruvec] = 0;
> +		}
> +
> +		list_add_tail(&page->lru, &lruvecs->lists[j]);
> +	}
> +
> +	return nr_lruvec;
> +}
> +
>   /*
>    * Add the passed pages to the LRU, then drop the caller's refcount
>    * on them.  Reinitialises the caller's pagevec.
>    */
>   void __pagevec_lru_add(struct pagevec *pvec)
>   {
> -	int i;
> -	struct lruvec *lruvec = NULL;
> +	int i, nr_lruvec;
>   	unsigned long flags = 0;
> +	struct page *page;
> +	struct lruvecs lruvecs;
>   
> -	for (i = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>   
> -		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);
> +	for (i = 0; i < nr_lruvec; i++) {
> +		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
> +		while (!list_empty(&lruvecs.lists[i])) {
> +			page = lru_to_page(&lruvecs.lists[i]);
> +			list_del(&page->lru);
> +			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
> +		}
> +		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
>   	}
> -	if (lruvec)
> -		unlock_page_lruvec_irqrestore(lruvec, flags);
> +
>   	release_pages(pvec->pages, pvec->nr);
>   	pagevec_reinit(pvec);
>   }
>
Alex Shi Nov. 26, 2020, 3:12 a.m. UTC | #4
On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
> On 11/20/20 9:27 AM, Alex Shi wrote:
>> The current relock logic switches the lru_lock whenever a page from a new
>> lruvec is found, so if 2 memcgs are reading files or allocating pages at the
>> same time, they can end up holding the lru_lock alternately and waiting on
>> each other because of the fairness property of the ticket spinlock.
>>
>> This patch sorts the pages by lruvec and takes each lru_lock only once in
>> the above scenario, which avoids the fairness-induced waiting from repeatedly
>> re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
>> ~5% performance gain on my 2P*20core*HT machine.
> 
> Hm, once you sort the pages like this, it's a shame not to splice them instead of more list_del() + list_add() iterations. update_lru_size() could also be called once?

Yes, it looks like a good idea to use splice instead of list_del/add, but pages
may be on different lru lists within the same lruvec, and may also come from
different zones. That could mean 5 passes for the different lists, and more for zones...

So I gave up on that attempt.
Yu Zhao Nov. 26, 2020, 4:52 a.m. UTC | #5
On Fri, Nov 20, 2020 at 04:27:27PM +0800, Alex Shi wrote:
> The current relock logic switches the lru_lock whenever a page from a new
> lruvec is found, so if 2 memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately and waiting on
> each other because of the fairness property of the ticket spinlock.
> 
> This patch sorts the pages by lruvec and takes each lru_lock only once in
> the above scenario, which avoids the fairness-induced waiting from repeatedly
> re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
> ~5% performance gain on my 2P*20core*HT machine.
> 
> Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..c787b38bf9c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> +struct lruvecs {
> +	struct list_head lists[PAGEVEC_SIZE];
> +	struct lruvec *vecs[PAGEVEC_SIZE];
> +};
> +
> +/* Sort pvec pages on their lruvec */
> +int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
> +{
> +	int i, j, nr_lruvec;
> +	struct page *page;
> +	struct lruvec *lruvec = NULL;
> +
> +	lruvecs->vecs[0] = NULL;
> +	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
> +		page = pvec->pages[i];
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
> +		/* Try to find a same lruvec */
> +		for (j = 0; j <= nr_lruvec; j++)
> +			if (lruvec == lruvecs->vecs[j])
> +				break;
> +
> +		/* A new lruvec */
> +		if (j > nr_lruvec) {
> +			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
> +			lruvecs->vecs[nr_lruvec] = lruvec;
> +			j = nr_lruvec++;
> +			lruvecs->vecs[nr_lruvec] = 0;
> +		}
> +
> +		list_add_tail(&page->lru, &lruvecs->lists[j]);
> +	}
> +
> +	return nr_lruvec;
> +}
> +
>  /*
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	int i;
> -	struct lruvec *lruvec = NULL;
> +	int i, nr_lruvec;
>  	unsigned long flags = 0;
> +	struct page *page;
> +	struct lruvecs lruvecs;
>  
> -	for (i = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);

Simply looping pvec multiple times (15 at most) for different lruvecs
would be better because 1) it requires no extra data structures and
therefore has better cache locality (theoretically faster) 2) it only
loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
impact on Android and Chrome OS.

> -		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);
> +	for (i = 0; i < nr_lruvec; i++) {
> +		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
> +		while (!list_empty(&lruvecs.lists[i])) {
> +			page = lru_to_page(&lruvecs.lists[i]);
> +			list_del(&page->lru);
> +			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
> +		}
> +		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
>  	}
> -	if (lruvec)
> -		unlock_page_lruvec_irqrestore(lruvec, flags);
> +
>  	release_pages(pvec->pages, pvec->nr);
>  	pagevec_reinit(pvec);
>  }
> -- 
> 2.29.GIT
>
Alex Shi Nov. 26, 2020, 6:39 a.m. UTC | #6
On 2020/11/26 12:52 PM, Yu Zhao wrote:
>>   */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -	int i;
>> -	struct lruvec *lruvec = NULL;
>> +	int i, nr_lruvec;
>>  	unsigned long flags = 0;
>> +	struct page *page;
>> +	struct lruvecs lruvecs;
>>  
>> -	for (i = 0; i < pagevec_count(pvec); i++) {
>> -		struct page *page = pvec->pages[i];
>> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> Simply looping pvec multiple times (15 at most) for different lruvecs
> would be better because 1) it requires no extra data structures and
> therefore has better cache locality (theoretically faster) 2) it only
> loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> impact on Android and Chrome OS.
> 

With multiple memcgs it does help a lot; I got a 30% gain on the readtwice
case. But yes, w/o MEMCG and NUMA it's good to keep the old behavior. So
would you like to send a proposal for this?

Thanks
Alex
Yu Zhao Nov. 26, 2020, 7:24 a.m. UTC | #7
On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
> 
> 
> On 2020/11/26 12:52 PM, Yu Zhao wrote:
> >>   */
> >>  void __pagevec_lru_add(struct pagevec *pvec)
> >>  {
> >> -	int i;
> >> -	struct lruvec *lruvec = NULL;
> >> +	int i, nr_lruvec;
> >>  	unsigned long flags = 0;
> >> +	struct page *page;
> >> +	struct lruvecs lruvecs;
> >>  
> >> -	for (i = 0; i < pagevec_count(pvec); i++) {
> >> -		struct page *page = pvec->pages[i];
> >> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> > Simply looping pvec multiple times (15 at most) for different lruvecs
> > would be better because 1) it requires no extra data structures and
> > therefore has better cache locality (theoretically faster) 2) it only
> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> > impact on Android and Chrome OS.
> > 
> 
> With multiple memcgs it does help a lot; I got a 30% gain on the readtwice
> case. But yes, w/o MEMCG and NUMA it's good to keep the old behavior. So
> would you like to send a proposal for this?

Oh, no, I'm not against your idea. I was saying it doesn't seem
necessary to sort -- a nested loop would just do the job given
pagevec is small.

diff --git a/mm/swap.c b/mm/swap.c
index cb3794e13b48..1d238edc2907 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
+		if (!page)
+			continue;
+
 		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+
+		for (j = i; j < pagevec_count(pvec); j++) {
+			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
+			    page_memcg(pvec->pages[j]) != page_memcg(page))
+				continue;
+
+			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
+			pvec->pages[j] = NULL;
+		}
 	}
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
Alex Shi Nov. 26, 2020, 8:09 a.m. UTC | #8
On 2020/11/26 3:24 PM, Yu Zhao wrote:
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	int i;
> +	int i, j;
>  	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
>  
> +		if (!page)
> +			continue;
> +
>  		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);
> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
> +				continue;
> +
> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
> +			pvec->pages[j] = NULL;
> +		}

Uh, I have to say your method is better than mine.
And this could be reused for all relock_page_lruvec() callers. I expect this could
speed up lru performance a lot!


>  	}
>  	if (lruvec)
>  		unlock_page_lruvec_irqrestore(lruvec, flags);
Vlastimil Babka Nov. 26, 2020, 11:05 a.m. UTC | #9
On 11/26/20 4:12 AM, Alex Shi wrote:
> 
> 
> On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
>> On 11/20/20 9:27 AM, Alex Shi wrote:
>>> The current relock logic switches the lru_lock whenever a page from a new
>>> lruvec is found, so if 2 memcgs are reading files or allocating pages at the
>>> same time, they can end up holding the lru_lock alternately and waiting on
>>> each other because of the fairness property of the ticket spinlock.
>>>
>>> This patch sorts the pages by lruvec and takes each lru_lock only once in
>>> the above scenario, which avoids the fairness-induced waiting from repeatedly
>>> re-acquiring the lock. With it, vm-scalability/case-lru-file-readtwice gets a
>>> ~5% performance gain on my 2P*20core*HT machine.
>> 
>> Hm, once you sort the pages like this, it's a shame not to splice them instead of more list_del() + list_add() iterations. update_lru_size() could also be called once?
> 
> Yes, it looks like a good idea to use splice instead of list_del/add, but pages
> may be on different lru lists within the same lruvec, and may also come from
> different zones. That could mean 5 passes for the different lists, and more for zones...

Hmm, zones wouldn't affect splicing (there's a per-node lru these days), but 
would affect accounting. And yeah, there are 5 lru lists, and we probably need 
to be under lru lock to stabilize where a page belongs, so pre-sorting without 
lock wouldn't be safe? Bummer.

> So I gave up on that attempt.
>
Vlastimil Babka Nov. 26, 2020, 11:22 a.m. UTC | #10
On 11/26/20 8:24 AM, Yu Zhao wrote:
> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>> 
>> 
>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>> >>   */
>> >>  void __pagevec_lru_add(struct pagevec *pvec)
>> >>  {
>> >> -	int i;
>> >> -	struct lruvec *lruvec = NULL;
>> >> +	int i, nr_lruvec;
>> >>  	unsigned long flags = 0;
>> >> +	struct page *page;
>> >> +	struct lruvecs lruvecs;
>> >>  
>> >> -	for (i = 0; i < pagevec_count(pvec); i++) {
>> >> -		struct page *page = pvec->pages[i];
>> >> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>> > Simply looping pvec multiple times (15 at most) for different lruvecs
>> > would be better because 1) it requires no extra data structures and
>> > therefore has better cache locality (theoretically faster) 2) it only
>> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
>> > impact on Android and Chrome OS.
>> > 
>> 
>> With multiple memcgs it does help a lot; I got a 30% gain on the readtwice
>> case. But yes, w/o MEMCG and NUMA it's good to keep the old behavior. So
>> would you like to send a proposal for this?
> 
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.

Yeah that could work. The worst case doesn't look nice (O(n^2)) but it should be
rather rare to have pages from many different memcgs/nodes?

Maybe with the following change? Avoids going through the nulls if we got lucky 
(or have !MEMCG !NUMA).

> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>    */
>   void __pagevec_lru_add(struct pagevec *pvec)
>   {
> -	int i;
> +	int i, j;

int i, j, n;

>   	struct lruvec *lruvec = NULL;
>   	unsigned long flags = 0;
>   

n = pagevec_count(pvec);
>   	for (i = 0; i < pagevec_count(pvec); i++) {

    	for (i = 0; n; i++) {

>   		struct page *page = pvec->pages[i];
>   
> +		if (!page)
> +			continue;
> +
>   		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);

		n--;

> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
> +				continue;
> +
> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
> +			pvec->pages[j] = NULL;

			n--;
> +		}
>   	}
>   	if (lruvec)
>   		unlock_page_lruvec_irqrestore(lruvec, flags);
>
Vlastimil Babka Nov. 26, 2020, 3:44 p.m. UTC | #11
On 11/26/20 12:22 PM, Vlastimil Babka wrote:
> On 11/26/20 8:24 AM, Yu Zhao wrote:
>> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>>> 
>>> 
>>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>>> >>   */
>>> >>  void __pagevec_lru_add(struct pagevec *pvec)
>>> >>  {
>>> >> -	int i;
>>> >> -	struct lruvec *lruvec = NULL;
>>> >> +	int i, nr_lruvec;
>>> >>  	unsigned long flags = 0;
>>> >> +	struct page *page;
>>> >> +	struct lruvecs lruvecs;
>>> >>  
>>> >> -	for (i = 0; i < pagevec_count(pvec); i++) {
>>> >> -		struct page *page = pvec->pages[i];
>>> >> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>>> > Simply looping pvec multiple times (15 at most) for different lruvecs
>>> > would be better because 1) it requires no extra data structures and
>>> > therefore has better cache locality (theoretically faster) 2) it only
>>> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
>>> > impact on Android and Chrome OS.
>>> > 
>>> 
>>> With multiple memcgs it does help a lot; I got a 30% gain on the readtwice
>>> case. But yes, w/o MEMCG and NUMA it's good to keep the old behavior. So
>>> would you like to send a proposal for this?
>> 
>> Oh, no, I'm not against your idea. I was saying it doesn't seem
>> necessary to sort -- a nested loop would just do the job given
>> pagevec is small.
> 
> Yeah that could work. The worst case doesn't look nice (O(n^2)) but it should be
> rather rare to have pages from many different memcgs/nodes?

However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes 63^2, 
it starts to be somewhat more worrying.

[1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/

> Maybe with the following change? Avoids going through the nulls if we got lucky
> (or have !MEMCG !NUMA).
> 
>> diff --git a/mm/swap.c b/mm/swap.c
>> index cb3794e13b48..1d238edc2907 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>>    */
>>   void __pagevec_lru_add(struct pagevec *pvec)
>>   {
>> -	int i;
>> +	int i, j;
> 
> int i, j, n;
> 
>>   	struct lruvec *lruvec = NULL;
>>   	unsigned long flags = 0;
>>   
> 
> n = pagevec_count(pvec);
>>   	for (i = 0; i < pagevec_count(pvec); i++) {
> 
>      	for (i = 0; n; i++) {
> 
>>   		struct page *page = pvec->pages[i];
>>   
>> +		if (!page)
>> +			continue;
>> +
>>   		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>> -		__pagevec_lru_add_fn(page, lruvec);
> 
> 		n--;
> 
>> +
>> +		for (j = i; j < pagevec_count(pvec); j++) {
>> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
>> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
>> +				continue;
>> +
>> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
>> +			pvec->pages[j] = NULL;
> 
> 			n--;
>> +		}
>>   	}
>>   	if (lruvec)
>>   		unlock_page_lruvec_irqrestore(lruvec, flags);
>> 
>
Matthew Wilcox Nov. 26, 2020, 3:55 p.m. UTC | #12
On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
> However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes
> 63^2, it starts to be somewhat more worrying.
> 
> [1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/

Well, Tim wanted it ;-)

I would suggest that rather than an insertion sort (or was it a bubble
sort?), we should be using a Shell sort.  It's ideal for these kinds of
smallish arrays.

https://en.wikipedia.org/wiki/Shellsort
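
For illustration, here is a minimal userspace sketch of how a Shellsort keyed on the
lruvec pointer could group a pagevec-sized batch so that pages sharing a lruvec end up
adjacent. The stand-in struct definitions, the shell_sort_pages() helper and the
truncated Ciura-style gap sequence are assumptions for this sketch only, not kernel
code or part of the patch under discussion:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel types, for illustration only. */
struct lruvec { int id; };
struct page { struct lruvec *lruvec; };

/*
 * Shellsort a pagevec-sized array of page pointers, keyed on the lruvec
 * pointer value, so that pages sharing a lruvec become adjacent.  The gap
 * sequence is a truncated Ciura-style one, enough for up to 63 entries.
 */
static void shell_sort_pages(struct page **pages, size_t n)
{
	static const size_t gaps[] = { 57, 23, 10, 4, 1 };
	size_t g, i, j;

	for (g = 0; g < sizeof(gaps) / sizeof(gaps[0]); g++) {
		size_t gap = gaps[g];

		for (i = gap; i < n; i++) {
			struct page *tmp = pages[i];

			/* Gapped insertion sort: shift larger keys right. */
			for (j = i; j >= gap &&
			     (uintptr_t)pages[j - gap]->lruvec > (uintptr_t)tmp->lruvec;
			     j -= gap)
				pages[j] = pages[j - gap];
			pages[j] = tmp;
		}
	}
}

int main(void)
{
	struct lruvec vecs[3] = { { 0 }, { 1 }, { 2 } };
	struct page pgs[6] = {
		{ &vecs[2] }, { &vecs[0] }, { &vecs[1] },
		{ &vecs[2] }, { &vecs[0] }, { &vecs[1] },
	};
	struct page *pages[6];
	size_t i;

	for (i = 0; i < 6; i++)
		pages[i] = &pgs[i];

	shell_sort_pages(pages, 6);

	/* Pages now come out grouped by lruvec, ready for one lock per group. */
	for (i = 0; i < 6; i++)
		printf("page %zu -> lruvec %d\n", i, pages[i]->lruvec->id);
	return 0;
}

In the kernel the key would presumably be the mem_cgroup_page_lruvec() result rather
than this toy pointer, and whether sorting beats the nested-loop approach for 15- or
63-entry pagevecs would need measuring.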
Alex Shi Nov. 27, 2020, 3:14 a.m. UTC | #13
On 2020/11/26 11:55 PM, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
>> However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes
>> 63^2, it starts to be somewhat more worrying.
>>
>> [1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/
> 
> Well, Tim wanted it ;-)
> 
> I would suggest that rather than an insertion sort (or was it a bubble
> sort?), we should be using a Shell sort.  It's ideal for these kinds of
> smallish arrays.
> 
> https://en.wikipedia.org/wiki/Shellsort
> 

Uh, that looks perfectly good! I'm gonna look into it. :)

Thanks!

Patch

diff --git a/mm/swap.c b/mm/swap.c
index 490553f3f9ef..c787b38bf9c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1009,24 +1009,65 @@  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 	trace_mm_lru_insertion(page, lru);
 }
 
+struct lruvecs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
+/* Sort pvec pages on their lruvec */
+int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
+{
+	int i, j, nr_lruvec;
+	struct page *page;
+	struct lruvec *lruvec = NULL;
+
+	lruvecs->vecs[0] = NULL;
+	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= nr_lruvec; j++)
+			if (lruvec == lruvecs->vecs[j])
+				break;
+
+		/* A new lruvec */
+		if (j > nr_lruvec) {
+			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
+			lruvecs->vecs[nr_lruvec] = lruvec;
+			j = nr_lruvec++;
+			lruvecs->vecs[nr_lruvec] = 0;
+		}
+
+		list_add_tail(&page->lru, &lruvecs->lists[j]);
+	}
+
+	return nr_lruvec;
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
-	struct lruvec *lruvec = NULL;
+	int i, nr_lruvec;
 	unsigned long flags = 0;
+	struct page *page;
+	struct lruvecs lruvecs;
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+	for (i = 0; i < nr_lruvec; i++) {
+		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+		while (!list_empty(&lruvecs.lists[i])) {
+			page = lru_to_page(&lruvecs.lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+		}
+		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }