Message ID | 1593752873-4493-16-git-send-email-alex.shi@linux.alibaba.com
---|---
State | New, archived
Series | per memcg lru lock
On Fri, Jul 3, 2020 at 8:09 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Hugh Dickins found a memcg change bug in the original version:
> If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have
> to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
> possible bad scenario would look like:
>
>         cpu 0                           cpu 1
>     lruvec = mem_cgroup_page_lruvec()
>                                         if (!isolate_lru_page())
>                                                 mem_cgroup_move_account
>
>     spin_lock_irqsave(&lruvec->lru_lock  <== wrong lock.
>
> So we need the ClearPageLRU to block isolate_lru_page(), then serialize
> the memcg change here.
>
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/swap.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index b24d5f69b93a..55eb2c2eed03 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -203,7 +203,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
>  EXPORT_SYMBOL_GPL(get_kernel_page);
>
>  static void pagevec_lru_move_fn(struct pagevec *pvec,
> -        void (*move_fn)(struct page *page, struct lruvec *lruvec))
> +        void (*move_fn)(struct page *page, struct lruvec *lruvec), bool add)
>  {
>      int i;
>      struct pglist_data *pgdat = NULL;
> @@ -221,8 +221,15 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>              spin_lock_irqsave(&pgdat->lru_lock, flags);
>          }
>
> +        /* new page add to lru or page moving between lru */
> +        if (!add && !TestClearPageLRU(page))
> +            continue;
> +
>          lruvec = mem_cgroup_page_lruvec(page, pgdat);
>          (*move_fn)(page, lruvec);
> +
> +        if (!add)
> +            SetPageLRU(page);
>      }
>      if (pgdat)
>          spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> @@ -259,7 +266,7 @@ void rotate_reclaimable_page(struct page *page)
>          local_lock_irqsave(&lru_rotate.lock, flags);
>          pvec = this_cpu_ptr(&lru_rotate.pvec);
>          if (!pagevec_add(pvec, page) || PageCompound(page))
> -            pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
> +            pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
>          local_unlock_irqrestore(&lru_rotate.lock, flags);
>      }
>  }
> @@ -328,7 +335,7 @@ static void activate_page_drain(int cpu)
>      struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
>
>      if (pagevec_count(pvec))
> -        pagevec_lru_move_fn(pvec, __activate_page);
> +        pagevec_lru_move_fn(pvec, __activate_page, false);
>  }
>
>  static bool need_activate_page_drain(int cpu)
> @@ -346,7 +353,7 @@ void activate_page(struct page *page)
>          pvec = this_cpu_ptr(&lru_pvecs.activate_page);
>          get_page(page);
>          if (!pagevec_add(pvec, page) || PageCompound(page))
> -            pagevec_lru_move_fn(pvec, __activate_page);
> +            pagevec_lru_move_fn(pvec, __activate_page, false);
>          local_unlock(&lru_pvecs.lock);
>      }
>  }
> @@ -621,21 +628,21 @@ void lru_add_drain_cpu(int cpu)
>
>          /* No harm done if a racing interrupt already did this */
>          local_lock_irqsave(&lru_rotate.lock, flags);
> -        pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
> +        pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
>          local_unlock_irqrestore(&lru_rotate.lock, flags);
>      }
>
>      pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
>      if (pagevec_count(pvec))
> -        pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
> +        pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
>
>      pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
>      if (pagevec_count(pvec))
> -        pagevec_lru_move_fn(pvec, lru_deactivate_fn);
> +        pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
>
>      pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
>      if (pagevec_count(pvec))
> -        pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
> +        pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
>
>      activate_page_drain(cpu);
>  }
> @@ -664,7 +671,7 @@ void deactivate_file_page(struct page *page)
>          pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
>
>          if (!pagevec_add(pvec, page) || PageCompound(page))
> -            pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
> +            pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
>          local_unlock(&lru_pvecs.lock);
>      }
>  }
> @@ -686,7 +693,7 @@ void deactivate_page(struct page *page)
>          pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
>          get_page(page);
>          if (!pagevec_add(pvec, page) || PageCompound(page))
> -            pagevec_lru_move_fn(pvec, lru_deactivate_fn);
> +            pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
>          local_unlock(&lru_pvecs.lock);
>      }
>  }
> @@ -708,7 +715,7 @@ void mark_page_lazyfree(struct page *page)
>          pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
>          get_page(page);
>          if (!pagevec_add(pvec, page) || PageCompound(page))
> -            pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
> +            pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
>          local_unlock(&lru_pvecs.lock);
>      }
>  }
> @@ -976,7 +983,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
> +    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, true);
>  }

It seems better to open-code a version in lru_add than to add a bool
argument which is true for just one user.

Also, with this new lru protection logic lru_add could be optimized:
it could prepare a list of pages and, under lru_lock, do only the list
splice and counter bump.
Since PageLRU isn't set yet, nobody can touch these pages in the lru.
After that, lru_add could iterate the pages from first to last without
lru_lock to set PageLRU and drop the reference.

So lru_add will do O(1) operations under lru_lock regardless of the
count of pages it added.

Actually, the per-cpu vector for adding could be replaced with per-cpu
lists and/or a per-lruvec atomic slist.
Thus incoming pages will already be in a list structure rather than a page vector.
This allows accumulating more pages and offloading the adding to kswapd or
direct reclaim.

>
> /**
> --
> 1.8.3.1
>
>
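As a rough sketch of Konstantin's splice-then-SetPageLRU idea above: this is not code from the posted series, the helper name pagevec_lru_add_list() is invented here, it assumes the per-lruvec lru_lock this series is introducing, and it glosses over per-zone accounting and batches that span several lruvecs.

static void pagevec_lru_add_list(struct pagevec *pvec, struct lruvec *lruvec,
                                 enum lru_list lru)
{
        LIST_HEAD(pages);
        int i;

        /* Build the batch list outside the lock; PageLRU is still clear. */
        for (i = 0; i < pagevec_count(pvec); i++)
                list_add_tail(&pvec->pages[i]->lru, &pages);

        spin_lock_irq(&lruvec->lru_lock);
        /* O(1) under the lock: one splice plus a counter update. */
        list_splice_tail(&pages, &lruvec->lists[lru]);
        /* lru size accounting (e.g. update_lru_size()) would go here. */
        spin_unlock_irq(&lruvec->lru_lock);

        /*
         * Nobody can isolate these pages yet because PageLRU is not set,
         * so the flag can be set and the pagevec reference dropped
         * without holding lru_lock.
         */
        for (i = 0; i < pagevec_count(pvec); i++) {
                SetPageLRU(pvec->pages[i]);
                put_page(pvec->pages[i]);
        }
        pagevec_reinit(pvec);
}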
On 2020/7/3 5:13 PM, Konstantin Khlebnikov wrote:
>> @@ -976,7 +983,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>>   */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
>> +    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, true);
>>  }
> It seems better to open-code a version in lru_add than to add a bool
> argument which is true for just one user.

Right, I will rewrite this part as you suggest. Thanks!

> Also, with this new lru protection logic lru_add could be optimized:
> it could prepare a list of pages and, under lru_lock, do only the list
> splice and counter bump.
> Since PageLRU isn't set yet, nobody can touch these pages in the lru.
> After that, lru_add could iterate the pages from first to last without
> lru_lock to set PageLRU and drop the reference.
>
> So lru_add will do O(1) operations under lru_lock regardless of the
> count of pages it added.
>
> Actually, the per-cpu vector for adding could be replaced with per-cpu
> lists and/or a per-lruvec atomic slist.
> Thus incoming pages will already be in a list structure rather than a page vector.
> This allows accumulating more pages and offloading the adding to kswapd or
> direct reclaim.

That's a great idea! I guess the new struct we need would be like this?
I'd like to try it. :)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934eda64..d62778c8c184 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -20,7 +20,7 @@
 struct pagevec {
         unsigned char nr;
         bool percpu_pvec_drained;
-        struct page *pages[PAGEVEC_SIZE];
+        struct list_head veclist;
 };
On Sat, Jul 04, 2020 at 07:34:59PM +0800, Alex Shi wrote:
> That's a great idea! I guess the new struct we need would be like this?
> I'd like to try it. :)
>
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index 081d934eda64..d62778c8c184 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -20,7 +20,7 @@
>  struct pagevec {
>          unsigned char nr;
>          bool percpu_pvec_drained;
> -        struct page *pages[PAGEVEC_SIZE];
> +        struct list_head veclist;
>  };

pagevecs are used not just for LRU.  If you want to use a list_head for
LRU then define a new structure.
On 2020/7/4 7:39 PM, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 07:34:59PM +0800, Alex Shi wrote:
>> That's a great idea! I guess the new struct we need would be like this?
>> I'd like to try it. :)
>>
>> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
>> index 081d934eda64..d62778c8c184 100644
>> --- a/include/linux/pagevec.h
>> +++ b/include/linux/pagevec.h
>> @@ -20,7 +20,7 @@
>>  struct pagevec {
>>          unsigned char nr;
>>          bool percpu_pvec_drained;
>> -        struct page *pages[PAGEVEC_SIZE];
>> +        struct list_head veclist;
>>  };
>
> pagevecs are used not just for LRU.  If you want to use a list_head for
> LRU then define a new structure.

Yes, there are many pages that don't use page->lru, like slab etc., so we need
a new struct.

Thanks!
Alex
On Sat, Jul 04, 2020 at 09:12:46PM +0800, Alex Shi wrote:
> On 2020/7/4 7:39 PM, Matthew Wilcox wrote:
> > On Sat, Jul 04, 2020 at 07:34:59PM +0800, Alex Shi wrote:
> >> That's a great idea! I guess the new struct we need would be like this?
> >> I'd like to try it. :)
> >>
> >> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> >> index 081d934eda64..d62778c8c184 100644
> >> --- a/include/linux/pagevec.h
> >> +++ b/include/linux/pagevec.h
> >> @@ -20,7 +20,7 @@
> >>  struct pagevec {
> >>          unsigned char nr;
> >>          bool percpu_pvec_drained;
> >> -        struct page *pages[PAGEVEC_SIZE];
> >> +        struct list_head veclist;
> >>  };
> >
> > pagevecs are used not just for LRU. If you want to use a list_head for
> > LRU then define a new structure.
>
> Yes, there are many pages that don't use page->lru, like slab etc., so we need
> a new struct.

That's not what I mean.  Slab pages aren't on the LRU anyway.

Consider the callers of page_cache_delete_batch().  These use a pagevec
for a non-LRU purpose, and they will be much slower with a list_head than
with an array.
On 2020/7/4 9:33 PM, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 09:12:46PM +0800, Alex Shi wrote:
>> On 2020/7/4 7:39 PM, Matthew Wilcox wrote:
>>> On Sat, Jul 04, 2020 at 07:34:59PM +0800, Alex Shi wrote:
>>>> That's a great idea! I guess the new struct we need would be like this?
>>>> I'd like to try it. :)
>>>>
>>>> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
>>>> index 081d934eda64..d62778c8c184 100644
>>>> --- a/include/linux/pagevec.h
>>>> +++ b/include/linux/pagevec.h
>>>> @@ -20,7 +20,7 @@
>>>>  struct pagevec {
>>>>          unsigned char nr;
>>>>          bool percpu_pvec_drained;
>>>> -        struct page *pages[PAGEVEC_SIZE];
>>>> +        struct list_head veclist;
>>>>  };
>>>
>>> pagevecs are used not just for LRU. If you want to use a list_head for
>>> LRU then define a new structure.
>>
>> Yes, there are many pages that don't use page->lru, like slab etc., so we need
>> a new struct.
>
> That's not what I mean.  Slab pages aren't on the LRU anyway.

Right. I mean, that's the reason we would need a new struct if we changed to a list.

> Consider the callers of page_cache_delete_batch().  These use a pagevec
> for a non-LRU purpose, and they will be much slower with a list_head than
> with an array.

Thanks for the info. If a list is slower than a pagevec, maybe the change isn't
worthwhile: a pagevec can hold any kind of page, anon/file, inactive/active,
while one list fits only one kind of lru, so adding 5 kinds of lists would
increase the complexity. Considering this, I am wondering if it's worth it?

Thanks
Alex
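If the list-based batching were pursued only for the LRU-add path, one way to follow Matthew's advice would be a separate LRU-only type, leaving struct pagevec as an array for callers such as page_cache_delete_batch(). A minimal sketch, purely illustrative and not something posted in this thread (the name lru_add_batch and its fields are invented here):

/*
 * Hypothetical LRU-only batch type; struct pagevec itself stays an
 * array so non-LRU users keep their cache-friendly loop.
 */
struct lru_add_batch {
        unsigned int nr;                /* pages queued so far */
        struct list_head pages;         /* linked through page->lru */
};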
diff --git a/mm/swap.c b/mm/swap.c
index b24d5f69b93a..55eb2c2eed03 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -203,7 +203,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-        void (*move_fn)(struct page *page, struct lruvec *lruvec))
+        void (*move_fn)(struct page *page, struct lruvec *lruvec), bool add)
 {
     int i;
     struct pglist_data *pgdat = NULL;
@@ -221,8 +221,15 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
             spin_lock_irqsave(&pgdat->lru_lock, flags);
         }
 
+        /* new page add to lru or page moving between lru */
+        if (!add && !TestClearPageLRU(page))
+            continue;
+
         lruvec = mem_cgroup_page_lruvec(page, pgdat);
         (*move_fn)(page, lruvec);
+
+        if (!add)
+            SetPageLRU(page);
     }
     if (pgdat)
         spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -259,7 +266,7 @@ void rotate_reclaimable_page(struct page *page)
         local_lock_irqsave(&lru_rotate.lock, flags);
         pvec = this_cpu_ptr(&lru_rotate.pvec);
         if (!pagevec_add(pvec, page) || PageCompound(page))
-            pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
+            pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
         local_unlock_irqrestore(&lru_rotate.lock, flags);
     }
 }
@@ -328,7 +335,7 @@ static void activate_page_drain(int cpu)
     struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
     if (pagevec_count(pvec))
-        pagevec_lru_move_fn(pvec, __activate_page);
+        pagevec_lru_move_fn(pvec, __activate_page, false);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -346,7 +353,7 @@ void activate_page(struct page *page)
         pvec = this_cpu_ptr(&lru_pvecs.activate_page);
         get_page(page);
         if (!pagevec_add(pvec, page) || PageCompound(page))
-            pagevec_lru_move_fn(pvec, __activate_page);
+            pagevec_lru_move_fn(pvec, __activate_page, false);
         local_unlock(&lru_pvecs.lock);
     }
 }
@@ -621,21 +628,21 @@ void lru_add_drain_cpu(int cpu)
 
         /* No harm done if a racing interrupt already did this */
         local_lock_irqsave(&lru_rotate.lock, flags);
-        pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
+        pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
         local_unlock_irqrestore(&lru_rotate.lock, flags);
     }
 
     pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
     if (pagevec_count(pvec))
-        pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
+        pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
 
     pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
     if (pagevec_count(pvec))
-        pagevec_lru_move_fn(pvec, lru_deactivate_fn);
+        pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
 
     pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
     if (pagevec_count(pvec))
-        pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
+        pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
 
     activate_page_drain(cpu);
 }
@@ -664,7 +671,7 @@ void deactivate_file_page(struct page *page)
         pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
         if (!pagevec_add(pvec, page) || PageCompound(page))
-            pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
+            pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
         local_unlock(&lru_pvecs.lock);
     }
 }
@@ -686,7 +693,7 @@ void deactivate_page(struct page *page)
         pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
         get_page(page);
         if (!pagevec_add(pvec, page) || PageCompound(page))
-            pagevec_lru_move_fn(pvec, lru_deactivate_fn);
+            pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
         local_unlock(&lru_pvecs.lock);
     }
 }
@@ -708,7 +715,7 @@ void mark_page_lazyfree(struct page *page)
         pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
         get_page(page);
         if (!pagevec_add(pvec, page) || PageCompound(page))
-            pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
+            pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
         local_unlock(&lru_pvecs.lock);
     }
 }
@@ -976,7 +983,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
+    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, true);
 }
 
 /**
Hugh Dickins found a memcg change bug in the original version:
If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have
to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
possible bad scenario would look like:

        cpu 0                           cpu 1
    lruvec = mem_cgroup_page_lruvec()
                                        if (!isolate_lru_page())
                                                mem_cgroup_move_account

    spin_lock_irqsave(&lruvec->lru_lock  <== wrong lock.

So we need the ClearPageLRU to block isolate_lru_page(), then serialize
the memcg change here.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)
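For readers following the race above, the per-page core of the modified pagevec_lru_move_fn() can be pictured as the stand-alone helper below. This is only an illustrative paraphrase of the diff in this patch, not an additional interface it adds; the helper name is invented here and the caller is assumed to hold the appropriate lru_lock.

static bool lru_move_page_serialized(struct page *page,
                                     struct pglist_data *pgdat,
                                     void (*move_fn)(struct page *, struct lruvec *))
{
        struct lruvec *lruvec;

        /*
         * Already clear?  The page has been isolated, e.g. on the
         * isolate_lru_page() + mem_cgroup_move_account() path, so
         * leave it alone.
         */
        if (!TestClearPageLRU(page))
                return false;

        /*
         * With PageLRU clear, isolate_lru_page() must fail, so the
         * page cannot be moved to another memcg while we look up and
         * use its lruvec.
         */
        lruvec = mem_cgroup_page_lruvec(page, pgdat);
        move_fn(page, lruvec);

        SetPageLRU(page);
        return true;
}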