diff mbox series

[RFC,1/2] mm: disable LRU pagevec during the migration temporarily

Message ID 20210216170348.1513483-1-minchan@kernel.org (mailing list archive)
State New
Headers show
Series [RFC,1/2] mm: disable LRU pagevec during the migration temporarily | expand

Commit Message

Minchan Kim Feb. 16, 2021, 5:03 p.m. UTC
LRU pagevec holds refcount of pages until the pagevec are drained.
It could prevent migration since the refcount of the page is greater
than the expection in migration logic. To mitigate the issue,
callers of migrate_pages drains LRU pagevec via migrate_prep or
lru_add_drain_all before migrate_pages call.

However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.

The other concern is migration keeps retrying until pages in pagevec
are drained. During the time, migration repeatedly allocates target
page, unmap source page from page table of processes and then get to
know the failure, restore the original page to pagetable of processes,
free target page, which is also not good.

To solve the issue, this patch tries to close the race rather than
relying on retrial and luck. The idea is to introduce
migration-in-progress tracking count with introducing IPI barrier
after atomic updating the count to minimize read-side overhead.

The migrate_prep increases migrate_pending_count under the lock
and IPI call to guarantee every CPU see the uptodate value
of migrate_pending_count. Then, drain pagevec via lru_add_drain_all.
From now on, no LRU pages could reach pagevec since LRU handling
functions skips the batching if migration is in progress with checking
migrate_pedning(IOW, pagevec should be empty until migration is done).
Every migrate_prep's caller should call migrate_finish in pair to
decrease the migration tracking count.

With the migrate_pending, vulnerable places to make migration failure
could catch migration-in-progress and make their plan to help the
migration(e.g., bh_lru_install[1]) in future.

[1] https://lore.kernel.org/linux-mm/c083b0ab6e410e33ca880d639f90ef4f6f3b33ff.1613020616.git.cgoldswo@codeaurora.org/

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/migrate.h |  3 +++
 mm/mempolicy.c          |  6 +++++
 mm/migrate.c            | 55 ++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c         |  3 +++
 mm/swap.c               | 24 +++++++++++++-----
 5 files changed, 82 insertions(+), 9 deletions(-)

Comments

Matthew Wilcox Feb. 16, 2021, 6:22 p.m. UTC | #1
On Tue, Feb 16, 2021 at 09:03:47AM -0800, Minchan Kim wrote:
> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expection in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
> 
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at next trail
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagvec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.

Have you been able to gather any numbers on this?  eg does migration
now succeed 5% more often?
Minchan Kim Feb. 16, 2021, 9:30 p.m. UTC | #2
On Tue, Feb 16, 2021 at 06:22:42PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 16, 2021 at 09:03:47AM -0800, Minchan Kim wrote:
> > LRU pagevec holds refcount of pages until the pagevec are drained.
> > It could prevent migration since the refcount of the page is greater
> > than the expection in migration logic. To mitigate the issue,
> > callers of migrate_pages drains LRU pagevec via migrate_prep or
> > lru_add_drain_all before migrate_pages call.
> > 
> > However, it's not enough because pages coming into pagevec after the
> > draining call still could stay at the pagevec so it could keep
> > preventing page migration. Since some callers of migrate_pages have
> > retrial logic with LRU draining, the page would migrate at next trail
> > but it is still fragile in that it doesn't close the fundamental race
> > between upcoming LRU pages into pagvec and migration so the migration
> > failure could cause contiguous memory allocation failure in the end.
> 
> Have you been able to gather any numbers on this?  eg does migration
> now succeed 5% more often?

What I measured was how many times migrate_pages retried with force mode
below debug code.
The test was android apps launching with cma allocation in background.
Total cma allocation count was about 500 during the entire testing 
and have seen about 400 retrial with below debug code.
With this patchset(with bug fix), the retrial count was reduced under 30.

What I measured was how many times the migrate_pages 
diff --git a/mm/migrate.c b/mm/migrate.c
index 04a98bb2f568..caa661be2d16 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1459,6 +1459,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
                                                private, page, pass > 2, mode,
                                                reason);

+                       if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
+                               printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
+                               dump_page(page, "fail to migrate");
+                       }
+
Michal Hocko Feb. 17, 2021, 8:59 a.m. UTC | #3
On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expection in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
> 
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at next trail
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagvec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.

Please put some numbers on how often this happens here.

> The other concern is migration keeps retrying until pages in pagevec
> are drained. During the time, migration repeatedly allocates target
> page, unmap source page from page table of processes and then get to
> know the failure, restore the original page to pagetable of processes,
> free target page, which is also not good.

This is not good for performance you mean, rigth?
 
> To solve the issue, this patch tries to close the race rather than
> relying on retrial and luck. The idea is to introduce
> migration-in-progress tracking count with introducing IPI barrier
> after atomic updating the count to minimize read-side overhead.
>
> The migrate_prep increases migrate_pending_count under the lock
> and IPI call to guarantee every CPU see the uptodate value
> of migrate_pending_count. Then, drain pagevec via lru_add_drain_all.
> >From now on, no LRU pages could reach pagevec since LRU handling
> functions skips the batching if migration is in progress with checking
> migrate_pedning(IOW, pagevec should be empty until migration is done).
> Every migrate_prep's caller should call migrate_finish in pair to
> decrease the migration tracking count.

migrate_prep already does schedule draining on each cpu which has pages
queued. Why isn't it enough to disable pcp lru caches right before
draining in migrate_prep? More on IPI side below
 
[...]
> +static DEFINE_SPINLOCK(migrate_pending_lock);
> +static unsigned long migrate_pending_count;
> +static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
> +
> +static void read_migrate_pending(struct work_struct *work)
> +{
> +	/* TODO : not sure it's needed */
> +	unsigned long dummy = __READ_ONCE(migrate_pending_count);
> +	(void)dummy;

What are you trying to achieve here? Are you just trying to enforce read
memory barrier here?

> +}
> +
> +bool migrate_pending(void)
> +{
> +	return migrate_pending_count;
> +}
> +
>  /*
>   * migrate_prep() needs to be called before we start compiling a list of pages
>   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> @@ -64,11 +80,27 @@
>   */
>  void migrate_prep(void)
>  {
> +	unsigned int cpu;
> +
> +	spin_lock(&migrate_pending_lock);
> +	migrate_pending_count++;
> +	spin_unlock(&migrate_pending_lock);

I suspect you do not want to add atomic_read inside hot paths, right? Is
this really something that we have to microoptimize for? atomic_read is
a simple READ_ONCE on many archs.

> +
> +	for_each_online_cpu(cpu) {
> +		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
> +
> +		INIT_WORK(work, read_migrate_pending);
> +		queue_work_on(cpu, mm_percpu_wq, work);
> +	}
> +
> +	for_each_online_cpu(cpu)
> +		flush_work(&per_cpu(migrate_pending_work, cpu));

I also do not follow this scheme. Where is the IPI you are mentioning
above?

> +	/*
> +	 * From now on, every online cpu will see uptodate
> +	 * migarte_pending_work.
> +	 */
>  	/*
>  	 * Clear the LRU lists so pages can be isolated.
> -	 * Note that pages may be moved off the LRU after we have
> -	 * drained them. Those pages will fail to migrate like other
> -	 * pages that may be busy.
>  	 */
>  	lru_add_drain_all();

Overall, this looks rather heavy weight to my taste. Have you tried to
play with a simple atomic counter approach? atomic_read when adding to
the cache and atomic_inc inside migrate_prep followed by lrdu_add_drain.
Michal Hocko Feb. 17, 2021, 9:50 a.m. UTC | #4
On Wed 17-02-21 09:59:54, Michal Hocko wrote:
> On Tue 16-02-21 09:03:47, Minchan Kim wrote:
[...]
> >  /*
> >   * migrate_prep() needs to be called before we start compiling a list of pages
> >   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > @@ -64,11 +80,27 @@
> >   */
> >  void migrate_prep(void)
> >  {
> > +	unsigned int cpu;
> > +
> > +	spin_lock(&migrate_pending_lock);
> > +	migrate_pending_count++;
> > +	spin_unlock(&migrate_pending_lock);
> 
> I suspect you do not want to add atomic_read inside hot paths, right? Is
> this really something that we have to microoptimize for? atomic_read is
> a simple READ_ONCE on many archs.

Or you rather wanted to prevent from read memory barrier to enfore the
ordering.

> > +
> > +	for_each_online_cpu(cpu) {
> > +		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
> > +
> > +		INIT_WORK(work, read_migrate_pending);
> > +		queue_work_on(cpu, mm_percpu_wq, work);
> > +	}
> > +
> > +	for_each_online_cpu(cpu)
> > +		flush_work(&per_cpu(migrate_pending_work, cpu));
> 
> I also do not follow this scheme. Where is the IPI you are mentioning
> above?

Thinking about it some more I think you mean the rescheduling IPI here?

> > +	/*
> > +	 * From now on, every online cpu will see uptodate
> > +	 * migarte_pending_work.
> > +	 */
> >  	/*
> >  	 * Clear the LRU lists so pages can be isolated.
> > -	 * Note that pages may be moved off the LRU after we have
> > -	 * drained them. Those pages will fail to migrate like other
> > -	 * pages that may be busy.
> >  	 */
> >  	lru_add_drain_all();
> 
> Overall, this looks rather heavy weight to my taste. Have you tried to
> play with a simple atomic counter approach? atomic_read when adding to
> the cache and atomic_inc inside migrate_prep followed by lrdu_add_drain.

If you really want a strong ordering then it should be sufficient to
simply alter lru_add_drain_all to force draining all CPUs. This will
make sure no new pages are added to the pcp lists and you will also sync
up anything that has accumulated because of a race between atomic_read
and inc:
diff --git a/mm/swap.c b/mm/swap.c
index 2cca7141470c..91600d7bb7a8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -745,7 +745,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
  * Calling this function with cpu hotplug locks held can actually lead
  * to obscure indirect dependencies via WQ context.
  */
-void lru_add_drain_all(void)
+void lru_add_drain_all(bool force_all_cpus)
 {
 	/*
 	 * lru_drain_gen - Global pages generation number
@@ -820,7 +820,8 @@ void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+		if (force_all_cpus ||
+		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
 		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
Minchan Kim Feb. 17, 2021, 8:46 p.m. UTC | #5
On Wed, Feb 17, 2021 at 09:59:54AM +0100, Michal Hocko wrote:
> On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> > LRU pagevec holds refcount of pages until the pagevec are drained.
> > It could prevent migration since the refcount of the page is greater
> > than the expection in migration logic. To mitigate the issue,
> > callers of migrate_pages drains LRU pagevec via migrate_prep or
> > lru_add_drain_all before migrate_pages call.
> > 
> > However, it's not enough because pages coming into pagevec after the
> > draining call still could stay at the pagevec so it could keep
> > preventing page migration. Since some callers of migrate_pages have
> > retrial logic with LRU draining, the page would migrate at next trail
> > but it is still fragile in that it doesn't close the fundamental race
> > between upcoming LRU pages into pagvec and migration so the migration
> > failure could cause contiguous memory allocation failure in the end.
> 
> Please put some numbers on how often this happens here.

Actually, it's hard to get real number of cma allocation failure
since we didn't introduce any cma allocation statistics.
I am working on it to make the failure more visible.
https://lore.kernel.org/linux-mm/20210217170025.512704-1-minchan@kernel.org/

Having said, the cma allocation failure is one of the most common bug
reporting from field test I got periodically. I couldn't say they all
comes from pagevec issue since it's *really hard* to get who was the long
time holder of additional refcount at the moment. However, I kept tracking
down pinner from drivers and resolved them but the problem is still
reported and saw pagevec is one of common place to hold additional
refcount. 

I think data below is not exact one you want to see but that's the
way how I could see how this patch effectives reduce vulnerability.

Quote from Matthew quesiton
````
What I measured was how many times migrate_pages retried with force mode
below debug code.
The test was android apps launching with cma allocation in background.
Total cma allocation count was about 500 during the entire testing
and have seen about 400 retrial with below debug code.
With this patchset(with bug fix), the retrial count was reduced under 30.

What I measured was how many times the migrate_pages
diff --git a/mm/migrate.c b/mm/migrate.c
index 04a98bb2f568..caa661be2d16 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1459,6 +1459,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
                                                private, page, pass > 2, mode,
                                                reason);

+                       if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
+                               printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
+                               dump_page(page, "fail to migrate");
+                       }
+

```

IMO, rather than relying on retrial, it would be better if we close
such race from the beginning like free memory isolation logic in
start_isolate_page_range. Furthmore, migrate_pending function
will provide a way to detect migration-in-progress other vulnerable
places in future.

> 
> > The other concern is migration keeps retrying until pages in pagevec
> > are drained. During the time, migration repeatedly allocates target
> > page, unmap source page from page table of processes and then get to
> > know the failure, restore the original page to pagetable of processes,
> > free target page, which is also not good.
> 
> This is not good for performance you mean, rigth?

It's not good for allocation latency as well as performance but
it's not a top reason for this work. Goal is mostly focus on
closing the race to reduce pontential sites to migration failure
for CMA.

>  
> > To solve the issue, this patch tries to close the race rather than
> > relying on retrial and luck. The idea is to introduce
> > migration-in-progress tracking count with introducing IPI barrier
> > after atomic updating the count to minimize read-side overhead.
> >
> > The migrate_prep increases migrate_pending_count under the lock
> > and IPI call to guarantee every CPU see the uptodate value
> > of migrate_pending_count. Then, drain pagevec via lru_add_drain_all.
> > >From now on, no LRU pages could reach pagevec since LRU handling
> > functions skips the batching if migration is in progress with checking
> > migrate_pedning(IOW, pagevec should be empty until migration is done).
> > Every migrate_prep's caller should call migrate_finish in pair to
> > decrease the migration tracking count.
> 
> migrate_prep already does schedule draining on each cpu which has pages
> queued. Why isn't it enough to disable pcp lru caches right before
> draining in migrate_prep? More on IPI side below

Yub, with simple tweak of lru_add_drain_all(bool disable) and queue
it for every CPU, it would work.

>  
> [...]
> > +static DEFINE_SPINLOCK(migrate_pending_lock);
> > +static unsigned long migrate_pending_count;
> > +static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
> > +
> > +static void read_migrate_pending(struct work_struct *work)
> > +{
> > +	/* TODO : not sure it's needed */
> > +	unsigned long dummy = __READ_ONCE(migrate_pending_count);
> > +	(void)dummy;
> 
> What are you trying to achieve here? Are you just trying to enforce read
> memory barrier here?

I though IPI by migrate_pending_work will guarantee the memory ordering
so it wouldn't be needed but wanted to get attention from reviewer.
If it's not needed, I will drop it.

> 
> > +}
> > +
> > +bool migrate_pending(void)
> > +{
> > +	return migrate_pending_count;
> > +}
> > +
> >  /*
> >   * migrate_prep() needs to be called before we start compiling a list of pages
> >   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > @@ -64,11 +80,27 @@
> >   */
> >  void migrate_prep(void)
> >  {
> > +	unsigned int cpu;
> > +
> > +	spin_lock(&migrate_pending_lock);
> > +	migrate_pending_count++;
> > +	spin_unlock(&migrate_pending_lock);
> 
> I suspect you do not want to add atomic_read inside hot paths, right? Is
> this really something that we have to microoptimize for? atomic_read is
> a simple READ_ONCE on many archs.

It's also spin_lock_irq_save in some arch. If the new synchonization is
heavily compilcated, atomic would be better for simple start but I thought
this locking scheme is too simple so no need to add atomic operation in
readside.
Minchan Kim Feb. 17, 2021, 8:51 p.m. UTC | #6
On Wed, Feb 17, 2021 at 10:50:55AM +0100, Michal Hocko wrote:
> On Wed 17-02-21 09:59:54, Michal Hocko wrote:
> > On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> [...]
> > >  /*
> > >   * migrate_prep() needs to be called before we start compiling a list of pages
> > >   * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > > @@ -64,11 +80,27 @@
> > >   */
> > >  void migrate_prep(void)
> > >  {
> > > +	unsigned int cpu;
> > > +
> > > +	spin_lock(&migrate_pending_lock);
> > > +	migrate_pending_count++;
> > > +	spin_unlock(&migrate_pending_lock);
> > 
> > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > this really something that we have to microoptimize for? atomic_read is
> > a simple READ_ONCE on many archs.
> 
> Or you rather wanted to prevent from read memory barrier to enfore the
> ordering.

Yub.

> 
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
> > > +
> > > +		INIT_WORK(work, read_migrate_pending);
> > > +		queue_work_on(cpu, mm_percpu_wq, work);
> > > +	}
> > > +
> > > +	for_each_online_cpu(cpu)
> > > +		flush_work(&per_cpu(migrate_pending_work, cpu));
> > 
> > I also do not follow this scheme. Where is the IPI you are mentioning
> > above?
> 
> Thinking about it some more I think you mean the rescheduling IPI here?

True.

> 
> > > +	/*
> > > +	 * From now on, every online cpu will see uptodate
> > > +	 * migarte_pending_work.
> > > +	 */
> > >  	/*
> > >  	 * Clear the LRU lists so pages can be isolated.
> > > -	 * Note that pages may be moved off the LRU after we have
> > > -	 * drained them. Those pages will fail to migrate like other
> > > -	 * pages that may be busy.
> > >  	 */
> > >  	lru_add_drain_all();
> > 
> > Overall, this looks rather heavy weight to my taste. Have you tried to
> > play with a simple atomic counter approach? atomic_read when adding to
> > the cache and atomic_inc inside migrate_prep followed by lrdu_add_drain.

I'd like to avoid atomic operation if we could.

> 
> If you really want a strong ordering then it should be sufficient to
> simply alter lru_add_drain_all to force draining all CPUs. This will
> make sure no new pages are added to the pcp lists and you will also sync
> up anything that has accumulated because of a race between atomic_read
> and inc:
> diff --git a/mm/swap.c b/mm/swap.c
> index 2cca7141470c..91600d7bb7a8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -745,7 +745,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   * Calling this function with cpu hotplug locks held can actually lead
>   * to obscure indirect dependencies via WQ context.
>   */
> -void lru_add_drain_all(void)
> +void lru_add_drain_all(bool force_all_cpus)
>  {
>  	/*
>  	 * lru_drain_gen - Global pages generation number
> @@ -820,7 +820,8 @@ void lru_add_drain_all(void)
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  
> -		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> +		if (force_all_cpus ||
> +		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>  		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||

Yub, that's a idea.
How about this?

diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8aaeccd..2531642dd9ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,14 @@
 
 #include "internal.h"
 
+static DEFINE_SPINLOCK(migrate_pending_lock);
+static unsigned long migrate_pending_count;
+
+bool migrate_pending(void)
+{
+	return migrate_pending_count;
+}
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -64,13 +72,20 @@
  */
 void migrate_prep(void)
 {
+	unsigned int cpu;
+
+	/*
+	 * lru_add_drain_all's IPI will make sure no new pages are added
+	 * to the pcp lists and drain them all.
+	 */
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count++;
+	spin_unlock(&migrate_pending_lock);
+
 	/*
 	 * Clear the LRU lists so pages can be isolated.
-	 * Note that pages may be moved off the LRU after we have
-	 * drained them. Those pages will fail to migrate like other
-	 * pages that may be busy.
 	 */
-	lru_add_drain_all();
+	lru_add_drain_all(true);
 }
 
 /* Do the necessary work of migrate_prep but not if it involves other CPUs */
@@ -79,6 +94,15 @@ void migrate_prep_local(void)
 	lru_add_drain();
 }
 
+void migrate_finish(void)
+{
+	int cpu;
+
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count--;
+	spin_unlock(&migrate_pending_lock);
+}

A odd here is there are no barrier for migrate_finish for
updating migrate_pending_count so other CPUs will see
stale value until they encounters the barrier by chance.
However, it wouldn't be a big deal, IMHO.

What do you think?
Matthew Wilcox Feb. 17, 2021, 9:11 p.m. UTC | #7
On Wed, Feb 17, 2021 at 12:51:25PM -0800, Minchan Kim wrote:
> I'd like to avoid atomic operation if we could.

Why do you think that the spinlock is better?
Matthew Wilcox Feb. 17, 2021, 9:16 p.m. UTC | #8
On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
> > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > this really something that we have to microoptimize for? atomic_read is
> > a simple READ_ONCE on many archs.
> 
> It's also spin_lock_irq_save in some arch. If the new synchonization is
> heavily compilcated, atomic would be better for simple start but I thought
> this locking scheme is too simple so no need to add atomic operation in
> readside.

What arch uses a spinlock for atomic_read()?  I just had a quick grep and
didn't see any.
Minchan Kim Feb. 17, 2021, 9:22 p.m. UTC | #9
On Wed, Feb 17, 2021 at 09:11:43PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 17, 2021 at 12:51:25PM -0800, Minchan Kim wrote:
> > I'd like to avoid atomic operation if we could.
> 
> Why do you think that the spinlock is better?
> 

Michal suggested atomic_inc in writeside and atomic_read in readside.
I wanted to avoid atomic operation in readside path.

You are suggesting atomic for writeside and just read(not atomic_read)
in readside?
Minchan Kim Feb. 17, 2021, 9:32 p.m. UTC | #10
On Wed, Feb 17, 2021 at 09:16:12PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
> > > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > > this really something that we have to microoptimize for? atomic_read is
> > > a simple READ_ONCE on many archs.
> > 
> > It's also spin_lock_irq_save in some arch. If the new synchonization is
> > heavily compilcated, atomic would be better for simple start but I thought
> > this locking scheme is too simple so no need to add atomic operation in
> > readside.
> 
> What arch uses a spinlock for atomic_read()?  I just had a quick grep and
> didn't see any.

Ah, my bad. I was confused with update side.
Okay, let's use atomic op to make it simple.

Thanks for the review, Mattew and Michal.
Michal Hocko Feb. 18, 2021, 8:17 a.m. UTC | #11
On Wed 17-02-21 13:32:05, Minchan Kim wrote:
> On Wed, Feb 17, 2021 at 09:16:12PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
> > > > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > > > this really something that we have to microoptimize for? atomic_read is
> > > > a simple READ_ONCE on many archs.
> > > 
> > > It's also spin_lock_irq_save in some arch. If the new synchonization is
> > > heavily compilcated, atomic would be better for simple start but I thought
> > > this locking scheme is too simple so no need to add atomic operation in
> > > readside.
> > 
> > What arch uses a spinlock for atomic_read()?  I just had a quick grep and
> > didn't see any.
> 
> Ah, my bad. I was confused with update side.
> Okay, let's use atomic op to make it simple.

Thanks. This should make the code much more simple. Before you send
another version for the review I have another thing to consider. You are
kind of wiring this into the migration code but control over lru pcp
caches can be used in other paths as well. Memory offlining would be
another user. We already disable page allocator pcp caches to prevent
regular draining. We could do the same with lru pcp caches.
David Hildenbrand Feb. 18, 2021, 8:24 a.m. UTC | #12
On 18.02.21 09:17, Michal Hocko wrote:
> On Wed 17-02-21 13:32:05, Minchan Kim wrote:
>> On Wed, Feb 17, 2021 at 09:16:12PM +0000, Matthew Wilcox wrote:
>>> On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
>>>>> I suspect you do not want to add atomic_read inside hot paths, right? Is
>>>>> this really something that we have to microoptimize for? atomic_read is
>>>>> a simple READ_ONCE on many archs.
>>>>
>>>> It's also spin_lock_irq_save in some arch. If the new synchonization is
>>>> heavily compilcated, atomic would be better for simple start but I thought
>>>> this locking scheme is too simple so no need to add atomic operation in
>>>> readside.
>>>
>>> What arch uses a spinlock for atomic_read()?  I just had a quick grep and
>>> didn't see any.
>>
>> Ah, my bad. I was confused with update side.
>> Okay, let's use atomic op to make it simple.
> 
> Thanks. This should make the code much more simple. Before you send
> another version for the review I have another thing to consider. You are
> kind of wiring this into the migration code but control over lru pcp
> caches can be used in other paths as well. Memory offlining would be
> another user. We already disable page allocator pcp caches to prevent
> regular draining. We could do the same with lru pcp caches.
> 

Agreed. And dealing with PCP more reliably might also be of interest in 
context of more reliable alloc_contig_range().
Minchan Kim Feb. 18, 2021, 3:52 p.m. UTC | #13
On Thu, Feb 18, 2021 at 09:17:02AM +0100, Michal Hocko wrote:
> On Wed 17-02-21 13:32:05, Minchan Kim wrote:
> > On Wed, Feb 17, 2021 at 09:16:12PM +0000, Matthew Wilcox wrote:
> > > On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
> > > > > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > > > > this really something that we have to microoptimize for? atomic_read is
> > > > > a simple READ_ONCE on many archs.
> > > > 
> > > > It's also spin_lock_irq_save in some arch. If the new synchonization is
> > > > heavily compilcated, atomic would be better for simple start but I thought
> > > > this locking scheme is too simple so no need to add atomic operation in
> > > > readside.
> > > 
> > > What arch uses a spinlock for atomic_read()?  I just had a quick grep and
> > > didn't see any.
> > 
> > Ah, my bad. I was confused with update side.
> > Okay, let's use atomic op to make it simple.
> 
> Thanks. This should make the code much more simple. Before you send
> another version for the review I have another thing to consider. You are
> kind of wiring this into the migration code but control over lru pcp
> caches can be used in other paths as well. Memory offlining would be
> another user. We already disable page allocator pcp caches to prevent
> regular draining. We could do the same with lru pcp caches.

I didn't catch your point here. If memory offlining is interested on
disabling lru pcp, it could call migrate_prep and migrate_finish
like other places. Are you suggesting this one?

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a969463bdda4..0ec1c13bfe32 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1425,8 +1425,12 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
                node_clear(mtc.nid, nmask);
                if (nodes_empty(nmask))
                        node_set(mtc.nid, nmask);
+
+               migrate_prep();
                ret = migrate_pages(&source, alloc_migration_target, NULL,
                        (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
+
+               migrate_finish();
                if (ret) {
                        list_for_each_entry(page, &source, lru) {
                                pr_warn("migrating pfn %lx failed ret:%d ",
Michal Hocko Feb. 18, 2021, 4:08 p.m. UTC | #14
On Thu 18-02-21 07:52:25, Minchan Kim wrote:
> On Thu, Feb 18, 2021 at 09:17:02AM +0100, Michal Hocko wrote:
> > On Wed 17-02-21 13:32:05, Minchan Kim wrote:
> > > On Wed, Feb 17, 2021 at 09:16:12PM +0000, Matthew Wilcox wrote:
> > > > On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
> > > > > > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > > > > > this really something that we have to microoptimize for? atomic_read is
> > > > > > a simple READ_ONCE on many archs.
> > > > > 
> > > > > It's also spin_lock_irq_save in some arch. If the new synchonization is
> > > > > heavily compilcated, atomic would be better for simple start but I thought
> > > > > this locking scheme is too simple so no need to add atomic operation in
> > > > > readside.
> > > > 
> > > > What arch uses a spinlock for atomic_read()?  I just had a quick grep and
> > > > didn't see any.
> > > 
> > > Ah, my bad. I was confused with update side.
> > > Okay, let's use atomic op to make it simple.
> > 
> > Thanks. This should make the code much more simple. Before you send
> > another version for the review I have another thing to consider. You are
> > kind of wiring this into the migration code but control over lru pcp
> > caches can be used in other paths as well. Memory offlining would be
> > another user. We already disable page allocator pcp caches to prevent
> > regular draining. We could do the same with lru pcp caches.
> 
> I didn't catch your point here. If memory offlining is interested on
> disabling lru pcp, it could call migrate_prep and migrate_finish
> like other places. Are you suggesting this one?

What I meant to say is that you can have a look at this not as an
integral part of the migration code but rather a common functionality
that migration and others can use. So instead of an implicit part of
migrate_prep this would become disable_lru_cache and migrate_finish
would become lruc_cache_enable. See my point? 

An advantage of that would be that this would match the pcp page
allocator disabling and we could have it in place for the whole
operation to make the page state more stable wrt. LRU state (PageLRU).
 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a969463bdda4..0ec1c13bfe32 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1425,8 +1425,12 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                 node_clear(mtc.nid, nmask);
>                 if (nodes_empty(nmask))
>                         node_set(mtc.nid, nmask);
> +
> +               migrate_prep();
>                 ret = migrate_pages(&source, alloc_migration_target, NULL,
>                         (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> +
> +               migrate_finish();
>                 if (ret) {
>                         list_for_each_entry(page, &source, lru) {
>                                 pr_warn("migrating pfn %lx failed ret:%d ",
>
Minchan Kim Feb. 18, 2021, 4:21 p.m. UTC | #15
On Thu, Feb 18, 2021 at 05:08:58PM +0100, Michal Hocko wrote:
> On Thu 18-02-21 07:52:25, Minchan Kim wrote:
> > On Thu, Feb 18, 2021 at 09:17:02AM +0100, Michal Hocko wrote:
> > > On Wed 17-02-21 13:32:05, Minchan Kim wrote:
> > > > On Wed, Feb 17, 2021 at 09:16:12PM +0000, Matthew Wilcox wrote:
> > > > > On Wed, Feb 17, 2021 at 12:46:19PM -0800, Minchan Kim wrote:
> > > > > > > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > > > > > > this really something that we have to microoptimize for? atomic_read is
> > > > > > > a simple READ_ONCE on many archs.
> > > > > > 
> > > > > > It's also spin_lock_irq_save in some arch. If the new synchonization is
> > > > > > heavily compilcated, atomic would be better for simple start but I thought
> > > > > > this locking scheme is too simple so no need to add atomic operation in
> > > > > > readside.
> > > > > 
> > > > > What arch uses a spinlock for atomic_read()?  I just had a quick grep and
> > > > > didn't see any.
> > > > 
> > > > Ah, my bad. I was confused with update side.
> > > > Okay, let's use atomic op to make it simple.
> > > 
> > > Thanks. This should make the code much more simple. Before you send
> > > another version for the review I have another thing to consider. You are
> > > kind of wiring this into the migration code but control over lru pcp
> > > caches can be used in other paths as well. Memory offlining would be
> > > another user. We already disable page allocator pcp caches to prevent
> > > regular draining. We could do the same with lru pcp caches.
> > 
> > I didn't catch your point here. If memory offlining is interested on
> > disabling lru pcp, it could call migrate_prep and migrate_finish
> > like other places. Are you suggesting this one?
> 
> What I meant to say is that you can have a look at this not as an
> integral part of the migration code but rather a common functionality
> that migration and others can use. So instead of an implicit part of
> migrate_prep this would become disable_lru_cache and migrate_finish
> would become lruc_cache_enable. See my point? 
> 
> An advantage of that would be that this would match the pcp page
> allocator disabling and we could have it in place for the whole
> operation to make the page state more stable wrt. LRU state (PageLRU).

Understood. Thanks for the clarification.
diff mbox series

Patch

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..047d5358fe0d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -46,6 +46,8 @@  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
 extern void migrate_prep(void);
+extern void migrate_finish(void);
+extern bool migrate_pending(void);
 extern void migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
@@ -67,6 +69,7 @@  static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	{ return -EBUSY; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline void migrate_finish(void) {}
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
 static inline void migrate_page_states(struct page *newpage, struct page *page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6961238c7ef5..46d9986c7bf0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1208,6 +1208,8 @@  int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 			break;
 	}
 	mmap_read_unlock(mm);
+	migrate_finish();
+
 	if (err < 0)
 		return err;
 	return busy;
@@ -1371,6 +1373,10 @@  static long do_mbind(unsigned long start, unsigned long len,
 	mmap_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
+
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_finish();
+
 	return err;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8aaeccd..d70e113eee04 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,22 @@ 
 
 #include "internal.h"
 
+static DEFINE_SPINLOCK(migrate_pending_lock);
+static unsigned long migrate_pending_count;
+static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
+
+static void read_migrate_pending(struct work_struct *work)
+{
+	/* TODO : not sure it's needed */
+	unsigned long dummy = __READ_ONCE(migrate_pending_count);
+	(void)dummy;
+}
+
+bool migrate_pending(void)
+{
+	return migrate_pending_count;
+}
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -64,11 +80,27 @@ 
  */
 void migrate_prep(void)
 {
+	unsigned int cpu;
+
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count++;
+	spin_unlock(&migrate_pending_lock);
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
+
+		INIT_WORK(work, read_migrate_pending);
+		queue_work_on(cpu, mm_percpu_wq, work);
+	}
+
+	for_each_online_cpu(cpu)
+		flush_work(&per_cpu(migrate_pending_work, cpu));
+	/*
+	 * From now on, every online cpu will see uptodate
+	 * migarte_pending_work.
+	 */
 	/*
 	 * Clear the LRU lists so pages can be isolated.
-	 * Note that pages may be moved off the LRU after we have
-	 * drained them. Those pages will fail to migrate like other
-	 * pages that may be busy.
 	 */
 	lru_add_drain_all();
 }
@@ -79,6 +111,22 @@  void migrate_prep_local(void)
 	lru_add_drain();
 }
 
+void migrate_finish(void)
+{
+	int cpu;
+
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count--;
+	spin_unlock(&migrate_pending_lock);
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
+
+		INIT_WORK(work, read_migrate_pending);
+		queue_work_on(cpu, mm_percpu_wq, work);
+	}
+}
+
 int isolate_movable_page(struct page *page, isolate_mode_t mode)
 {
 	struct address_space *mapping;
@@ -1837,6 +1885,7 @@  static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 	if (err >= 0)
 		err = err1;
 out:
+	migrate_finish();
 	return err;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778cbc6b..e4cb959f64dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8493,6 +8493,9 @@  static int __alloc_contig_migrate_range(struct compact_control *cc,
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
 	}
+
+	migrate_finish();
+
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
 		return ret;
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..e42c4b4bf2b3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
+#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -235,6 +236,17 @@  static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 	}
 }
 
+/* return true if pagevec needs flush */
+static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
+{
+	bool ret = false;
+
+	if (!pagevec_add(pvec, page) || PageCompound(page) || migrate_pending())
+		ret = true;
+
+	return ret;
+}
+
 /*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
@@ -252,7 +264,7 @@  void rotate_reclaimable_page(struct page *page)
 		get_page(page);
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
@@ -343,7 +355,7 @@  static void activate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -458,7 +470,7 @@  void lru_cache_add(struct page *page)
 	get_page(page);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
-	if (!pagevec_add(pvec, page) || PageCompound(page))
+	if (pagevec_add_and_need_flush(pvec, page))
 		__pagevec_lru_add(pvec);
 	local_unlock(&lru_pvecs.lock);
 }
@@ -654,7 +666,7 @@  void deactivate_file_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -676,7 +688,7 @@  void deactivate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -698,7 +710,7 @@  void mark_page_lazyfree(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}