
[v3,2/3] mm: disable LRU pagevec during the migration temporarily

Message ID 20210310161429.399432-2-minchan@kernel.org (mailing list archive)
State New, archived
Headers show
Series [v3,1/3] mm: replace migrate_prep with lru_add_drain_all | expand

Commit Message

Minchan Kim March 10, 2021, 4:14 p.m. UTC
An LRU pagevec holds a refcount on its pages until the pagevec is drained.
This can prevent migration, since the refcount of such a page is greater
than the expectation of the migration logic. To mitigate the issue,
callers of migrate_pages() drain the LRU pagevecs via migrate_prep() or
lru_add_drain_all() before calling migrate_pages().

However, that is not enough: pages that enter a pagevec after the
draining call can still sit there and keep preventing page migration.
Since some callers of migrate_pages() have retry logic with LRU
draining, the page would migrate on the next trial, but this is still
fragile in that it doesn't close the fundamental race between pages
entering a pagevec and the migration, so the migration failure can
ultimately cause a contiguous memory allocation failure.

To close the race, this patch disables the LRU caches (i.e., pagevecs)
while migration is ongoing, until it is done.
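The mechanism can be sketched as a small user-space model (illustrative only; the names mirror the kernel's lru_cache_disable()/lru_cache_enable()/pagevec_add_and_need_flush() from the patch below, and PAGEVEC_SIZE here just stands in for the kernel's batch size): while the disable count is elevated, every page added to a batch demands an immediate flush, so no page can linger in a pagevec and hold an extra reference during migration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEVEC_SIZE 15

/* Count of lru_cache_disable() calls minus lru_cache_enable() calls. */
static atomic_int lru_disable_count;

struct pagevec {
	int nr;
	void *pages[PAGEVEC_SIZE];
};

static bool lru_cache_disabled(void)
{
	return atomic_load(&lru_disable_count) != 0;
}

static void lru_cache_disable(void)
{
	atomic_fetch_add(&lru_disable_count, 1);
	/* the kernel additionally drains every CPU's pagevecs here */
}

static void lru_cache_enable(void)
{
	atomic_fetch_sub(&lru_disable_count, 1);
}

/*
 * Add a page to the batch and report whether the caller must drain it
 * right away: either the batch is full or batching is disabled.
 */
static bool pagevec_add_and_need_flush(struct pagevec *pvec, void *page)
{
	pvec->pages[pvec->nr++] = page;
	return pvec->nr == PAGEVEC_SIZE || lru_cache_disabled();
}
```

With batching enabled a page may sit in the pagevec until the batch fills; once lru_cache_disable() has been called, the very next add reports that a flush is needed, which is exactly the property migration relies on.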

Since the race is really hard to reproduce, I measured how many times
migrate_pages() retried with force mode (roughly, a fallback to
synchronous migration) using the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
			..
			..

if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
       printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
       dump_page(page, "fail to migrate");
}

The test repeatedly launched Android apps, with CMA allocation running
in the background every five seconds. The total CMA allocation count was
about 500 during the test. With this patch, the dump_page() count
dropped from about 400 to 30.

The new interface is also useful for memory hotplug, which currently
drains the LRU pcp caches after each migration failure. This is rather
suboptimal, as it has to disrupt other work running during the
operation. With the new interface the drain happens only once. This is
also in line with the pcp allocator caches, which are disabled for the
offlining as well.
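The hotplug improvement can be illustrated with a toy user-space model (everything here, the drain counter, try_migrate(), and the two offline_*() loops, is invented for illustration, not kernel code): the old scheme pays one all-CPU drain per retry, while the new scheme pays a single drain inside lru_cache_disable() regardless of how many passes the loop takes.

```c
#include <stdbool.h>

static int drain_ops;	/* counts disruptive all-CPU drain operations */

static void lru_add_drain_all(void) { drain_ops++; }
static void lru_cache_disable(void) { drain_ops++; /* single up-front drain */ }
static void lru_cache_enable(void)  { /* just drops the disable count */ }

/* Stand-in for migration: succeeds only on the final pass. */
static bool try_migrate(int pass, int passes) { return pass == passes - 1; }

/* Old scheme: drain the LRU caches before every retry. */
static int offline_old(int passes)
{
	drain_ops = 0;
	for (int pass = 0; pass < passes; pass++) {
		lru_add_drain_all();
		if (try_migrate(pass, passes))
			break;
	}
	return drain_ops;
}

/* New scheme: one disable/enable pair around the whole operation. */
static int offline_new(int passes)
{
	drain_ops = 0;
	lru_cache_disable();
	for (int pass = 0; pass < passes; pass++)
		if (try_migrate(pass, passes))
			break;
	lru_cache_enable();
	return drain_ops;
}
```

For a five-pass retry loop the old scheme performs five drains and the new one a single drain, which is the "happens only once" behavior described above.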

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  3 ++
 mm/memory_hotplug.c  |  3 +-
 mm/mempolicy.c       |  4 ++-
 mm/migrate.c         |  3 +-
 mm/swap.c            | 79 ++++++++++++++++++++++++++++++++++++--------
 5 files changed, 75 insertions(+), 17 deletions(-)

Comments

Chris Goldsworthy March 11, 2021, 10:41 p.m. UTC | #1
On 2021-03-10 08:14, Minchan Kim wrote:
> [ full commit message snipped ]

Hi Minchan,

This all looks good to me - feel free to add a Reviewed-by from me.

Thanks,

Chris.
Michal Hocko March 12, 2021, 8:21 a.m. UTC | #2
On Wed 10-03-21 08:14:28, Minchan Kim wrote:
> [ full commit message snipped ]

Looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks
David Hildenbrand March 12, 2021, 9 a.m. UTC | #3
On 10.03.21 17:14, Minchan Kim wrote:
> [ full commit message and diffstat snipped ]
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 32f665b1ee85..a3e258335a7f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -339,6 +339,9 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
>   extern void lru_note_cost_page(struct page *);
>   extern void lru_cache_add(struct page *);
>   extern void mark_page_accessed(struct page *);
> +extern void lru_cache_disable(void);
> +extern void lru_cache_enable(void);
> +extern bool lru_cache_disabled(void);
>   extern void lru_add_drain(void);
>   extern void lru_add_drain_cpu(int cpu);
>   extern void lru_add_drain_cpu_zone(struct zone *zone);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 5ba51a8bdaeb..959f659ef085 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1611,6 +1611,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	 * in a way that pages from isolated pageblock are left on pcplists.
>   	 */
>   	zone_pcp_disable(zone);
> +	lru_cache_disable();

Did you also experiment with the effects zone_pcp_disable() might have 
on alloc_contig_range()?

Feels like both calls could be abstracted somehow and used in both 
(memory offlining/alloc_contig_range) cases. It's essentially disabling 
some kind of caching.


Looks sane to me, but I am not experienced enough with the migration 
code to give this a real RB.
Chris Goldsworthy March 14, 2021, 5:10 a.m. UTC | #4
On 2021-03-11 14:41, Chris Goldsworthy wrote:
> On 2021-03-10 08:14, Minchan Kim wrote:
>> [ full commit message snipped ]
> 
> Hi Minchan,
> 
> This all looks good to me - feel free to add a Reviewed-by from me.
> 
> Thanks,
> 
> Chris.
Should have added:

Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Andrew Morton March 18, 2021, 12:13 a.m. UTC | #5
On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim <minchan@kernel.org> wrote:

> [ full commit message snipped ]
> 

This is really a rather ugly thing, particularly from a maintainability
point of view.  Are you sure you found all the sites which need the
enable/disable?  How do we prevent new ones from creeping in which need
the same treatment?  Is there some way of adding a runtime check which
will trip if a conversion was missed?

> ...
>
> +bool lru_cache_disabled(void)
> +{
> +	return atomic_read(&lru_disable_count);
> +}
> +
> +void lru_cache_enable(void)
> +{
> +	atomic_dec(&lru_disable_count);
> +}
> +
> +/*
> + * lru_cache_disable() needs to be called before we start compiling
> + * a list of pages to be migrated using isolate_lru_page().
> + * It drains pages on LRU cache and then disable on all cpus until
> + * lru_cache_enable is called.
> + *
> + * Must be paired with a call to lru_cache_enable().
> + */
> +void lru_cache_disable(void)
> +{
> +	atomic_inc(&lru_disable_count);
> +#ifdef CONFIG_SMP
> +	/*
> +	 * lru_add_drain_all in the force mode will schedule draining on
> +	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> +	 * local_lock or preemption disabled would be ordered by that.
> +	 * The atomic operation doesn't need to have stronger ordering
> +	 * requirements because that is enforeced by the scheduling
> +	 * guarantees.
> +	 */
> +	__lru_add_drain_all(true);
> +#else
> +	lru_add_drain();
> +#endif
> +}

I guess at least the first two of these functions should be inlined.
Minchan Kim March 18, 2021, 1:13 a.m. UTC | #6
On Wed, Mar 17, 2021 at 05:13:16PM -0700, Andrew Morton wrote:
> On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim <minchan@kernel.org> wrote:
> 
> > [ full commit message snipped ]
> > 
> 
> This is really a rather ugly thing, particularly from a maintainability
> point of view.  Are you sure you found all the sites which need the

If you meant the maintainability concern as "needs a pair but might miss
one", we have lots of examples of such APIs (zone_pcp_disable,
inc_tlb_flush, kmap_atomic and so on), so I don't think that's what you
meant.

If you meant the question of how a user should decide between
lru_add_drain_all and the lru_cache_disable/enable pair, we already
carried that concept with migrate_prep. IOW, if someone wants to
increase the migration success ratio at the cost of draining overhead,
they can use lru_cache_disable instead of lru_add_drain_all.

Personally, I preferred migrate_prep/finish since it could include
other stuff (e.g., zone_pcp_disable) as well as lru_cache_disable,
but reviewers didn't like the wrapper.

Your comment made me realize that during the transition from v2 to v3,
I missed a site, and the most important one for me. :(

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f1f0ee08628f..39775c8f8c90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8470,7 +8470,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
                .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
        };

-       lru_add_drain_all();
+       lru_cache_disable();

        while (pfn < end || !list_empty(&cc->migratepages)) {
                if (fatal_signal_pending(current)) {
@@ -8498,6 +8498,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
                ret = migrate_pages(&cc->migratepages, alloc_migration_target,
                                NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
        }
+
+       lru_cache_enable();
+
        if (ret < 0) {
                putback_movable_pages(&cc->migratepages);
                return ret;

However, that was just my mistake during patch stacking, not a
semantic problem.

Do you still see any concern? Otherwise, I will submit the fix again.

> enable/disable?  How do we prevent new ones from creeping in which need

> the same treatment?  Is there some way of adding a runtime check which
> will trip if a conversion was missed?

Are you concerned about losing the pairing, or about places that should
use lru_cache_disable rather than lru_add_drain_all?
As I mentioned, I replaced all the migrate_prep call sites with
lru_cache_disable, except for the one missed above.

> 
> > ...
> >
> > +bool lru_cache_disabled(void)
> > +{
> > +	return atomic_read(&lru_disable_count);
> > +}
> > +
> > +void lru_cache_enable(void)
> > +{
> > +	atomic_dec(&lru_disable_count);
> > +}
> > +
> > +/*
> > + * lru_cache_disable() needs to be called before we start compiling
> > + * a list of pages to be migrated using isolate_lru_page().
> > + * It drains pages on LRU cache and then disable on all cpus until
> > + * lru_cache_enable is called.
> > + *
> > + * Must be paired with a call to lru_cache_enable().
> > + */
> > +void lru_cache_disable(void)
> > +{
> > +	atomic_inc(&lru_disable_count);
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * lru_add_drain_all in the force mode will schedule draining on
> > +	 * all online CPUs so any calls of lru_cache_disabled wrapped by
> > +	 * local_lock or preemption disabled would be ordered by that.
> > +	 * The atomic operation doesn't need to have stronger ordering
> > +	 * requirements because that is enforeced by the scheduling
> > +	 * guarantees.
> > +	 */
> > +	__lru_add_drain_all(true);
> > +#else
> > +	lru_add_drain();
> > +#endif
> > +}
> 
> I guess at least the first two of these functions should be inlined.

Sure. Let me respin with the missing piece above fixed once we get some
direction.
Michal Hocko March 18, 2021, 8:09 a.m. UTC | #7
On Wed 17-03-21 17:13:16, Andrew Morton wrote:
> On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim <minchan@kernel.org> wrote:
> 
> > [ full commit message snipped ]
> > 
> 
> This is really a rather ugly thing, particularly from a maintainability
> point of view.  Are you sure you found all the sites which need the
> enable/disable?  How do we prevent new ones from creeping in which need
> the same treatment?  Is there some way of adding a runtime check which
> will trip if a conversion was missed?

I am not sure I am following. What is your concern here? This is a
lock-like interface to disable a certain optimization because it stands
in the way. Not using the interface is not a correctness problem.

If you refer to the disable/enable interface and a potentially missing
enable for some reason, then again this will not become a correctness
problem. It will result in suboptimal behavior. So in the end this is
much less of a problem than leaving a lock behind.

The functionality is not exported to modules and I would agree that this
is not something for out-of-core/MM code to use. We can hide it in an
internal mm header if you want?

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 32f665b1ee85..a3e258335a7f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,9 @@  extern void lru_note_cost(struct lruvec *lruvec, bool file,
 extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
 extern void mark_page_accessed(struct page *);
+extern void lru_cache_disable(void);
+extern void lru_cache_enable(void);
+extern bool lru_cache_disabled(void);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5ba51a8bdaeb..959f659ef085 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1611,6 +1611,7 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	 * in a way that pages from isolated pageblock are left on pcplists.
 	 */
 	zone_pcp_disable(zone);
+	lru_cache_disable();
 
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
@@ -1642,7 +1643,6 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			}
 
 			cond_resched();
-			lru_add_drain_all();
 
 			ret = scan_movable_pages(pfn, end_pfn, &pfn);
 			if (!ret) {
@@ -1687,6 +1687,7 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
+	lru_cache_enable();
 	zone_pcp_enable(zone);
 
 	/* removal success */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fc024e97be37..658238e69551 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1323,7 +1323,7 @@  static long do_mbind(unsigned long start, unsigned long len,
 
 	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
 
-		lru_add_drain_all();
+		lru_cache_disable();
 	}
 	{
 		NODEMASK_SCRATCH(scratch);
@@ -1371,6 +1371,8 @@  static long do_mbind(unsigned long start, unsigned long len,
 	mmap_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		lru_cache_enable();
 	return err;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 45f925e10f5a..acc9913e4303 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1747,7 +1747,7 @@  static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 	int start, i;
 	int err = 0, err1;
 
-	lru_add_drain_all();
+	lru_cache_disable();
 
 	for (i = start = 0; i < nr_pages; i++) {
 		const void __user *p;
@@ -1816,6 +1816,7 @@  static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 	if (err >= 0)
 		err = err1;
 out:
+	lru_cache_enable();
 	return err;
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 441d1ae1f285..fbdf6ac05aec 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -235,6 +235,18 @@  static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 	}
 }
 
+/* return true if pagevec needs to drain */
+static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
+{
+	bool ret = false;
+
+	if (!pagevec_add(pvec, page) || PageCompound(page) ||
+			lru_cache_disabled())
+		ret = true;
+
+	return ret;
+}
+
 /*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
@@ -252,7 +264,7 @@  void rotate_reclaimable_page(struct page *page)
 		get_page(page);
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
@@ -343,7 +355,7 @@  static void activate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -458,7 +470,7 @@  void lru_cache_add(struct page *page)
 	get_page(page);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
-	if (!pagevec_add(pvec, page) || PageCompound(page))
+	if (pagevec_add_and_need_flush(pvec, page))
 		__pagevec_lru_add(pvec);
 	local_unlock(&lru_pvecs.lock);
 }
@@ -654,7 +666,7 @@  void deactivate_file_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -676,7 +688,7 @@  void deactivate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -698,7 +710,7 @@  void mark_page_lazyfree(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -729,18 +741,13 @@  static void lru_add_drain_per_cpu(struct work_struct *dummy)
 }
 
 /*
- * lru_add_drain_all() usually needs to be called before we start compiling
- * a list of pages to be migrated using isolate_lru_page(). Note that pages
- * may be moved off the LRU after we have drained them. Those pages will
- * fail to migrate like other pages that may be busy.
- *
  * Doesn't need any cpu hotplug locking because we do rely on per-cpu
  * kworkers being shut down before our page_alloc_cpu_dead callback is
  * executed on the offlined cpu.
  * Calling this function with cpu hotplug locks held can actually lead
  * to obscure indirect dependencies via WQ context.
  */
-void lru_add_drain_all(void)
+static void __lru_add_drain_all(bool force_all_cpus)
 {
 	/*
 	 * lru_drain_gen - Global pages generation number
@@ -785,7 +792,7 @@  void lru_add_drain_all(void)
 	 * (C) Exit the draining operation if a newer generation, from another
 	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
 	 */
-	if (unlikely(this_gen != lru_drain_gen))
+	if (unlikely(this_gen != lru_drain_gen && !force_all_cpus))
 		goto done;
 
 	/*
@@ -815,7 +822,8 @@  void lru_add_drain_all(void)
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
-		if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+		if (force_all_cpus ||
+		    pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
 		    data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -833,6 +841,11 @@  void lru_add_drain_all(void)
 done:
 	mutex_unlock(&lock);
 }
+
+void lru_add_drain_all(void)
+{
+	__lru_add_drain_all(false);
+}
 #else
 void lru_add_drain_all(void)
 {
@@ -840,6 +853,44 @@  void lru_add_drain_all(void)
 }
 #endif /* CONFIG_SMP */
 
+static atomic_t lru_disable_count = ATOMIC_INIT(0);
+
+bool lru_cache_disabled(void)
+{
+	return atomic_read(&lru_disable_count);
+}
+
+void lru_cache_enable(void)
+{
+	atomic_dec(&lru_disable_count);
+}
+
+/*
+ * lru_cache_disable() needs to be called before we start compiling
+ * a list of pages to be migrated using isolate_lru_page().
+ * It drains pages on LRU cache and then disable on all cpus until
+ * lru_cache_enable is called.
+ *
+ * Must be paired with a call to lru_cache_enable().
+ */
+void lru_cache_disable(void)
+{
+	atomic_inc(&lru_disable_count);
+#ifdef CONFIG_SMP
+	/*
+	 * lru_add_drain_all in the force mode will schedule draining on
+	 * all online CPUs so any calls of lru_cache_disabled wrapped by
+	 * local_lock or preemption disabled would be ordered by that.
+	 * The atomic operation doesn't need to have stronger ordering
+	 * requirements because that is enforced by the scheduling
+	 * guarantees.
+	 */
+	__lru_add_drain_all(true);
+#else
+	lru_add_drain();
+#endif
+}
+
 /**
  * release_pages - batched put_page()
  * @pages: array of pages to release