
[RFC,01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

Message ID 20190215220856.29749-2-zi.yan@sent.com (mailing list archive)
State New, archived
Series Generating physically contiguous memory after page allocation

Commit Message

Zi Yan Feb. 15, 2019, 10:08 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

Instead of using two migrate_pages() calls, a single exchange_pages() is
sufficient and avoids allocating new pages.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/ksm.h |   5 +
 mm/Makefile         |   1 +
 mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h       |   6 +
 mm/ksm.c            |  35 ++
 mm/migrate.c        |   4 +-
 6 files changed, 895 insertions(+), 2 deletions(-)
 create mode 100644 mm/exchange.c

Comments

Matthew Wilcox Feb. 17, 2019, 11:29 a.m. UTC | #1
On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> +struct page_flags {
> +	unsigned int page_error :1;
> +	unsigned int page_referenced:1;
> +	unsigned int page_uptodate:1;
> +	unsigned int page_active:1;
> +	unsigned int page_unevictable:1;
> +	unsigned int page_checked:1;
> +	unsigned int page_mappedtodisk:1;
> +	unsigned int page_dirty:1;
> +	unsigned int page_is_young:1;
> +	unsigned int page_is_idle:1;
> +	unsigned int page_swapcache:1;
> +	unsigned int page_writeback:1;
> +	unsigned int page_private:1;
> +	unsigned int __pad:3;
> +};

I'm not sure how to feel about this.  It's a bit fragile versus somebody adding
new page flags.  I don't know whether it's needed or whether you can just
copy page->flags directly because you're holding PageLock.

> +static void exchange_page(char *to, char *from)
> +{
> +	u64 tmp;
> +	int i;
> +
> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
> +		tmp = *((u64 *)(from + i));
> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
> +		*((u64 *)(to + i)) = tmp;
> +	}
> +}

I have a suspicion you'd be better off allocating a temporary page and
using copy_page().  Some architectures have put a lot of effort into
making copy_page() run faster.

> +		xa_lock_irq(&to_mapping->i_pages);
> +
> +		to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
> +			page_index(to_page));

This needs to be converted to the XArray.  radix_tree_lookup_slot() is
going away soon.  You probably need:

	XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));
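
and then the lookup/replace would become something along these lines (a rough
sketch only, following what migrate_page_move_mapping() does these days; not
tested against this series):

	xas_lock_irq(&to_xas);
	if (page_count(to_page) != to_expected_count ||
	    xas_load(&to_xas) != to_page) {
		xas_unlock_irq(&to_xas);
		return -EAGAIN;
	}
	/* ... freeze the refcounts as before ... */
	xas_store(&to_xas, from_page);	/* replaces radix_tree_replace_slot() */
	xas_unlock_irq(&to_xas);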

This is a lot of code and I'm still trying to get my head around it all.
Thanks for putting in this work; it's good to see this approach being
explored.
Zi Yan Feb. 18, 2019, 5:31 p.m. UTC | #2
On 17 Feb 2019, at 3:29, Matthew Wilcox wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> +struct page_flags {
>> +	unsigned int page_error :1;
>> +	unsigned int page_referenced:1;
>> +	unsigned int page_uptodate:1;
>> +	unsigned int page_active:1;
>> +	unsigned int page_unevictable:1;
>> +	unsigned int page_checked:1;
>> +	unsigned int page_mappedtodisk:1;
>> +	unsigned int page_dirty:1;
>> +	unsigned int page_is_young:1;
>> +	unsigned int page_is_idle:1;
>> +	unsigned int page_swapcache:1;
>> +	unsigned int page_writeback:1;
>> +	unsigned int page_private:1;
>> +	unsigned int __pad:3;
>> +};
>
> I'm not sure how to feel about this.  It's a bit fragile versus somebody
> adding new page flags.  I don't know whether it's needed or whether you can
> just copy page->flags directly because you're holding PageLock.

I agree with you that the current way of copying page flags individually
could miss new page flags. I will try to come up with something better.
Copying page->flags as a whole might not simply work, since the upper part
of page->flags holds the page's node information, which should not be
changed. I think I need to add a helper function that just copies/exchanges
all page flags, similar to calling migrate_page_states() twice.
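
Something restricted to the low flag bits might be closer to what is needed.
A rough sketch (not part of this patch; it ignores cpupid and the flags that
need accounting or test-and-clear atomics, and assumes both pages are locked
and unmapped):

	static void exchange_raw_page_flags(struct page *a, struct page *b)
	{
		/* bits 0..NR_PAGEFLAGS-1 hold the PG_* flags; the zone/node/
		 * section fields live above them and must stay per-page */
		const unsigned long low = (1UL << NR_PAGEFLAGS) - 1;
		unsigned long a_low = a->flags & low;
		unsigned long b_low = b->flags & low;

		a->flags = (a->flags & ~low) | b_low;
		b->flags = (b->flags & ~low) | a_low;
	}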

>> +static void exchange_page(char *to, char *from)
>> +{
>> +	u64 tmp;
>> +	int i;
>> +
>> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
>> +		tmp = *((u64 *)(from + i));
>> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
>> +		*((u64 *)(to + i)) = tmp;
>> +	}
>> +}
>
> I have a suspicion you'd be better off allocating a temporary page and
> using copy_page().  Some architectures have put a lot of effort into
> making copy_page() run faster.

When I am doing exchange_pages() between two NUMA nodes on an x86_64 machine,
I can actually saturate the QPI bandwidth with this operation. I think cache
prefetching was doing its job.

The purpose of proposing exchange_pages() is to avoid allocating any new
page, so that we would not trigger any potential page reclaim or memory
compaction. Allocating a temporary page defeats the purpose.

>
>> +		xa_lock_irq(&to_mapping->i_pages);
>> +
>> +		to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
>> +			page_index(to_page));
>
> This needs to be converted to the XArray.  radix_tree_lookup_slot() is
> going away soon.  You probably need:
>
> 	XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

Thank you for pointing this out. I will make the change.

>
> This is a lot of code and I'm still trying to get my head around it all.
> Thanks for putting in this work; it's good to see this approach being
> explored.

Thank you for taking a look at the code.

--
Best Regards,
Yan Zi
Vlastimil Babka Feb. 18, 2019, 5:42 p.m. UTC | #3
On 2/18/19 6:31 PM, Zi Yan wrote:
> The purpose of proposing exchange_pages() is to avoid allocating any new
> page, so that we would not trigger any potential page reclaim or memory
> compaction. Allocating a temporary page defeats the purpose.

Compaction can only happen for order > 0 temporary pages. Even if you used a
single order-0 page to gradually exchange e.g. a THP, it should be better than
the per-u64 exchange. Allocating an order-0 page should be a non-issue. If it's
an issue, then the system is in a bad state and physically contiguous layout is
a secondary concern.
Zi Yan Feb. 18, 2019, 5:51 p.m. UTC | #4
On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:

> On 2/18/19 6:31 PM, Zi Yan wrote:
>> The purpose of proposing exchange_pages() is to avoid allocating any new
>> page, so that we would not trigger any potential page reclaim or memory
>> compaction. Allocating a temporary page defeats the purpose.
>
> Compaction can only happen for order > 0 temporary pages. Even if you used a
> single order-0 page to gradually exchange e.g. a THP, it should be better
> than the per-u64 exchange. Allocating an order-0 page should be a non-issue.
> If it's an issue, then the system is in a bad state and physically contiguous
> layout is a secondary concern.

You are right if we only need to allocate one order-0 page. But this also
means we can only exchange two pages at a time. We need to add a lock to make
sure the temporary page is used exclusively, or we need to keep allocating
temporary pages when multiple exchange_pages() calls are happening at the
same time.

--
Best Regards,
Yan Zi
Matthew Wilcox Feb. 18, 2019, 5:52 p.m. UTC | #5
On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
> > On 2/18/19 6:31 PM, Zi Yan wrote:
> > > The purpose of proposing exchange_pages() is to avoid allocating any new
> > > page, so that we would not trigger any potential page reclaim or memory
> > > compaction. Allocating a temporary page defeats the purpose.
> > 
> > Compaction can only happen for order > 0 temporary pages. Even if you used
> > a single order-0 page to gradually exchange e.g. a THP, it should be better
> > than the per-u64 exchange. Allocating an order-0 page should be a non-issue.
> > If it's an issue, then the system is in a bad state and physically
> > contiguous layout is a secondary concern.
> 
> You are right if we only need to allocate one order-0 page. But this also
> means we can only exchange two pages at a time. We need to add a lock to make
> sure the temporary page is used exclusively, or we need to keep allocating
> temporary pages when multiple exchange_pages() calls are happening at the
> same time.

You allocate one temporary page per thread that's doing an exchange_page().
Zi Yan Feb. 18, 2019, 5:59 p.m. UTC | #6
On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:

> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>> The purpose of proposing exchange_pages() is to avoid allocating any new
>>>> page, so that we would not trigger any potential page reclaim or memory
>>>> compaction. Allocating a temporary page defeats the purpose.
>>>
>>> Compaction can only happen for order > 0 temporary pages. Even if you used
>>> a single order-0 page to gradually exchange e.g. a THP, it should be better
>>> than the per-u64 exchange. Allocating an order-0 page should be a non-issue.
>>> If it's an issue, then the system is in a bad state and physically
>>> contiguous layout is a secondary concern.
>>
>> You are right if we only need to allocate one order-0 page. But this also
>> means we can only exchange two pages at a time. We need to add a lock to make
>> sure the temporary page is used exclusively, or we need to keep allocating
>> temporary pages when multiple exchange_pages() calls are happening at the
>> same time.
>
> You allocate one temporary page per thread that's doing an exchange_page().

Yeah, you are right. I think I need at most NR_CPUS order-0 pages. I will try
it. Thanks.

--
Best Regards,
Yan Zi
Anshuman Khandual Feb. 19, 2019, 7:42 a.m. UTC | #7
On 02/18/2019 11:29 PM, Zi Yan wrote:
> On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:
> 
>> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>>> The purpose of proposing exchange_pages() is to avoid allocating any new
>>>>> page, so that we would not trigger any potential page reclaim or memory
>>>>> compaction. Allocating a temporary page defeats the purpose.
>>>>
>>>> Compaction can only happen for order > 0 temporary pages. Even if you used
>>>> a single order-0 page to gradually exchange e.g. a THP, it should be better
>>>> than the per-u64 exchange. Allocating an order-0 page should be a non-issue.
>>>> If it's an issue, then the system is in a bad state and physically
>>>> contiguous layout is a secondary concern.
>>>
>>> You are right if we only need to allocate one order-0 page. But this also
>>> means we can only exchange two pages at a time. We need to add a lock to make
>>> sure the temporary page is used exclusively, or we need to keep allocating
>>> temporary pages when multiple exchange_pages() calls are happening at the
>>> same time.
>>
>> You allocate one temporary page per thread that's doing an exchange_page().
> 
> Yeah, you are right. I think I need at most NR_CPUS order-0 pages. I will
> try it. Thanks.

But the location of this temp page matters as well, because you would like to
saturate the inter-node interface. It needs to be on one of the nodes where
the source or destination page belongs. Any other node would generate two
inter-node copy processes, which is not what you intend here, I guess.
Matthew Wilcox Feb. 19, 2019, 12:56 p.m. UTC | #8
On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
> But the location of this temp page matters as well, because you would like
> to saturate the inter-node interface. It needs to be on one of the nodes
> where the source or destination page belongs. Any other node would generate
> two inter-node copy processes, which is not what you intend here, I guess.

That makes no sense.  It should be allocated on the local node of the CPU
performing the copy.  If the CPU is in node A, the destination is in node B
and the source is in node C, then you're doing 4k worth of reads from node C,
4k worth of reads from node B, 4k worth of writes to node C followed by
4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
node A will be written back from cache to the local memory (... or not,
if that page gets reused for some other purpose first).

If you allocate the page on node B or node C, that's an extra 4k of writes
to be sent across the inter-node link.
Anshuman Khandual Feb. 20, 2019, 4:38 a.m. UTC | #9
On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>> But the location of this temp page matters as well, because you would like
>> to saturate the inter-node interface. It needs to be on one of the nodes
>> where the source or destination page belongs. Any other node would generate
>> two inter-node copy processes, which is not what you intend here, I guess.
> That makes no sense.  It should be allocated on the local node of the CPU
> performing the copy.  If the CPU is in node A, the destination is in node B
> and the source is in node C, then you're doing 4k worth of reads from node C,
> 4k worth of reads from node B, 4k worth of writes to node C followed by
> 4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
> node A will be written back from cache to the local memory (... or not,
> if that page gets reused for some other purpose first).
> 
> If you allocate the page on node B or node C, that's an extra 4k of writes
> to be sent across the inter-node link.

That's right, there will be an extra remote write. My assumption was that the
CPU performing the copy belongs to either node B or node C.
Jerome Glisse Feb. 21, 2019, 9:10 p.m. UTC | #10
On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Instead of using two migrate_pages() calls, a single exchange_pages() is
> sufficient and avoids allocating new pages.

So I believe it would be better to arrange the code differently: instead of
having one function that special-cases every combination, define a function
for each one, i.e.:
    exchange_anon_to_share()
    exchange_anon_to_anon()
    exchange_share_to_share()

Then you could define functions to test whether a page is in the correct state:
    can_exchange_anon_page() // return true if the page can be exchanged
    can_exchange_share_page()

In fact, both of these functions could be factored out as common helpers shared
with the existing migration code in migrate.c. This way we would have only one
place where we need to handle all the special casing, tests, and exceptions.
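
Roughly, the dispatch could then look like this (just a sketch; the
exchange_*() and can_exchange_*() bodies would carry the per-case logic):

	static int exchange_one_pair(struct page *from_page, struct page *to_page,
				     enum migrate_mode mode)
	{
		/* in this series from_page is always anonymous,
		 * to_page may be anonymous or file-backed */
		if (!page_mapping(to_page)) {
			if (!can_exchange_anon_page(from_page) ||
			    !can_exchange_anon_page(to_page))
				return -EBUSY;
			return exchange_anon_to_anon(from_page, to_page, mode);
		}

		if (!can_exchange_anon_page(from_page) ||
		    !can_exchange_share_page(to_page))
			return -EBUSY;
		return exchange_anon_to_share(from_page, to_page, mode);
	}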

Other than that I could not spot anything obviously wrong, but I did not spend
enough time to check everything. Re-architecting the code as I propose above
would make it a lot easier to review, I believe.

Cheers,
Jérôme

> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/ksm.h |   5 +
>  mm/Makefile         |   1 +
>  mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/internal.h       |   6 +
>  mm/ksm.c            |  35 ++
>  mm/migrate.c        |   4 +-
>  6 files changed, 895 insertions(+), 2 deletions(-)
>  create mode 100644 mm/exchange.c

[...]

> +	from_page_count = page_count(from_page);
> +	from_map_count = page_mapcount(from_page);
> +	to_page_count = page_count(to_page);
> +	to_map_count = page_mapcount(to_page);
> +	from_flags = from_page->flags;
> +	to_flags = to_page->flags;
> +	from_mapping = from_page->mapping;
> +	to_mapping = to_page->mapping;
> +	from_index = from_page->index;
> +	to_index = to_page->index;

Those are not used anywhere ...
Zi Yan Feb. 21, 2019, 9:25 p.m. UTC | #11
On 21 Feb 2019, at 13:10, Jerome Glisse wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Instead of using two migrate_pages() calls, a single exchange_pages() is
>> sufficient and avoids allocating new pages.
>
> So I believe it would be better to arrange the code differently: instead of
> having one function that special-cases every combination, define a function
> for each one, i.e.:
>     exchange_anon_to_share()
>     exchange_anon_to_anon()
>     exchange_share_to_share()
>
> Then you could define functions to test whether a page is in the correct state:
>     can_exchange_anon_page() // return true if the page can be exchanged
>     can_exchange_share_page()
>
> In fact, both of these functions could be factored out as common helpers shared
> with the existing migration code in migrate.c. This way we would have only one
> place where we need to handle all the special casing, tests, and exceptions.
>
> Other than that I could not spot anything obviously wrong, but I did not spend
> enough time to check everything. Re-architecting the code as I propose above
> would make it a lot easier to review, I believe.
>

Thank you for reviewing the patch. Your suggestions are very helpful.
I will restructure the code to help people review it.


>> +	from_page_count = page_count(from_page);
>> +	from_map_count = page_mapcount(from_page);
>> +	to_page_count = page_count(to_page);
>> +	to_map_count = page_mapcount(to_page);
>> +	from_flags = from_page->flags;
>> +	to_flags = to_page->flags;
>> +	from_mapping = from_page->mapping;
>> +	to_mapping = to_page->mapping;
>> +	from_index = from_page->index;
>> +	to_index = to_page->index;
>
> Those are not used anywhere ...

Will remove them. Thanks.

--
Best Regards,
Yan Zi
Zi Yan March 14, 2019, 2:39 a.m. UTC | #12
On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

> On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
>> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>>> But the location of this temp page matters as well, because you would like
>>> to saturate the inter-node interface. It needs to be on one of the nodes
>>> where the source or destination page belongs. Any other node would generate
>>> two inter-node copy processes, which is not what you intend here, I guess.
>> That makes no sense.  It should be allocated on the local node of the CPU
>> performing the copy.  If the CPU is in node A, the destination is in node B
>> and the source is in node C, then you're doing 4k worth of reads from node C,
>> 4k worth of reads from node B, 4k worth of writes to node C followed by
>> 4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
>> node A will be written back from cache to the local memory (... or not,
>> if that page gets reused for some other purpose first).
>>
>> If you allocate the page on node B or node C, that's an extra 4k of writes
>> to be sent across the inter-node link.
>
> That's right, there will be an extra remote write. My assumption was that
> the CPU performing the copy belongs to either node B or node C.


I have some interesting throughput results for exchanging per u64 versus
exchanging per 4KB page. What I discovered is that using a 4KB page as the
temporary storage for exchanging 2MB THPs does not improve the throughput.
On the contrary, when we are exchanging more than 2^4 = 16 THPs, exchanging
per 4KB page has lower throughput than exchanging per u64. Please see the
results below.

The experiments are done on a two-socket machine with two Intel Xeon
E5-2640 v3 CPUs. All exchanges go across the QPI link between the two sockets.


Results
===

Throughput (GB/s) of exchanging 2^N 2MB pages between two NUMA nodes
(N = 2mb_page_order):

| 2mb_page_order | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    |
|     u64        | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62 |
|     per_page   | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31 |

Normalized throughput (u64 relative to per_page):

| 2mb_page_order | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    |
|     u64        | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31 |



Exchange page code
===

For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}


For exchanging per 4KB page, I use the following function:

static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());

	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

		if (!page_tmp_page) {
			/* no per-CPU bounce page available,
			 * fall back to the per-u64 exchange */
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}

	/* exchange the two pages through the local per-CPU bounce page */
	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}

where page_tmp is pre-allocated, one page local to each CPU; the
alloc_pages_node() call above only covers hot-added CPUs and is not exercised
in these tests.
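
For reference, the per-CPU pre-allocation looks roughly like this (a sketch;
the exact init path in the repo below may differ):

	static char *page_tmp[NR_CPUS];

	static int __init exchange_page_tmp_init(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct page *page = alloc_pages_node(cpu_to_node(cpu),
							     GFP_KERNEL, 0);

			if (page)
				page_tmp[cpu] = kmap(page);
		}
		return 0;
	}
	late_initcall(exchange_page_tmp_init);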


The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc
To reproduce the comparison, you can clone this repo:
https://gitlab.com/ziy/thp-migration-bench,
then run make, ./run_test.sh, and ./get_results.sh using the kernel from
above.

Let me know if I missed anything or did something wrong. Thanks.


--
Best Regards,
Yan Zi

Patch

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..87c5b943a73c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,7 @@  struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+void ksm_exchange_page(struct page *to_page, struct page *from_page);
 
 #else  /* !CONFIG_KSM */
 
@@ -86,6 +87,10 @@  static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline void ksm_exchange_page(struct page *to_page,
+				struct page *from_page)
+{
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
 
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..1574ea5743e4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -43,6 +43,7 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += exchange.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
new file mode 100644
index 000000000000..a607348cc6f4
--- /dev/null
+++ b/mm/exchange.c
@@ -0,0 +1,846 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 NVIDIA, Zi Yan <ziy@nvidia.com>
+ *
+ * Exchange two in-use pages. Page flags and page->mapping are exchanged
+ * as well. Only anonymous pages are supported.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/migrate.h>
+#include <linux/security.h>
+#include <linux/cpuset.h>
+#include <linux/hugetlb.h>
+#include <linux/mm_inline.h>
+#include <linux/page_idle.h>
+#include <linux/page-flags.h>
+#include <linux/ksm.h>
+#include <linux/memcontrol.h>
+#include <linux/balloon_compaction.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h> /* buffer_migrate_page  */
+#include <linux/backing-dev.h>
+
+
+#include "internal.h"
+
+struct exchange_page_info {
+	struct page *from_page;
+	struct page *to_page;
+
+	struct anon_vma *from_anon_vma;
+	struct anon_vma *to_anon_vma;
+
+	struct list_head list;
+};
+
+struct page_flags {
+	unsigned int page_error :1;
+	unsigned int page_referenced:1;
+	unsigned int page_uptodate:1;
+	unsigned int page_active:1;
+	unsigned int page_unevictable:1;
+	unsigned int page_checked:1;
+	unsigned int page_mappedtodisk:1;
+	unsigned int page_dirty:1;
+	unsigned int page_is_young:1;
+	unsigned int page_is_idle:1;
+	unsigned int page_swapcache:1;
+	unsigned int page_writeback:1;
+	unsigned int page_private:1;
+	unsigned int __pad:3;
+};
+
+
+static void exchange_page(char *to, char *from)
+{
+	u64 tmp;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
+		tmp = *((u64 *)(from + i));
+		*((u64 *)(from + i)) = *((u64 *)(to + i));
+		*((u64 *)(to + i)) = tmp;
+	}
+}
+
+static inline void exchange_highpage(struct page *to, struct page *from)
+{
+	char *vfrom, *vto;
+
+	vfrom = kmap_atomic(from);
+	vto = kmap_atomic(to);
+	exchange_page(vto, vfrom);
+	kunmap_atomic(vto);
+	kunmap_atomic(vfrom);
+}
+
+static void __exchange_gigantic_page(struct page *dst, struct page *src,
+				int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages; ) {
+		cond_resched();
+		exchange_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+static void exchange_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	int nr_pages;
+
+	if (PageHuge(src)) {
+		/* hugetlbfs page */
+		struct hstate *h = page_hstate(src);
+
+		nr_pages = pages_per_huge_page(h);
+
+		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+			__exchange_gigantic_page(dst, src, nr_pages);
+			return;
+		}
+	} else {
+		/* thp page */
+		VM_BUG_ON(!PageTransHuge(src));
+		nr_pages = hpage_nr_pages(src);
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		exchange_highpage(dst + i, src + i);
+	}
+}
+
+/*
+ * Exchange page flags and other per-page state between the two pages.
+ */
+static void exchange_page_flags(struct page *to_page, struct page *from_page)
+{
+	int from_cpupid, to_cpupid;
+	struct page_flags from_page_flags, to_page_flags;
+	struct mem_cgroup *to_memcg = page_memcg(to_page),
+					  *from_memcg = page_memcg(from_page);
+
+	from_cpupid = page_cpupid_xchg_last(from_page, -1);
+
+	from_page_flags.page_error = TestClearPageError(from_page);
+	from_page_flags.page_referenced = TestClearPageReferenced(from_page);
+	from_page_flags.page_uptodate = PageUptodate(from_page);
+	ClearPageUptodate(from_page);
+	from_page_flags.page_active = TestClearPageActive(from_page);
+	from_page_flags.page_unevictable = TestClearPageUnevictable(from_page);
+	from_page_flags.page_checked = PageChecked(from_page);
+	ClearPageChecked(from_page);
+	from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page);
+	ClearPageMappedToDisk(from_page);
+	from_page_flags.page_dirty = PageDirty(from_page);
+	ClearPageDirty(from_page);
+	from_page_flags.page_is_young = test_and_clear_page_young(from_page);
+	from_page_flags.page_is_idle = page_is_idle(from_page);
+	clear_page_idle(from_page);
+	from_page_flags.page_swapcache = PageSwapCache(from_page);
+	from_page_flags.page_writeback = test_clear_page_writeback(from_page);
+
+
+	to_cpupid = page_cpupid_xchg_last(to_page, -1);
+
+	to_page_flags.page_error = TestClearPageError(to_page);
+	to_page_flags.page_referenced = TestClearPageReferenced(to_page);
+	to_page_flags.page_uptodate = PageUptodate(to_page);
+	ClearPageUptodate(to_page);
+	to_page_flags.page_active = TestClearPageActive(to_page);
+	to_page_flags.page_unevictable = TestClearPageUnevictable(to_page);
+	to_page_flags.page_checked = PageChecked(to_page);
+	ClearPageChecked(to_page);
+	to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page);
+	ClearPageMappedToDisk(to_page);
+	to_page_flags.page_dirty = PageDirty(to_page);
+	ClearPageDirty(to_page);
+	to_page_flags.page_is_young = test_and_clear_page_young(to_page);
+	to_page_flags.page_is_idle = page_is_idle(to_page);
+	clear_page_idle(to_page);
+	to_page_flags.page_swapcache = PageSwapCache(to_page);
+	to_page_flags.page_writeback = test_clear_page_writeback(to_page);
+
+	/* set to_page */
+	if (from_page_flags.page_error)
+		SetPageError(to_page);
+	if (from_page_flags.page_referenced)
+		SetPageReferenced(to_page);
+	if (from_page_flags.page_uptodate)
+		SetPageUptodate(to_page);
+	if (from_page_flags.page_active) {
+		VM_BUG_ON_PAGE(from_page_flags.page_unevictable, from_page);
+		SetPageActive(to_page);
+	} else if (from_page_flags.page_unevictable)
+		SetPageUnevictable(to_page);
+	if (from_page_flags.page_checked)
+		SetPageChecked(to_page);
+	if (from_page_flags.page_mappedtodisk)
+		SetPageMappedToDisk(to_page);
+
+	/* Move dirty on pages not done by migrate_page_move_mapping() */
+	if (from_page_flags.page_dirty)
+		SetPageDirty(to_page);
+
+	if (from_page_flags.page_is_young)
+		set_page_young(to_page);
+	if (from_page_flags.page_is_idle)
+		set_page_idle(to_page);
+
+	/* set from_page */
+	if (to_page_flags.page_error)
+		SetPageError(from_page);
+	if (to_page_flags.page_referenced)
+		SetPageReferenced(from_page);
+	if (to_page_flags.page_uptodate)
+		SetPageUptodate(from_page);
+	if (to_page_flags.page_active) {
+		VM_BUG_ON_PAGE(to_page_flags.page_unevictable, from_page);
+		SetPageActive(from_page);
+	} else if (to_page_flags.page_unevictable)
+		SetPageUnevictable(from_page);
+	if (to_page_flags.page_checked)
+		SetPageChecked(from_page);
+	if (to_page_flags.page_mappedtodisk)
+		SetPageMappedToDisk(from_page);
+
+	/* Move dirty on pages not done by migrate_page_move_mapping() */
+	if (to_page_flags.page_dirty)
+		SetPageDirty(from_page);
+
+	if (to_page_flags.page_is_young)
+		set_page_young(from_page);
+	if (to_page_flags.page_is_idle)
+		set_page_idle(from_page);
+
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	page_cpupid_xchg_last(to_page, from_cpupid);
+	page_cpupid_xchg_last(from_page, to_cpupid);
+
+	ksm_exchange_page(to_page, from_page);
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
+	ClearPageSwapCache(to_page);
+	ClearPageSwapCache(from_page);
+	if (from_page_flags.page_swapcache)
+		SetPageSwapCache(to_page);
+	if (to_page_flags.page_swapcache)
+		SetPageSwapCache(from_page);
+
+
+#ifdef CONFIG_PAGE_OWNER
+	/* exchange page owner  */
+	BUILD_BUG();
+#endif
+	/* exchange mem cgroup  */
+	to_page->mem_cgroup = from_memcg;
+	from_page->mem_cgroup = to_memcg;
+
+}
+
+/*
+ * Replace the page in the mapping.
+ *
+ * The number of remaining references must be:
+ * 1 for anonymous pages without a mapping
+ * 2 for pages with a mapping
+ * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
+ */
+
+static int exchange_page_move_mapping(struct address_space *to_mapping,
+			struct address_space *from_mapping,
+			struct page *to_page, struct page *from_page,
+			struct buffer_head *to_head,
+			struct buffer_head *from_head,
+			enum migrate_mode mode,
+			int to_extra_count, int from_extra_count)
+{
+	int to_expected_count = 1 + to_extra_count,
+		from_expected_count = 1 + from_extra_count;
+	unsigned long from_page_index = from_page->index;
+	unsigned long to_page_index = to_page->index;
+	int to_swapbacked = PageSwapBacked(to_page),
+		from_swapbacked = PageSwapBacked(from_page);
+	struct address_space *to_mapping_value = to_page->mapping;
+	struct address_space *from_mapping_value = from_page->mapping;
+
+	VM_BUG_ON_PAGE(to_mapping != page_mapping(to_page), to_page);
+	VM_BUG_ON_PAGE(from_mapping != page_mapping(from_page), from_page);
+
+	if (!to_mapping) {
+		/* Anonymous page without mapping */
+		if (page_count(to_page) != to_expected_count)
+			return -EAGAIN;
+	}
+
+	if (!from_mapping) {
+		/* Anonymous page without mapping */
+		if (page_count(from_page) != from_expected_count)
+			return -EAGAIN;
+	}
+
+	/* both are anonymous pages  */
+	if (!from_mapping && !to_mapping) {
+		/* from_page  */
+		from_page->index = to_page_index;
+		from_page->mapping = to_mapping_value;
+
+		ClearPageSwapBacked(from_page);
+		if (to_swapbacked)
+			SetPageSwapBacked(from_page);
+
+
+		/* to_page  */
+		to_page->index = from_page_index;
+		to_page->mapping = from_mapping_value;
+
+		ClearPageSwapBacked(to_page);
+		if (from_swapbacked)
+			SetPageSwapBacked(to_page);
+	} else if (!from_mapping && to_mapping) {
+		/* from is anonymous, to is file-backed  */
+		struct zone *from_zone, *to_zone;
+		void **to_pslot;
+		int dirty;
+
+		from_zone = page_zone(from_page);
+		to_zone = page_zone(to_page);
+
+		xa_lock_irq(&to_mapping->i_pages);
+
+		to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
+			page_index(to_page));
+
+		to_expected_count += 1 + page_has_private(to_page);
+		if (page_count(to_page) != to_expected_count ||
+			radix_tree_deref_slot_protected(to_pslot,
+				&to_mapping->i_pages.xa_lock) != to_page) {
+			xa_unlock_irq(&to_mapping->i_pages);
+			return -EAGAIN;
+		}
+
+		if (!page_ref_freeze(to_page, to_expected_count)) {
+			xa_unlock_irq(&to_mapping->i_pages);
+			pr_debug("cannot freeze page count\n");
+			return -EAGAIN;
+		}
+
+		if (mode == MIGRATE_ASYNC && to_head &&
+				!buffer_migrate_lock_buffers(to_head, mode)) {
+			page_ref_unfreeze(to_page, to_expected_count);
+			xa_unlock_irq(&to_mapping->i_pages);
+
+			pr_debug("cannot lock buffer head\n");
+			return -EAGAIN;
+		}
+
+		if (!page_ref_freeze(from_page, from_expected_count)) {
+			page_ref_unfreeze(to_page, to_expected_count);
+			xa_unlock_irq(&to_mapping->i_pages);
+
+			return -EAGAIN;
+		}
+		/*
+		 * Now we know that no one else is looking at the page:
+		 * no turning back from here.
+		 */
+		ClearPageSwapBacked(from_page);
+		ClearPageSwapBacked(to_page);
+
+		/* from_page  */
+		from_page->index = to_page_index;
+		from_page->mapping = to_mapping_value;
+		/* to_page  */
+		to_page->index = from_page_index;
+		to_page->mapping = from_mapping_value;
+
+		if (to_swapbacked)
+			__SetPageSwapBacked(from_page);
+		else
+			VM_BUG_ON_PAGE(PageSwapCache(to_page), to_page);
+
+		if (from_swapbacked)
+			__SetPageSwapBacked(to_page);
+		else
+			VM_BUG_ON_PAGE(PageSwapCache(from_page), from_page);
+
+		dirty = PageDirty(to_page);
+
+		radix_tree_replace_slot(&to_mapping->i_pages,
+				to_pslot, from_page);
+
+		/* move cache reference */
+		page_ref_unfreeze(to_page, to_expected_count - 1);
+		page_ref_unfreeze(from_page, from_expected_count + 1);
+
+		xa_unlock(&to_mapping->i_pages);
+
+		/*
+		 * If moved to a different zone then also account
+		 * the page for that zone. Other VM counters will be
+		 * taken care of when we establish references to the
+		 * new page and drop references to the old page.
+		 *
+		 * Note that anonymous pages are accounted for
+		 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
+		 * are mapped to swap space.
+		 */
+		if (to_zone != from_zone) {
+			__dec_node_state(to_zone->zone_pgdat, NR_FILE_PAGES);
+			__inc_node_state(from_zone->zone_pgdat, NR_FILE_PAGES);
+			if (PageSwapBacked(to_page) && !PageSwapCache(to_page)) {
+				__dec_node_state(to_zone->zone_pgdat, NR_SHMEM);
+				__inc_node_state(from_zone->zone_pgdat, NR_SHMEM);
+			}
+			if (dirty && mapping_cap_account_dirty(to_mapping)) {
+				__dec_node_state(to_zone->zone_pgdat, NR_FILE_DIRTY);
+				__dec_zone_state(to_zone, NR_ZONE_WRITE_PENDING);
+				__inc_node_state(from_zone->zone_pgdat, NR_FILE_DIRTY);
+				__inc_zone_state(from_zone, NR_ZONE_WRITE_PENDING);
+			}
+		}
+		local_irq_enable();
+
+	} else {
+		/* from is file-backed to is anonymous: fold this to the case above */
+		/* both are file-backed  */
+		VM_BUG_ON(1);
+	}
+
+	return MIGRATEPAGE_SUCCESS;
+}
+
+static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
+				enum migrate_mode mode)
+{
+	int rc = -EBUSY;
+	struct address_space *to_page_mapping, *from_page_mapping;
+	struct buffer_head *to_head = NULL, *to_bh = NULL;
+
+	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+
+	/* copy page->mapping not use page_mapping()  */
+	to_page_mapping = page_mapping(to_page);
+	from_page_mapping = page_mapping(from_page);
+
+	/* from_page has to be anonymous page  */
+	VM_BUG_ON(from_page_mapping);
+	VM_BUG_ON(PageWriteback(from_page));
+	/* writeback has to finish */
+	BUG_ON(PageWriteback(to_page));
+
+
+	/* to_page is anonymous  */
+	if (!to_page_mapping) {
+exchange_mappings:
+		/* actual page mapping exchange */
+		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
+					to_page, from_page, NULL, NULL, mode, 0, 0);
+	} else {
+		if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {
+
+			if (!page_has_buffers(to_page))
+				goto exchange_mappings;
+
+			to_head = page_buffers(to_page);
+
+			rc = exchange_page_move_mapping(to_page_mapping,
+					from_page_mapping, to_page, from_page,
+					to_head, NULL, mode, 0, 0);
+
+			if (rc != MIGRATEPAGE_SUCCESS)
+				return rc;
+
+			/*
+			 * In the async case, migrate_page_move_mapping locked the buffers
+			 * with an IRQ-safe spinlock held. In the sync case, the buffers
+			 * need to be locked now
+			 */
+			if (mode != MIGRATE_ASYNC)
+				VM_BUG_ON(!buffer_migrate_lock_buffers(to_head, mode));
+
+			ClearPagePrivate(to_page);
+			set_page_private(from_page, page_private(to_page));
+			set_page_private(to_page, 0);
+			/* transfer private page count  */
+			put_page(to_page);
+			get_page(from_page);
+
+			to_bh = to_head;
+			do {
+				set_bh_page(to_bh, from_page, bh_offset(to_bh));
+				to_bh = to_bh->b_this_page;
+
+			} while (to_bh != to_head);
+
+			SetPagePrivate(from_page);
+
+			to_bh = to_head;
+		} else if (!to_page_mapping->a_ops->migratepage) {
+			/* fallback_migrate_page  */
+			if (PageDirty(to_page)) {
+				if (mode != MIGRATE_SYNC)
+					return -EBUSY;
+				return writeout(to_page_mapping, to_page);
+			}
+			if (page_has_private(to_page) &&
+				!try_to_release_page(to_page, GFP_KERNEL))
+				return -EAGAIN;
+
+			goto exchange_mappings;
+		}
+	}
+	/* actual page data exchange  */
+	if (rc != MIGRATEPAGE_SUCCESS)
+		return rc;
+
+
+	if (PageHuge(from_page) || PageTransHuge(from_page))
+		exchange_huge_page(to_page, from_page);
+	else
+		exchange_highpage(to_page, from_page);
+	rc = 0;
+
+	/*
+	 * 1. buffer_migrate_page:
+	 *   private flag should be transferred from to_page to from_page
+	 *
+	 * 2. anon<->anon, fallback_migrate_page:
+	 *   both have none private flags or to_page's is cleared.
+	 */
+	VM_BUG_ON(!((page_has_private(from_page) && !page_has_private(to_page)) ||
+				(!page_has_private(from_page) && !page_has_private(to_page))));
+
+	exchange_page_flags(to_page, from_page);
+
+	if (to_bh) {
+		VM_BUG_ON(to_bh != to_head);
+		do {
+			unlock_buffer(to_bh);
+			put_bh(to_bh);
+			to_bh = to_bh->b_this_page;
+
+		} while (to_bh != to_head);
+	}
+
+	return rc;
+}
+
+static int unmap_and_exchange(struct page *from_page,
+		struct page *to_page, enum migrate_mode mode)
+{
+	int rc = -EAGAIN;
+	struct anon_vma *from_anon_vma = NULL;
+	struct anon_vma *to_anon_vma = NULL;
+	int from_page_was_mapped = 0;
+	int to_page_was_mapped = 0;
+	int from_page_count = 0, to_page_count = 0;
+	int from_map_count = 0, to_map_count = 0;
+	unsigned long from_flags, to_flags;
+	pgoff_t from_index, to_index;
+	struct address_space *from_mapping, *to_mapping;
+
+	if (!trylock_page(from_page)) {
+		if (mode == MIGRATE_ASYNC)
+			goto out;
+		lock_page(from_page);
+	}
+
+	if (!trylock_page(to_page)) {
+		if (mode == MIGRATE_ASYNC)
+			goto out_unlock;
+		lock_page(to_page);
+	}
+
+	/* from_page is supposed to be an anonymous page */
+	VM_BUG_ON_PAGE(PageWriteback(from_page), from_page);
+
+	if (PageWriteback(to_page)) {
+		/*
+		 * Only in the case of a full synchronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
+		 */
+		if (mode != MIGRATE_SYNC) {
+			rc = -EBUSY;
+			goto out_unlock_both;
+		}
+		wait_on_page_writeback(to_page);
+	}
+
+	if (PageAnon(from_page) && !PageKsm(from_page))
+		from_anon_vma = page_get_anon_vma(from_page);
+
+	if (PageAnon(to_page) && !PageKsm(to_page))
+		to_anon_vma = page_get_anon_vma(to_page);
+
+	from_page_count = page_count(from_page);
+	from_map_count = page_mapcount(from_page);
+	to_page_count = page_count(to_page);
+	to_map_count = page_mapcount(to_page);
+	from_flags = from_page->flags;
+	to_flags = to_page->flags;
+	from_mapping = from_page->mapping;
+	to_mapping = to_page->mapping;
+	from_index = from_page->index;
+	to_index = to_page->index;
+
+	/*
+	 * Corner case handling:
+	 * 1. When a new swap-cache page is read into, it is added to the LRU
+	 * and treated as swapcache but it has no rmap yet.
+	 * Calling try_to_unmap() against a page->mapping==NULL page will
+	 * trigger a BUG.  So handle it here.
+	 * 2. An orphaned page (see truncate_complete_page) might have
+	 * fs-private metadata. The page can be picked up due to memory
+	 * offlining.  Everywhere else except page reclaim, the page is
+	 * invisible to the vm, so the page can not be migrated.  So try to
+	 * free the metadata, so the page can be freed.
+	 */
+	if (!from_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(from_page), from_page);
+		if (page_has_private(from_page)) {
+			try_to_free_buffers(from_page);
+			goto out_unlock_both;
+		}
+	} else if (page_mapped(from_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(from_page) && !PageKsm(from_page) &&
+					   !from_anon_vma, from_page);
+		try_to_unmap(from_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		from_page_was_mapped = 1;
+	}
+
+	if (!to_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(to_page), to_page);
+		if (page_has_private(to_page)) {
+			try_to_free_buffers(to_page);
+			goto out_unlock_both_remove_from_migration_pte;
+		}
+	} else if (page_mapped(to_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(to_page) && !PageKsm(to_page) &&
+						!to_anon_vma, to_page);
+		try_to_unmap(to_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		to_page_was_mapped = 1;
+	}
+
+	if (!page_mapped(from_page) && !page_mapped(to_page))
+		rc = exchange_from_to_pages(to_page, from_page, mode);
+
+
+	if (to_page_was_mapped) {
+		/* swap back to_page->index to be compatible with
+		 * remove_migration_ptes(), which assumes both from_page and to_page
+		 * below have the same index.
+		 */
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(to_page->index, to_index);
+
+		remove_migration_ptes(to_page,
+			rc == MIGRATEPAGE_SUCCESS ? from_page : to_page, false);
+
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(to_page->index, to_index);
+	}
+
+out_unlock_both_remove_from_migration_pte:
+	if (from_page_was_mapped) {
+		/* swap back from_page->index to be compatible with
+		 * remove_migration_ptes(), which assumes both from_page and to_page
+		 * below have the same index.
+		 */
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(from_page->index, from_index);
+
+		remove_migration_ptes(from_page,
+			rc == MIGRATEPAGE_SUCCESS ? to_page : from_page, false);
+
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(from_page->index, from_index);
+	}
+
+out_unlock_both:
+	if (to_anon_vma)
+		put_anon_vma(to_anon_vma);
+	unlock_page(to_page);
+out_unlock:
+	/* Drop an anon_vma reference if we took one */
+	if (from_anon_vma)
+		put_anon_vma(from_anon_vma);
+	unlock_page(from_page);
+out:
+	return rc;
+}
+
+/*
+ * Exchange pages in the exchange_list
+ *
+ * Caller should release the exchange_list resource.
+ *
+ */
+static int exchange_pages(struct list_head *exchange_list,
+			enum migrate_mode mode,
+			int reason)
+{
+	struct exchange_page_info *one_pair, *one_pair2;
+	int failed = 0;
+
+	list_for_each_entry_safe(one_pair, one_pair2, exchange_list, list) {
+		struct page *from_page = one_pair->from_page;
+		struct page *to_page = one_pair->to_page;
+		int rc;
+		int retry = 0;
+
+again:
+		if (page_count(from_page) == 1) {
+			/* page was freed from under us. So we are done  */
+			ClearPageActive(from_page);
+			ClearPageUnevictable(from_page);
+
+			mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+					page_is_file_cache(from_page),
+					-hpage_nr_pages(from_page));
+			put_page(from_page);
+
+			if (page_count(to_page) == 1) {
+				ClearPageActive(to_page);
+				ClearPageUnevictable(to_page);
+				put_page(to_page);
+				mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+						page_is_file_cache(to_page),
+						-hpage_nr_pages(to_page));
+			} else
+				goto putback_to_page;
+
+			continue;
+		}
+
+		if (page_count(to_page) == 1) {
+			/* page was freed from under us. So we are done  */
+			ClearPageActive(to_page);
+			ClearPageUnevictable(to_page);
+
+			mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+					page_is_file_cache(to_page),
+					-hpage_nr_pages(to_page));
+			put_page(to_page);
+
+			mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+					page_is_file_cache(from_page),
+					-hpage_nr_pages(from_page));
+			putback_lru_page(from_page);
+			continue;
+		}
+
+		/* TODO: compound page not supported */
+		/* to_page can be file-backed page  */
+		if (PageCompound(from_page) ||
+			page_mapping(from_page)
+			) {
+			++failed;
+			goto putback;
+		}
+
+		rc = unmap_and_exchange(from_page, to_page, mode);
+
+		if (rc == -EAGAIN && retry < 3) {
+			++retry;
+			goto again;
+		}
+
+		if (rc != MIGRATEPAGE_SUCCESS)
+			++failed;
+
+putback:
+		mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+				page_is_file_cache(from_page),
+				-hpage_nr_pages(from_page));
+
+		putback_lru_page(from_page);
+putback_to_page:
+		mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+				page_is_file_cache(to_page),
+				-hpage_nr_pages(to_page));
+
+		putback_lru_page(to_page);
+	}
+	return failed;
+}
+
+int exchange_two_pages(struct page *page1, struct page *page2)
+{
+	struct exchange_page_info page_info;
+	LIST_HEAD(exchange_list);
+	int err = -EFAULT;
+	int pagevec_flushed = 0;
+
+	VM_BUG_ON_PAGE(PageTail(page1), page1);
+	VM_BUG_ON_PAGE(PageTail(page2), page2);
+
+	if (!(PageLRU(page1) && PageLRU(page2)))
+		return -EBUSY;
+
+retry_isolate1:
+	if (!get_page_unless_zero(page1))
+		return -EBUSY;
+	err = isolate_lru_page(page1);
+	put_page(page1);
+	if (err) {
+		if (!pagevec_flushed) {
+			migrate_prep();
+			pagevec_flushed = 1;
+			goto retry_isolate1;
+		}
+		return err;
+	}
+	mod_node_page_state(page_pgdat(page1),
+			NR_ISOLATED_ANON + page_is_file_cache(page1),
+			hpage_nr_pages(page1));
+
+retry_isolate2:
+	if (!get_page_unless_zero(page2)) {
+		putback_lru_page(page1);
+		return -EBUSY;
+	}
+	err = isolate_lru_page(page2);
+	put_page(page2);
+	if (err) {
+		if (!pagevec_flushed) {
+			migrate_prep();
+			pagevec_flushed = 1;
+			goto retry_isolate2;
+		}
+		return err;
+	}
+	mod_node_page_state(page_pgdat(page2),
+			NR_ISOLATED_ANON + page_is_file_cache(page2),
+			hpage_nr_pages(page2));
+
+	page_info.from_page = page1;
+	page_info.to_page = page2;
+	INIT_LIST_HEAD(&page_info.list);
+	list_add(&page_info.list, &exchange_list);
+
+
+	return exchange_pages(&exchange_list, MIGRATE_SYNC, 0);
+
+}
diff --git a/mm/internal.h b/mm/internal.h
index f4a7bb02decf..77e205c423ce 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -543,4 +543,10 @@  static inline bool is_migrate_highatomic_page(struct page *page)
 
 void setup_zone_pageset(struct zone *zone);
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+bool buffer_migrate_lock_buffers(struct buffer_head *head,
+							enum migrate_mode mode);
+int writeout(struct address_space *mapping, struct page *page);
+extern int exchange_two_pages(struct page *page1, struct page *page2);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 6c48ad13b4c9..dc1ec06b71a0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2665,6 +2665,41 @@  void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 		set_page_stable_node(oldpage, NULL);
 	}
 }
+
+void ksm_exchange_page(struct page *to_page, struct page *from_page)
+{
+	struct stable_node *to_stable_node, *from_stable_node;
+
+	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+
+	to_stable_node = page_stable_node(to_page);
+	from_stable_node = page_stable_node(from_page);
+	if (to_stable_node) {
+		VM_BUG_ON_PAGE(to_stable_node->kpfn != page_to_pfn(from_page),
+					from_page);
+		to_stable_node->kpfn = page_to_pfn(to_page);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+	}
+	if (from_stable_node) {
+		VM_BUG_ON_PAGE(from_stable_node->kpfn != page_to_pfn(to_page),
+					to_page);
+		from_stable_node->kpfn = page_to_pfn(from_page);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+	}
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..b8c79aa62134 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -701,7 +701,7 @@  EXPORT_SYMBOL(migrate_page);
 
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+bool buffer_migrate_lock_buffers(struct buffer_head *head,
 							enum migrate_mode mode)
 {
 	struct buffer_head *bh = head;
@@ -849,7 +849,7 @@  int buffer_migrate_page_norefs(struct address_space *mapping,
 /*
  * Writeback a page to clean the dirty state
  */
-static int writeout(struct address_space *mapping, struct page *page)
+int writeout(struct address_space *mapping, struct page *page)
 {
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,