
[RFC,v6,71/92] mm: add support for remote mapping

Message ID 20190809160047.8319-72-alazar@bitdefender.com (mailing list archive)
State New, archived
Series VM introspection

Commit Message

Adalbert Lazăr Aug. 9, 2019, 4 p.m. UTC
From: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>

The following four new mm exports are introduced:
 * mm_remote_map(struct mm_struct *req_mm,
                 unsigned long req_hva,
                 unsigned long map_hva)
 * mm_remote_unmap(unsigned long map_hva)
 * mm_remote_reset(void)
 * rmap_walk_remote(struct page *page,
                    struct rmap_walk_control *rwc)

This patch allows one process to map into its address space a page from
another process. The previous page (if it exists) is dropped. There are
no corresponding system calls, as this API is meant to be used only by
the kernel itself.
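
For illustration only, a kernel-side caller of this API might look like
the sketch below; the use of get_task_mm() is an assumption made for the
example, the real caller being the KVMI code introduced later in the
series:

    /* Illustrative sketch, not part of the patch. */
    #include <linux/remote_mapping.h>
    #include <linux/sched/mm.h>

    static int example_map_one_page(struct task_struct *target,
                                    unsigned long req_hva,
                                    unsigned long map_hva)
    {
            struct mm_struct *req_mm;
            int err;

            req_mm = get_task_mm(target);   /* pin the target mm */
            if (!req_mm)
                    return -ESRCH;

            /* back map_hva (in the client/calling process) with the
             * target's page at req_hva */
            err = mm_remote_map(req_mm, req_hva, map_hva);

            mmput(req_mm);
            return err;
    }

    /* Later, the mapping is torn down by the client address alone:
     *      mm_remote_unmap(map_hva);
     */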

The targeted user is the upcoming KVM VM introspection subsystem (KVMI),
where an introspector running in its own VM will map pages from the
introspected guest in order to eliminate round trips to the host kernel
(read/write guest pages).

The flow is as follows: the introspector identifies a guest physical
address where some information of interest is located. It creates a
one-page anonymous mapping with MAP_LOCKED | MAP_POPULATE and calls the
kernel via an IOCTL on /dev/kvmmem, giving the map virtual address and
the guest physical address as arguments. The kernel converts the map
va into a physical page (a gpa in KVM-speak) and passes it to the host
kernel via a hypercall, along with the introspected guest gpa. The host
kernel converts the two gpas into their corresponding hvas (host virtual
addresses) and makes sure that the VMA backing the page belonging to the
VM in which the introspector runs points to the indicated page in the
introspected guest. I have not included here the use of the mapping
token described in the KVMI documentation.
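
For illustration, the introspector side of this flow would look roughly
like the sketch below; KVMMEM_MAP and struct kvmmem_map_req are made-up
placeholder names, not the actual uapi added by this patch:

    /* Illustrative userspace sketch; ioctl name and struct are placeholders. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    struct kvmmem_map_req {                 /* placeholder layout */
            uint64_t gpa;                   /* introspected guest physical address */
            uint64_t map_va;                /* map virtual address in this process */
    };

    #define KVMMEM_MAP _IOW('k', 0x01, struct kvmmem_map_req)  /* placeholder */

    static void *map_guest_page(int kvmmem_fd, uint64_t guest_gpa)
    {
            struct kvmmem_map_req req;
            void *va;

            /* one locked, populated page so there is a backing page to replace */
            va = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED | MAP_POPULATE,
                      -1, 0);
            if (va == MAP_FAILED)
                    return NULL;

            req.gpa = guest_gpa;
            req.map_va = (uintptr_t)va;
            if (ioctl(kvmmem_fd, KVMMEM_MAP, &req) < 0) {
                    munmap(va, 4096);
                    return NULL;
            }
            return va;      /* now backed by the introspected guest's page */
    }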

Signed-off-by: Mircea Cîrjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/page-flags.h          |    9 +-
 include/linux/remote_mapping.h      |  167 +++
 include/uapi/linux/remote_mapping.h |   18 +
 mm/Kconfig                          |    8 +
 mm/Makefile                         |    1 +
 mm/memory-failure.c                 |   69 +-
 mm/migrate.c                        |    9 +-
 mm/remote_mapping.c                 | 1834 +++++++++++++++++++++++++++
 mm/rmap.c                           |   13 +-
 mm/vmscan.c                         |    3 +-
 10 files changed, 2108 insertions(+), 23 deletions(-)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c

Comments

Matthew Wilcox Aug. 9, 2019, 4:24 p.m. UTC | #1
On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> +++ b/include/linux/page-flags.h
> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>   */
>  #define PAGE_MAPPING_ANON	0x1
>  #define PAGE_MAPPING_MOVABLE	0x2
> +#define PAGE_MAPPING_REMOTE	0x4

Uh.  How do you know page->mapping would otherwise have bit 2 clear?
Who's guaranteeing that?

This is an awfully big patch to the memory management code, buried in
the middle of a gigantic series which almost guarantees nobody would
look at it.  I call shenanigans.

> @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
>   * __page_set_anon_rmap - set up new anonymous rmap
>   * @page:	Page or Hugepage to add to rmap
>   * @vma:	VM area to add page to.
> - * @address:	User virtual address of the mapping	
> + * @address:	User virtual address of the mapping

And mixing in fluff changes like this is a real no-no.  Try again.
Paolo Bonzini Aug. 13, 2019, 9:29 a.m. UTC | #2
On 09/08/19 18:24, Matthew Wilcox wrote:
> On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
>> +++ b/include/linux/page-flags.h
>> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>>   */
>>  #define PAGE_MAPPING_ANON	0x1
>>  #define PAGE_MAPPING_MOVABLE	0x2
>> +#define PAGE_MAPPING_REMOTE	0x4
> Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> Who's guaranteeing that?
> 
> This is an awfully big patch to the memory management code, buried in
> the middle of a gigantic series which almost guarantees nobody would
> look at it.  I call shenanigans.

Are you calling shenanigans on the patch submitter (which is gratuitous)
or on the KVM maintainers/reviewers?

It's not true that nobody would look at it.  Of course no one from
linux-mm is going to look at it, but the maintainer that looks at the
gigantic series is very much expected to look at it and explain to the
submitter that this patch is unacceptable as is.

In fact I shouldn't have to explain this to you; you know better than
believing that I would try to sneak it past the mm folks.  I am puzzled.

Paolo
Matthew Wilcox Aug. 13, 2019, 11:24 a.m. UTC | #3
On Tue, Aug 13, 2019 at 11:29:07AM +0200, Paolo Bonzini wrote:
> On 09/08/19 18:24, Matthew Wilcox wrote:
> > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> >> +++ b/include/linux/page-flags.h
> >> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> >>   */
> >>  #define PAGE_MAPPING_ANON	0x1
> >>  #define PAGE_MAPPING_MOVABLE	0x2
> >> +#define PAGE_MAPPING_REMOTE	0x4
> > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > Who's guaranteeing that?
> > 
> > This is an awfully big patch to the memory management code, buried in
> > the middle of a gigantic series which almost guarantees nobody would
> > look at it.  I call shenanigans.
> 
> Are you calling shenanigans on the patch submitter (which is gratuitous)
> or on the KVM maintainers/reviewers?

On the patch submitter, of course.  How can I possibly be criticising you
for something you didn't do?
Paolo Bonzini Aug. 13, 2019, 12:02 p.m. UTC | #4
On 13/08/19 13:24, Matthew Wilcox wrote:
>>>
>>> This is an awfully big patch to the memory management code, buried in
>>> the middle of a gigantic series which almost guarantees nobody would
>>> look at it.  I call shenanigans.
>> Are you calling shenanigans on the patch submitter (which is gratuitous)
>> or on the KVM maintainers/reviewers?
>
> On the patch submitter, of course.  How can I possibly be criticising you
> for something you didn't do?

No idea.  "Nobody would look at it" definitely includes me though.

In any case, water under the bridge.  The submitter did duly mark the
series as RFC, I don't see anything wrong in what he did apart from not
having testcases. :)

Paolo
Jerome Glisse Aug. 15, 2019, 7:19 p.m. UTC | #5
On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org> wrote:
> > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > +++ b/include/linux/page-flags.h
> > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > >   */
> > >  #define PAGE_MAPPING_ANON	0x1
> > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > +#define PAGE_MAPPING_REMOTE	0x4
> > 
> > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > Who's guaranteeing that?
> > 
> > This is an awfully big patch to the memory management code, buried in
> > the middle of a gigantic series which almost guarantees nobody would
> > look at it.  I call shenanigans.
> > 
> > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
> > >   * __page_set_anon_rmap - set up new anonymous rmap
> > >   * @page:	Page or Hugepage to add to rmap
> > >   * @vma:	VM area to add page to.
> > > - * @address:	User virtual address of the mapping	
> > > + * @address:	User virtual address of the mapping
> > 
> > And mixing in fluff changes like this is a real no-no.  Try again.
> > 
> 
> No bad intentions, just overzealous.
> I didn't want to hide anything from our patches.
> Once we advance with the introspection patches related to KVM we'll be
> back with the remote mapping patch, split and cleaned.

There are no bits left in struct page!  Looking at the patch it seems
you want to have your own pin count just for KVM. This is bad; we are
already trying to solve the GUP thing (see the various patchsets about
GUP posted recently).

You need to rethink how you want to achieve this. Why not simply do a
remote read()/write() into the process memory, i.e. KVMI would call
an ioctl that allows reading or writing a remote process's memory,
like ptrace() but on steroids ...
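
(For a concrete picture of the kind of interface meant here: the existing
process_vm_readv()/process_vm_writev() syscalls already do a one-shot copy
from/to another process's memory, at the cost of a round trip per access.
A minimal sketch:)

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Copy len bytes from remote_addr in process pid into buf; needs
     * ptrace-like permissions on the target process. */
    static ssize_t read_remote(pid_t pid, void *remote_addr,
                               void *buf, size_t len)
    {
            struct iovec local  = { .iov_base = buf,         .iov_len = len };
            struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

            return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }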

Adding this whole big, complex infrastructure without justifying why
we need to avoid the round trips is just too much, really.

Cheers,
Jérôme
Jerome Glisse Aug. 15, 2019, 8:16 p.m. UTC | #6
On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org> wrote:
> > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > >   */
> > > >  #define PAGE_MAPPING_ANON	0x1
> > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > 
> > > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > > Who's guaranteeing that?
> > > 
> > > This is an awfully big patch to the memory management code, buried in
> > > the middle of a gigantic series which almost guarantees nobody would
> > > look at it.  I call shenanigans.
> > > 
> > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
> > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > >   * @page:	Page or Hugepage to add to rmap
> > > >   * @vma:	VM area to add page to.
> > > > - * @address:	User virtual address of the mapping	
> > > > + * @address:	User virtual address of the mapping
> > > 
> > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > 
> > 
> > No bad intentions, just overzealous.
> > I didn't want to hide anything from our patches.
> > Once we advance with the introspection patches related to KVM we'll be
> > back with the remote mapping patch, split and cleaned.
> 
> They are not bit left in struct page ! Looking at the patch it seems
> you want to have your own pin count just for KVM. This is bad, we are
> already trying to solve the GUP thing (see all various patchset about
> GUP posted recently).
> 
> You need to rethink how you want to achieve this. Why not simply a
> remote read()/write() into the process memory ie KVMI would call
> an ioctl that allow to read or write into a remote process memory
> like ptrace() but on steroid ...
> 
> Adding this whole big complex infrastructure without justification
> of why we need to avoid round trip is just too much really.

Thinking a bit more about this, you can achieve the same thing without
adding a single line to any mm code. Instead of having mmap with
PROT_NONE | MAP_LOCKED, you have userspace mmap some kvm device file
(I am assuming this is something you already have and that you can
control the mmap callback).

So now, kernel side, you have a vma with a vm_operations_struct under
your control. This means that everything you want to block mm-wise
from within the inspector process can be blocked through those
callbacks (find_special_page() specifically, for which you have to
return NULL all the time).

To mirror the target process memory you can use hmm_mirror; when you
populate the inspector process page table you use insert_pfn() (the
mmap of the kvm device file must mark this vma as PFNMAP).

By following the hmm_mirror API, any time the target process has
a change in its page table (i.e. virtual address -> page) you will
get a callback, and all you have to do is clear the page table
within the inspector process and flush the TLB (use zap_page_range).

On a page fault within the inspector process, the fault callback of
vm_ops will get called, and from there you call hmm_mirror following
its API.

Oh, also mark the vma with VM_WIPEONFORK to avoid any issue if the
inspector process uses fork() (you could support fork, but then you
would need to mark the vma as SHARED and use unmap_mapping_pages
instead of zap_page_range).


There, everything you want to do can be done with already-upstream mm code.
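
(A rough sketch of the device-file wiring described above; the
kvmi_mirror_* names are invented for illustration and the hmm_mirror
registration/fault plumbing is omitted:)

    #include <linux/fs.h>
    #include <linux/mm.h>

    static vm_fault_t kvmi_mirror_fault(struct vm_fault *vmf)
    {
            /* here: look up the target page via hmm_mirror, then
             * vmf_insert_pfn() it into the inspector's page table */
            return VM_FAULT_SIGBUS;         /* placeholder */
    }

    static struct page *kvmi_mirror_find_special_page(struct vm_area_struct *vma,
                                                      unsigned long addr)
    {
            return NULL;    /* never expose a struct page for this mapping */
    }

    static const struct vm_operations_struct kvmi_mirror_vm_ops = {
            .fault             = kvmi_mirror_fault,
            .find_special_page = kvmi_mirror_find_special_page,
    };

    static int kvmi_mirror_mmap(struct file *file, struct vm_area_struct *vma)
    {
            vma->vm_ops = &kvmi_mirror_vm_ops;
            /* special mapping: PFN-based, not expandable, wiped on fork */
            vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_WIPEONFORK;
            return 0;
    }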

Cheers,
Jérôme
Jason Gunthorpe Aug. 16, 2019, 5:45 p.m. UTC | #7
On Thu, Aug 15, 2019 at 04:16:30PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org> wrote:
> > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > >   */
> > > > >  #define PAGE_MAPPING_ANON	0x1
> > > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > > 
> > > > Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> > > > Who's guaranteeing that?
> > > > 
> > > > This is an awfully big patch to the memory management code, buried in
> > > > the middle of a gigantic series which almost guarantees nobody would
> > > > look at it.  I call shenanigans.
> > > > 
> > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
> > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > >   * @page:	Page or Hugepage to add to rmap
> > > > >   * @vma:	VM area to add page to.
> > > > > - * @address:	User virtual address of the mapping	
> > > > > + * @address:	User virtual address of the mapping
> > > > 
> > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > > 
> > > 
> > > No bad intentions, just overzealous.
> > > I didn't want to hide anything from our patches.
> > > Once we advance with the introspection patches related to KVM we'll be
> > > back with the remote mapping patch, split and cleaned.
> > 
> > They are not bit left in struct page ! Looking at the patch it seems
> > you want to have your own pin count just for KVM. This is bad, we are
> > already trying to solve the GUP thing (see all various patchset about
> > GUP posted recently).
> > 
> > You need to rethink how you want to achieve this. Why not simply a
> > remote read()/write() into the process memory ie KVMI would call
> > an ioctl that allow to read or write into a remote process memory
> > like ptrace() but on steroid ...
> > 
> > Adding this whole big complex infrastructure without justification
> > of why we need to avoid round trip is just too much really.
> 
> Thinking a bit more about this, you can achieve the same thing without
> adding a single line to any mm code. Instead of having mmap with
> PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device file
> (i am assuming this is something you already have and can control
> the mmap callback).
> 
> So now kernel side you have a vma with a vm_operations_struct under
> your control this means that everything you want to block mm wise
> from within the inspector process can be block through those call-
> backs (find_special_page() specificaly for which you have to return
> NULL all the time).

I'm actually aware of a couple of use cases that would like to
mirror the VA of one process into another. One big one in the HPC
world is the out-of-tree 'xpmem', still in wide use today. xpmem is
basically what Jerome described above.

If you do an approach like Jerome describes, it would be nice if it
were a general facility and not buried in KVM.

I know past xpmem adventures ran into trouble with locking/etc - i.e.
getting the mm_struct of the victim seemed a bit hard for some reason,
but maybe that could be done with an FD pass 'ioctl(I AM THE VICTIM)'?
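
(A sketch of how that FD-pass idea could look driver-side; all names here
are invented for illustration:)

    #include <linux/fs.h>
    #include <linux/ioctl.h>
    #include <linux/sched/mm.h>

    #define XMEM_I_AM_THE_VICTIM _IO('x', 1)        /* made-up ioctl */

    struct xmem_ctx {
            struct mm_struct *mm;
    };

    static long xmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
    {
            struct xmem_ctx *ctx = file->private_data;

            switch (cmd) {
            case XMEM_I_AM_THE_VICTIM:
                    /* the victim itself calls this, so current->mm is the
                     * victim mm; keep the mm_struct alive for later users
                     * (they would still mmget_not_zero() before use) */
                    ctx->mm = current->mm;
                    mmgrab(ctx->mm);
                    return 0;
            default:
                    return -ENOTTY;
            }
    }

    /* The fd is then handed to the inspector over SCM_RIGHTS. */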

Jason
Mircea CIRJALIU - MELIU Aug. 23, 2019, 12:39 p.m. UTC | #8
> On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org>
> wrote:
> > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > >   */
> > > > >  #define PAGE_MAPPING_ANON	0x1
> > > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > >
> > > > Uh.  How do you know page->mapping would otherwise have bit 2
> clear?
> > > > Who's guaranteeing that?
> > > >
> > > > This is an awfully big patch to the memory management code, buried
> > > > in the middle of a gigantic series which almost guarantees nobody
> > > > would look at it.  I call shenanigans.
> > > >
> > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page
> *page, struct vm_area_struct *vma)
> > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > >   * @page:	Page or Hugepage to add to rmap
> > > > >   * @vma:	VM area to add page to.
> > > > > - * @address:	User virtual address of the mapping
> > > > > + * @address:	User virtual address of the mapping
> > > >
> > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > >
> > >
> > > No bad intentions, just overzealous.
> > > I didn't want to hide anything from our patches.
> > > Once we advance with the introspection patches related to KVM we'll
> > > be back with the remote mapping patch, split and cleaned.
> >
> > They are not bit left in struct page ! Looking at the patch it seems
> > you want to have your own pin count just for KVM. This is bad, we are
> > already trying to solve the GUP thing (see all various patchset about
> > GUP posted recently).
> >
> > You need to rethink how you want to achieve this. Why not simply a
> > remote read()/write() into the process memory ie KVMI would call an
> > ioctl that allow to read or write into a remote process memory like
> > ptrace() but on steroid ...
> >
> > Adding this whole big complex infrastructure without justification of
> > why we need to avoid round trip is just too much really.
> 
> Thinking a bit more about this, you can achieve the same thing without
> adding a single line to any mm code. Instead of having mmap with
> PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device
> file (i am assuming this is something you already have and can control the
> mmap callback).
> 
> So now kernel side you have a vma with a vm_operations_struct under your
> control this means that everything you want to block mm wise from within
> the inspector process can be block through those call- backs
> (find_special_page() specificaly for which you have to return NULL all the
> time).
> 
> To mirror target process memory you can use hmm_mirror, when you
> populate the inspector process page table you use insert_pfn() (mmap of
> the kvm device file must mark this vma as PFNMAP).
> 
> By following the hmm_mirror API, anytime the target process has a change in
> its page table (ie virtual address -> page) you will get a callback and all you
> have to do is clear the page table within the inspector process and flush tlb
> (use zap_page_range).
> 
> On page fault within the inspector process the fault callback of vm_ops will
> get call and from there you call hmm_mirror following its API.
> 
> Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
> inspector process use fork() (you could support fork but then you would
> need to mark the vma as SHARED and use unmap_mapping_pages instead of
> zap_page_range).
> 
> 
> There everything you want to do with already upstream mm code.

I'm the author of remote mapping, so I owe everybody some explanations.
My requirement was to map pages from one QEMU process to another QEMU 
process (our inspector process works in a virtual machine of its own). So I had 
to implement a KSM-like page sharing between processes, where an anon page
from the target QEMU's working memory is promoted to a remote page and 
mapped in the inspector QEMU's working memory (both anon VMAs). 
The extra page flag is for differentiating the page for rmap walking.

The mapping requests come at PAGE_SIZE granularity for random addresses 
within the target/inspector QEMUs, so I couldn't do any linear mapping that
would keep things simpler. 

I have an extra patch that does remote mapping by mirroring an entire VMA
from the target process by way of a device file. This thing creates a separate 
mirror VMA in my inspector process (at the moment a QEMU), but then I 
bumped into the KVM hva->gpa mapping, which makes it hard to override 
mappings with addresses outside memslot associated VMAs.

Mircea
Jerome Glisse Sept. 5, 2019, 6:09 p.m. UTC | #9
On Fri, Aug 23, 2019 at 12:39:21PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Thu, Aug 15, 2019 at 03:19:29PM -0400, Jerome Glisse wrote:
> > > On Tue, Aug 13, 2019 at 02:01:35PM +0300, Adalbert Lazăr wrote:
> > > > On Fri, 9 Aug 2019 09:24:44 -0700, Matthew Wilcox <willy@infradead.org>
> > wrote:
> > > > > On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
> > > > > > +++ b/include/linux/page-flags.h
> > > > > > @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
> > > > > >   */
> > > > > >  #define PAGE_MAPPING_ANON	0x1
> > > > > >  #define PAGE_MAPPING_MOVABLE	0x2
> > > > > > +#define PAGE_MAPPING_REMOTE	0x4
> > > > >
> > > > > Uh.  How do you know page->mapping would otherwise have bit 2
> > clear?
> > > > > Who's guaranteeing that?
> > > > >
> > > > > This is an awfully big patch to the memory management code, buried
> > > > > in the middle of a gigantic series which almost guarantees nobody
> > > > > would look at it.  I call shenanigans.
> > > > >
> > > > > > @@ -1021,7 +1022,7 @@ void page_move_anon_rmap(struct page
> > *page, struct vm_area_struct *vma)
> > > > > >   * __page_set_anon_rmap - set up new anonymous rmap
> > > > > >   * @page:	Page or Hugepage to add to rmap
> > > > > >   * @vma:	VM area to add page to.
> > > > > > - * @address:	User virtual address of the mapping
> > > > > > + * @address:	User virtual address of the mapping
> > > > >
> > > > > And mixing in fluff changes like this is a real no-no.  Try again.
> > > > >
> > > >
> > > > No bad intentions, just overzealous.
> > > > I didn't want to hide anything from our patches.
> > > > Once we advance with the introspection patches related to KVM we'll
> > > > be back with the remote mapping patch, split and cleaned.
> > >
> > > They are not bit left in struct page ! Looking at the patch it seems
> > > you want to have your own pin count just for KVM. This is bad, we are
> > > already trying to solve the GUP thing (see all various patchset about
> > > GUP posted recently).
> > >
> > > You need to rethink how you want to achieve this. Why not simply a
> > > remote read()/write() into the process memory ie KVMI would call an
> > > ioctl that allow to read or write into a remote process memory like
> > > ptrace() but on steroid ...
> > >
> > > Adding this whole big complex infrastructure without justification of
> > > why we need to avoid round trip is just too much really.
> > 
> > Thinking a bit more about this, you can achieve the same thing without
> > adding a single line to any mm code. Instead of having mmap with
> > PROT_NONE | MAP_LOCKED you have userspace mmap some kvm device
> > file (i am assuming this is something you already have and can control the
> > mmap callback).
> > 
> > So now kernel side you have a vma with a vm_operations_struct under your
> > control this means that everything you want to block mm wise from within
> > the inspector process can be block through those call- backs
> > (find_special_page() specificaly for which you have to return NULL all the
> > time).
> > 
> > To mirror target process memory you can use hmm_mirror, when you
> > populate the inspector process page table you use insert_pfn() (mmap of
> > the kvm device file must mark this vma as PFNMAP).
> > 
> > By following the hmm_mirror API, anytime the target process has a change in
> > its page table (ie virtual address -> page) you will get a callback and all you
> > have to do is clear the page table within the inspector process and flush tlb
> > (use zap_page_range).
> > 
> > On page fault within the inspector process the fault callback of vm_ops will
> > get call and from there you call hmm_mirror following its API.
> > 
> > Oh also mark the vma with VM_WIPEONFORK to avoid any issue if the
> > inspector process use fork() (you could support fork but then you would
> > need to mark the vma as SHARED and use unmap_mapping_pages instead of
> > zap_page_range).
> > 
> > 
> > There everything you want to do with already upstream mm code.
> 
> I'm the author of remote mapping, so I owe everybody some explanations.
> My requirement was to map pages from one QEMU process to another QEMU 
> process (our inspector process works in a virtual machine of its own). So I had 
> to implement a KSM-like page sharing between processes, where an anon page
> from the target QEMU's working memory is promoted to a remote page and 
> mapped in the inspector QEMU's working memory (both anon VMAs). 
> The extra page flag is for differentiating the page for rmap walking.
> 
> The mapping requests come at PAGE_SIZE granularity for random addresses 
> within the target/inspector QEMUs, so I couldn't do any linear mapping that
> would keep things simpler. 
> 
> I have an extra patch that does remote mapping by mirroring an entire VMA
> from the target process by way of a device file. This thing creates a separate 
> mirror VMA in my inspector process (at the moment a QEMU), but then I 
> bumped into the KVM hva->gpa mapping, which makes it hard to override 
> mappings with addresses outside memslot associated VMAs.

Not sure I understand; are you saying that the solution I outlined above
does not work?  If so then I think you are wrong: in the above solution
the importing process mmaps a device file and the resulting vma is then
populated using insert_pfn() and constantly kept synchronized with the
target process through mirroring, which means that you never have to look
at the struct page ... you can mirror any kind of memory from the remote
process.

Am I misunderstanding something here?

Cheers,
Jérôme
Paolo Bonzini Sept. 9, 2019, 5 p.m. UTC | #10
On 05/09/19 20:09, Jerome Glisse wrote:
> Not sure i understand, you are saying that the solution i outline above
> does not work ? If so then i think you are wrong, in the above solution
> the importing process mmap a device file and the resulting vma is then
> populated using insert_pfn() and constantly keep synchronize with the
> target process through mirroring which means that you never have to look
> at the struct page ... you can mirror any kind of memory from the remote
> process.

If insert_pfn in turn calls MMU notifiers for the target VMA (which
would be the KVM MMU notifier), then that would work.  Though I guess it
would be possible to call MMU notifier update callbacks around the call
to insert_pfn.
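
(A sketch of what "MMU notifier calls around insert_pfn" could look like;
the exact mmu_notifier_range_init() arguments differ between kernel
versions, so treat this only as an approximation:)

    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    static vm_fault_t mirror_insert_one(struct vm_area_struct *vma,
                                        unsigned long addr, unsigned long pfn)
    {
            struct mmu_notifier_range range;
            vm_fault_t ret;

            /* tell registered notifiers (e.g. KVM's) that this range changes */
            mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
                                    vma->vm_mm, addr, addr + PAGE_SIZE);
            mmu_notifier_invalidate_range_start(&range);

            ret = vmf_insert_pfn(vma, addr, pfn);   /* populate the target pte */

            mmu_notifier_invalidate_range_end(&range);
            return ret;
    }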

Paolo
Mircea CIRJALIU - MELIU Sept. 10, 2019, 7:49 a.m. UTC | #11
> On 05/09/19 20:09, Jerome Glisse wrote:
> > Not sure i understand, you are saying that the solution i outline
> > above does not work ? If so then i think you are wrong, in the above
> > solution the importing process mmap a device file and the resulting
> > vma is then populated using insert_pfn() and constantly keep
> > synchronize with the target process through mirroring which means that
> > you never have to look at the struct page ... you can mirror any kind
> > of memory from the remote process.
> 
> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> the KVM MMU notifier), then that would work.  Though I guess it would be
> possible to call MMU notifier update callbacks around the call to insert_pfn.

Can't do that.
First, insert_pfn() uses set_pte_at(), which won't trigger the MMU notifier
on the target VMA. It's also static, so I'll have to access it through
vmf_insert_pfn() or vmf_insert_mixed().

Our model (the importing process is encapsulated in another VM) forces us
to mirror certain pages from the anon VMA backing one VM's system RAM to 
the other VM's anon VMA. 

Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
the target anon VMA, but I guess this breaks the VMA. Is this recommended?

Then, mapping anon pages from one VMA to another without fixing the 
refcount and the mapcount breaks the daemons that think they're working 
on a pure anon VMA (kcompactd, khugepaged).

Mircea
Paolo Bonzini Oct. 2, 2019, 1:46 p.m. UTC | #12
On 02/10/19 21:27, Jerome Glisse wrote:
> On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
>>> On 05/09/19 20:09, Jerome Glisse wrote:
>>>> Not sure i understand, you are saying that the solution i outline
>>>> above does not work ? If so then i think you are wrong, in the above
>>>> solution the importing process mmap a device file and the resulting
>>>> vma is then populated using insert_pfn() and constantly keep
>>>> synchronize with the target process through mirroring which means that
>>>> you never have to look at the struct page ... you can mirror any kind
>>>> of memory from the remote process.
>>>
>>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
>>> the KVM MMU notifier), then that would work.  Though I guess it would be
>>> possible to call MMU notifier update callbacks around the call to insert_pfn.
>>
>> Can't do that.
>> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
>> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
>> or vmf_insert_mixed().
> 
> Why would you need to target mmu notifier on target vma ?

If the mapping of the source VMA changes, mirroring can update the
target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
dismantles its own existing page tables (so that they can be recreated
with the new mapping from the source VMA)?

Thanks,

Paolo

> You do not need
> that. The workflow is:
> 
>     userspace:
>         ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)
> 
> Then when the mirroring process access ptr it triggers page fault that
> endup in the vm_operation_struct->fault() which is just doing:
> 
>     kernel-kvm-mirroring-function:
>         kvm_mirror_page_fault(struct vm_fault *vmf) {
>             struct kvm_mirror_struct *kvmms;
> 
>             kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
>             ...
>         again:
>             hmm_range_register(&range);
>             hmm_range_snapshot(&range);
>             take_lock(kvmms->update);
>             if (!hmm_range_valid(&range)) {
>                 vm_insert_pfn();
>                 drop_lock(kvmms->update);
>                 hmm_range_unregister(&range);
>                 return VM_FAULT_NOPAGE;
>             }
>             drop_lock(kvmms->update);
>             goto again;
>         }
> 
> The notifier callback:
>         kvmms_notifier_start() {
>             take_lock(kvmms->update);
>             clear_pte(start, end);
>             drop_lock(kvmms->update);
>         }
> 
>>
>> Our model (the importing process is encapsulated in another VM) forces us
>> to mirror certain pages from the anon VMA backing one VM's system RAM to 
>> the other VM's anon VMA. 
> 
> The mirror does not have to be an anon vma it can very well be a
> device vma ie mmap of a device file. I do not see any reasons why
> the mirror need to be an anon vma. Please explain why.
> 
>>
>> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
>> the target anon VMA, but I guess this breaks the VMA. Is this recommended?
> 
> The mirror vma should not be an anon vma.
> 
>>
>> Then, mapping anon pages from one VMA to another without fixing the 
>> refcount and the mapcount breaks the daemons that think they're working 
>> on a pure anon VMA (kcompactd, khugepaged).
> 
> Note here the target vma ie the mirroring one is a mmap of device file
> and thus is skip by all of the above (kcompactd, khugepaged, ...) it is
> fully ignore by core mm.
> 
> Thus you do not need to fix the refcount in any way. If any of the core
> mm try to reclaim memory from the original vma then you will get mmu
> notifier callbacks and all you have to do is clear the page table of your
> device vma.
> 
> I did exactly that as a tools in the past and it works just fine with
> no change to core mm whatsoever.
> 
> Cheers,
> Jérôme
>
Jerome Glisse Oct. 2, 2019, 2:15 p.m. UTC | #13
On Wed, Oct 02, 2019 at 03:46:30PM +0200, Paolo Bonzini wrote:
> On 02/10/19 21:27, Jerome Glisse wrote:
> > On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
> >>> On 05/09/19 20:09, Jerome Glisse wrote:
> >>>> Not sure i understand, you are saying that the solution i outline
> >>>> above does not work ? If so then i think you are wrong, in the above
> >>>> solution the importing process mmap a device file and the resulting
> >>>> vma is then populated using insert_pfn() and constantly keep
> >>>> synchronize with the target process through mirroring which means that
> >>>> you never have to look at the struct page ... you can mirror any kind
> >>>> of memory from the remote process.
> >>>
> >>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> >>> the KVM MMU notifier), then that would work.  Though I guess it would be
> >>> possible to call MMU notifier update callbacks around the call to insert_pfn.
> >>
> >> Can't do that.
> >> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
> >> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
> >> or vmf_insert_mixed().
> > 
> > Why would you need to target mmu notifier on target vma ?
> 
> If the mapping of the source VMA changes, mirroring can update the
> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> dismantles its own existing page tables (so that they can be recreated
> with the new mapping from the source VMA)?
> 

So just to make sure I follow, we have:
      - qemu process on host with anonymous vma
            -> host cpu page table
      - kvm which maps host anonymous vma to guest
            -> kvm guest page table
      - kvm inspector process which mirror vma from qemu process
            -> inspector process page table

AFAIK the KVM notifiers will clear the kvm guest page table whenever
necessary (through kvm_mmu_notifier_invalidate_range_start). This is
what ensures that KVM dismantles its own mapping; it abides by mmu-
notifier callbacks. If it did not, you would have bugs (at least I
expect so). Am I wrong here?

The mirroring kernel driver would also register its notifier against
the qemu process and would also abide by the notifier callbacks.

What you want to maintain at all times is that none of the actors
above ever looks at a different page for the same virtual address
(i.e. one looking at an older page while another looks at a new page).

This is where you have helpers like HMM that make sure that you cannot
populate the mirroring vma while a notifier is ongoing, which means
that everything is serialized on the notifier.

Cheers,
Jérôme
Paolo Bonzini Oct. 2, 2019, 4:18 p.m. UTC | #14
On 02/10/19 16:15, Jerome Glisse wrote:
>>> Why would you need to target mmu notifier on target vma ?
>> If the mapping of the source VMA changes, mirroring can update the
>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>> dismantles its own existing page tables (so that they can be recreated
>> with the new mapping from the source VMA)?
>>
> So just to make sure i follow we have:
>       - qemu process on host with anonymous vma
>             -> host cpu page table
>       - kvm which maps host anonymous vma to guest
>             -> kvm guest page table
>       - kvm inspector process which mirror vma from qemu process
>             -> inspector process page table
> 
> AFAIK the KVM notifier's will clear the kvm guest page table whenever
> necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> what ensure that KVM's dismatles its own mapping, it abides to mmu-
> notifier callbacks. If you did not you would have bugs (at least i
> expect so). Am i wrong here ?

The KVM inspector process is also (or can be) a QEMU that will have to
create its own KVM guest page table.

So if a page in the source VMA is unmapped we want:

- the source KVM to invalidate its guest page table (done by the KVM MMU
notifier)

- the target VMA to be invalidated (easy using mirroring)

- the target KVM to invalidate its guest page table, as a result of
invalidation of the target VMA

Paolo
Jerome Glisse Oct. 2, 2019, 5:04 p.m. UTC | #15
On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
> On 02/10/19 16:15, Jerome Glisse wrote:
> >>> Why would you need to target mmu notifier on target vma ?
> >> If the mapping of the source VMA changes, mirroring can update the
> >> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> >> dismantles its own existing page tables (so that they can be recreated
> >> with the new mapping from the source VMA)?
> >>
> > So just to make sure i follow we have:
> >       - qemu process on host with anonymous vma
> >             -> host cpu page table
> >       - kvm which maps host anonymous vma to guest
> >             -> kvm guest page table
> >       - kvm inspector process which mirror vma from qemu process
> >             -> inspector process page table
> > 
> > AFAIK the KVM notifier's will clear the kvm guest page table whenever
> > necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> > what ensure that KVM's dismatles its own mapping, it abides to mmu-
> > notifier callbacks. If you did not you would have bugs (at least i
> > expect so). Am i wrong here ?
> 
> The KVM inspector process is also (or can be) a QEMU that will have to
> create its own KVM guest page table.

OK, missed that part; thank you for explaining.

> 
> So if a page in the source VMA is unmapped we want:
> 
> - the source KVM to invalidate its guest page table (done by the KVM MMU
> notifier)
> 
> - the target VMA to be invalidated (easy using mirroring)
> 
> - the target KVM to invalidate its guest page table, as a result of
> invalidation of the target VMA

You can do the target KVM invalidation inside the mirroring invalidation
code.

Cheers,
Jérôme
Jerome Glisse Oct. 2, 2019, 7:27 p.m. UTC | #16
On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
> > On 05/09/19 20:09, Jerome Glisse wrote:
> > > Not sure i understand, you are saying that the solution i outline
> > > above does not work ? If so then i think you are wrong, in the above
> > > solution the importing process mmap a device file and the resulting
> > > vma is then populated using insert_pfn() and constantly keep
> > > synchronize with the target process through mirroring which means that
> > > you never have to look at the struct page ... you can mirror any kind
> > > of memory from the remote process.
> > 
> > If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
> > the KVM MMU notifier), then that would work.  Though I guess it would be
> > possible to call MMU notifier update callbacks around the call to insert_pfn.
> 
> Can't do that.
> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
> or vmf_insert_mixed().

Why would you need to target the mmu notifier on the target vma? You do
not need that. The workflow is:

    userspace:
        ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)

Then, when the mirroring process accesses ptr, it triggers a page fault
that ends up in the vm_operation_struct->fault(), which is just doing:

    kernel-kvm-mirroring-function:
        kvm_mirror_page_fault(struct vm_fault *vmf) {
            struct kvm_mirror_struct *kvmms;

            kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
            ...
        again:
            hmm_range_register(&range);
            hmm_range_snapshot(&range);
            take_lock(kvmms->update);
            if (!hmm_range_valid(&range)) {
                vm_insert_pfn();
                drop_lock(kvmms->update);
                hmm_range_unregister(&range);
                return VM_FAULT_NOPAGE;
            }
            drop_lock(kvmms->update);
            goto again;
        }

The notifier callback:
        kvmms_notifier_start() {
            take_lock(kvmms->update);
            clear_pte(start, end);
            drop_lock(kvmms->update);
        }

> 
> Our model (the importing process is encapsulated in another VM) forces us
> to mirror certain pages from the anon VMA backing one VM's system RAM to 
> the other VM's anon VMA. 

The mirror does not have to be an anon vma; it can very well be a
device vma, i.e. an mmap of a device file. I do not see any reason why
the mirror needs to be an anon vma. Please explain why.

> 
> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
> the target anon VMA, but I guess this breaks the VMA. Is this recommended?

The mirror vma should not be an anon vma.

> 
> Then, mapping anon pages from one VMA to another without fixing the 
> refcount and the mapcount breaks the daemons that think they're working 
> on a pure anon VMA (kcompactd, khugepaged).

Note that here the target vma, i.e. the mirroring one, is an mmap of a
device file and thus is skipped by all of the above (kcompactd,
khugepaged, ...); it is fully ignored by the core mm.

Thus you do not need to fix the refcount in any way. If any of the core
mm tries to reclaim memory from the original vma then you will get mmu
notifier callbacks, and all you have to do is clear the page table of
your device vma.

I did exactly that as a tool in the past and it works just fine with
no change to the core mm whatsoever.

Cheers,
Jérôme
Paolo Bonzini Oct. 2, 2019, 8:10 p.m. UTC | #17
On 02/10/19 19:04, Jerome Glisse wrote:
> On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
>>>> If the mapping of the source VMA changes, mirroring can update the
>>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>>>> dismantles its own existing page tables (so that they can be recreated
>>>> with the new mapping from the source VMA)?
>>
>> The KVM inspector process is also (or can be) a QEMU that will have to
>> create its own KVM guest page table.  So if a page in the source VMA is
>> unmapped we want:
>>
>> - the source KVM to invalidate its guest page table (done by the KVM MMU
>> notifier)
>>
>> - the target VMA to be invalidated (easy using mirroring)
>>
>> - the target KVM to invalidate its guest page table, as a result of
>> invalidation of the target VMA
> 
> You can do the target KVM invalidation inside the mirroring invalidation
> code.

Why should the source and target KVMs behave differently?  If the source
invalidates its guest page table via MMU notifiers, so should the target.

The KVM MMU notifier exists so that nothing (including mirroring) needs
to know that there is KVM on the other side.  Any interaction between
KVM page tables and VMAs must be mediated by MMU notifiers, anything
else is unacceptable.

If it is possible to invoke the MMU notifiers around the calls to
insert_pfn, that of course would be perfect.

Thanks,

Paolo
Jerome Glisse Oct. 3, 2019, 3:42 p.m. UTC | #18
On Wed, Oct 02, 2019 at 10:10:18PM +0200, Paolo Bonzini wrote:
> On 02/10/19 19:04, Jerome Glisse wrote:
> > On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
> >>>> If the mapping of the source VMA changes, mirroring can update the
> >>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
> >>>> dismantles its own existing page tables (so that they can be recreated
> >>>> with the new mapping from the source VMA)?
> >>
> >> The KVM inspector process is also (or can be) a QEMU that will have to
> >> create its own KVM guest page table.  So if a page in the source VMA is
> >> unmapped we want:
> >>
> >> - the source KVM to invalidate its guest page table (done by the KVM MMU
> >> notifier)
> >>
> >> - the target VMA to be invalidated (easy using mirroring)
> >>
> >> - the target KVM to invalidate its guest page table, as a result of
> >> invalidation of the target VMA
> > 
> > You can do the target KVM invalidation inside the mirroring invalidation
> > code.
> 
> Why should the source and target KVMs behave differently?  If the source
> invalidates its guest page table via MMU notifiers, so should the target.
> 
> The KVM MMU notifier exists so that nothing (including mirroring) needs
> to know that there is KVM on the other side.  Any interaction between
> KVM page tables and VMAs must be mediated by MMU notifiers, anything
> else is unacceptable.
> 
> If it is possible to invoke the MMU notifiers around the calls to
> insert_pfn, that of course would be perfect.

OK, and yes you can do exactly that, i.e. inside the mmu notifier callback
from the target. For instance it is as easy as:
    target_mirror_notifier_start_callback(start, end) {
        struct kvm_mirror_struct *kvmms = from_mmun(...);
        unsigned long target_foff, size;

        size = end - start;
        target_foff = kvmms_convert_mirror_address(start);
        take_lock(kvmms->mirror_fault_exclusion_lock);
        unmap_mapping_range(kvmms->address_space, target_foff, size, 1);
        drop_lock(kvmms->mirror_fault_exclusion_lock);
    }

All that is needed is to make sure that vm_normal_page() will see those
ptes (inside the process that is mirroring the other process) as special,
which is the case either because insert_pfn() marks the pte as special, or
because the kvm device driver which controls the vm_operations struct sets
a find_special_page() callback that always returns NULL, or because the vma
has either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).

So you can keep the existing kvm code unmodified.

Cheers,
Jérôme
Paolo Bonzini Oct. 3, 2019, 3:50 p.m. UTC | #19
On 03/10/19 17:42, Jerome Glisse wrote:
> All that is needed is to make sure that vm_normal_page() will see those
> pte (inside the process that is mirroring the other process) as special
> which is the case either because insert_pfn() mark the pte as special or
> the kvm device driver which control the vm_operation struct set a
> find_special_page() callback that always return NULL, or the vma has
> either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).
> 
> So you can keep the existing kvm code unmodified.

Great, thanks.  And KVM is already able to handle VM_PFNMAP/VM_MIXEDMAP,
so that should work.

Paolo
Mircea CIRJALIU - MELIU Oct. 3, 2019, 4:36 p.m. UTC | #20
> The KVM MMU notifier exists so that nothing (including mirroring) needs to
> know that there is KVM on the other side.  Any interaction between KVM
> page tables and VMAs must be mediated by MMU notifiers, anything else is
> unacceptable.
> 
> If it is possible to invoke the MMU notifiers around the calls to insert_pfn,
> that of course would be perfect.

Looks to me like a work-around.
Any reason why insert_pfn() can't do set_pte_at_notify() so it triggers the KVM MMU notifier instead?
Mircea CIRJALIU - MELIU Oct. 3, 2019, 4:42 p.m. UTC | #21
> On 03/10/19 17:42, Jerome Glisse wrote:
> > All that is needed is to make sure that vm_normal_page() will see
> > those pte (inside the process that is mirroring the other process) as
> > special which is the case either because insert_pfn() mark the pte as
> > special or the kvm device driver which control the vm_operation struct
> > set a
> > find_special_page() callback that always return NULL, or the vma has
> > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with
> insert_pfn).
> >
> > So you can keep the existing kvm code unmodified.
> 
> Great, thanks.  And KVM is already able to handle
> VM_PFNMAP/VM_MIXEDMAP, so that should work.

This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the VM's system RAM.
Will it have any side effects?
Jerome Glisse Oct. 3, 2019, 6:31 p.m. UTC | #22
On Thu, Oct 03, 2019 at 04:42:20PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On 03/10/19 17:42, Jerome Glisse wrote:
> > > All that is needed is to make sure that vm_normal_page() will see
> > > those pte (inside the process that is mirroring the other process) as
> > > special which is the case either because insert_pfn() mark the pte as
> > > special or the kvm device driver which control the vm_operation struct
> > > set a
> > > find_special_page() callback that always return NULL, or the vma has
> > > either VM_PFNMAP or VM_MIXEDMAP set (which is the case with
> > insert_pfn).
> > >
> > > So you can keep the existing kvm code unmodified.
> > 
> > Great, thanks.  And KVM is already able to handle
> > VM_PFNMAP/VM_MIXEDMAP, so that should work.
> 
> This means setting VM_PFNMAP/VM_MIXEDMAP on the anon VMA that acts as the VM's system RAM.
> Will it have any side effects?

You do not set it up on the anonymous vma but on the mmap of the
kvm device file; the resulting vma is under the control of the
kvm device file and is not an anonymous vma but a "device" special
vma.

So in summary, the source qemu process has an anonymous vma (regular
libc malloc for instance). The introspector qemu process which
mirrors the source qemu uses mmap on /dev/kvm (assuming you can
reuse the kvm device file for this; otherwise you can introduce a
new kvm device file). The resulting mmap inside the introspector
qemu process is a vma which has vma->vm_file pointing to the kvm
device file and has VM_PFNMAP or VM_MIXEDMAP (I think you want the
former). On architectures with ARCH_SPECIAL_PTE the pte will be
marked as special when using insert_pfn(); on other architectures you
can either rely on the VM_PFNMAP/VM_MIXEDMAP flags or set a specific
find_special_page() callback in vm_ops.


I am at a conference right now, but I will put an example of what
I mean next week.

Cheers,
Jérôme
Paolo Bonzini Oct. 3, 2019, 7:38 p.m. UTC | #23
On 03/10/19 20:31, Jerome Glisse wrote:
> So in summary, the source qemu process has anonymous vma (regular
> libc malloc for instance). The introspector qemu process which
> mirror the the source qemu use mmap on /dev/kvm (assuming you can
> reuse the kvm device file for this otherwise you can introduce a
> new kvm device file). 

It should be a new device, something like /dev/kvmmem.  BitDefender's
RFC patches already have the right userspace API, that was not an issue.

Paolo
Mircea CIRJALIU - MELIU Oct. 4, 2019, 9:41 a.m. UTC | #24
> On 03/10/19 20:31, Jerome Glisse wrote:
> > So in summary, the source qemu process has anonymous vma (regular libc
> > malloc for instance). The introspector qemu process which mirror the
> > the source qemu use mmap on /dev/kvm (assuming you can reuse the
> kvm
> > device file for this otherwise you can introduce a new kvm device
> > file).
> 
> It should be a new device, something like /dev/kvmmem.  BitDefender's RFC
> patches already have the right userspace API, that was not an issue.

I get it so far. I have a patch that does mirroring in a separate VMA.
We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the 
source VMA in the other QEMU and is refreshed by the device MMU notifier.

This is a simple choice for an introspector process that runs on the same host 
as the source QEMU. But how do I make the new VMA accessible as memory 
to the guest VM inside the introspector QEMU? I was thinking of 2 solutions:

Create a new memslot based on the mirror VMA, hotplug it into the guest as
a new memory device (is this possible?) and have a guest-side driver allocate
pages from that area.

or

Redirect (some) GFN->HVA translations into the new VMA based on a table 
of addresses required by the introspector process.

Mircea
Paolo Bonzini Oct. 4, 2019, 11:46 a.m. UTC | #25
On 04/10/19 11:41, Mircea CIRJALIU - MELIU wrote:
> I get it so far. I have a patch that does mirroring in a separate VMA.
> We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the 
> source VMA in the other QEMU and is refreshed by the device MMU notifier.

So for example on the host you'd have a new ioctl on the kvm file
descriptor.  You pass a size and you get back a file descriptor for that
guest's physical memory, which is mmap-able up to the size you specified
in the ioctl.

In turn, the file descriptor would have ioctls to map/unmap ranges of
the guest memory into its mmap-able range.  Accessing an unmapped range
produces a SIGSEGV.
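
A rough userspace sketch of that interface; every ioctl name and struct
below is a placeholder, not an existing KVM uapi:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    struct kvm_memview_map {                /* placeholder */
            uint64_t guest_gpa;             /* guest range to expose ...      */
            uint64_t size;
            uint64_t offset;                /* ... at this offset in the view */
    };

    #define KVM_CREATE_MEM_VIEW  _IOW('k', 0x10, uint64_t)               /* placeholder */
    #define KVM_MEM_VIEW_MAP     _IOW('v', 0x01, struct kvm_memview_map) /* placeholder */

    static void *open_guest_view(int vm_fd, uint64_t view_size)
    {
            uint64_t size = view_size;
            int view_fd;
            void *view;

            /* new ioctl on the VM fd: returns an fd for the guest memory view */
            view_fd = ioctl(vm_fd, KVM_CREATE_MEM_VIEW, &size);
            if (view_fd < 0)
                    return MAP_FAILED;

            /* the view fd is mmap-able up to the size given above */
            view = mmap(NULL, view_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        view_fd, 0);

            /* ranges only become accessible after KVM_MEM_VIEW_MAP ioctls on
             * view_fd; touching an unmapped range raises SIGSEGV */
            return view;
    }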

When asked via the QEMU monitor, QEMU will create the file descriptor
and pass it back via SCM_RIGHTS.  The management application can then
use it to hotplug memory into the destination...

> Create a new memslot based on the mirror VMA, hotplug it into the guest as
> new memory device (is this possible?) and have a guest-side driver allocate 
> pages from that area.

... using the existing ivshmem device, whose BAR can be accessed and
mmap-ed from the guest via sysfs.  In other words, the hotplugging will
use the file descriptor returned by QEMU when creating the ivshmem device.

We then need an additional mechanism to invoke the map/unmap ioctls from
the guest.  Without writing a guest-side driver it is possible to:

- pass a socket into the "create guest physical memory view" ioctl
above.  KVM will then associate that KVMI socket with the newly created
file descriptor.

- use KVMI messages to that socket to map/unmap sections of memory

> Redirect (some) GFN->HVA translations into the new VMA based on a table 
> of addresses required by the introspector process.

That would be tricky because there are multiple paths (gfn_to_page,
gfn_to_pfn, etc.).

There is some complication in this because the new device has to be
plumbed at multiple levels (KVM, QEMU, libvirt).  But it seems like a
very easily separated piece of code (except for the KVMI socket part,
which can be added later), so I suggest that you contribute the KVM
parts first.

Paolo

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 39b4494e29f1..3f65b2833562 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -417,8 +417,10 @@  PAGEFLAG(Idle, idle, PF_ANY)
  */
 #define PAGE_MAPPING_ANON	0x1
 #define PAGE_MAPPING_MOVABLE	0x2
+#define PAGE_MAPPING_REMOTE	0x4
 #define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | \
+				 PAGE_MAPPING_REMOTE)
 
 static __always_inline int PageMappingFlags(struct page *page)
 {
@@ -431,6 +433,11 @@  static __always_inline int PageAnon(struct page *page)
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
+static __always_inline int PageRemote(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_REMOTE) != 0;
+}
+
 static __always_inline int __PageMovable(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
diff --git a/include/linux/remote_mapping.h b/include/linux/remote_mapping.h
new file mode 100644
index 000000000000..d30d0d10e51d
--- /dev/null
+++ b/include/linux/remote_mapping.h
@@ -0,0 +1,167 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _REMOTE_MAPPING_H
+#define _REMOTE_MAPPING_H
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/rbtree.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mmdebug.h>
+
+struct page_db {
+	struct mm_struct *target;		/* target for this mapping */
+	unsigned long req_hva;			/* HVA in target */
+	unsigned long map_hva;			/* HVA in client */
+
+	refcount_t refcnt;			/* client-side sharing */
+	int flags;
+
+	/* target links - serialized by target_db->lock */
+	struct list_head target_link;		/* target-side link */
+
+	/* client links - serialized by client_db->lock */
+	struct rb_node file_link;		/* uses map_hva as key */
+
+	/* rmap components - serialized by page lock */
+	struct anon_vma *req_anon_vma;
+	struct anon_vma *map_anon_vma;
+};
+
+struct target_db {
+	struct mm_struct *mm;		/* mm of this struct */
+	struct hlist_node db_link;	/* database link */
+
+	struct mmu_notifier mn;		/* for notifications from mm */
+	struct rcu_head	rcu;		/* for delayed freeing */
+	refcount_t refcnt;
+
+	spinlock_t lock;		/* lock for the following */
+	struct mm_struct *client;	/* client for this target */
+	struct list_head pages_list;	/* mapped HVAs for this target */
+};
+
+struct file_db;
+struct client_db {
+	struct mm_struct *mm;		/* mm of this struct */
+	struct hlist_node db_link;	/* database link */
+
+	struct mmu_notifier mn;		/* for notifications from mm */
+	struct rcu_head	rcu;		/* for delayed freeing */
+	refcount_t refcnt;
+
+	struct file_db *pseudo;		/* kernel interface */
+};
+
+struct file_db {
+	struct client_db *cdb;
+
+	spinlock_t lock;		/* lock for the following */
+	struct rb_root rb_root;		/* mappings indexed by map_hva */
+};
+
+static inline void *PageMapping(struct page_db *pdb)
+{
+	return (void *)pdb + (PAGE_MAPPING_ANON | PAGE_MAPPING_REMOTE);
+}
+
+static inline struct page_db *RemoteMapping(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageRemote(page), page);
+	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+}
+
+/*
+ * Template for keyed RB tree.
+ *
+ * RBCTYPE	type of container structure
+ * _rb_root	name of rb_root element
+ * RBNTYPE	type of node structure
+ * _rb_node	name of rb_node element
+ * _key		name of key element
+ */
+
+#define KEYED_RB_TREE(RBPREFIX, RBCTYPE, _rb_root, RBNTYPE, _rb_node, _key)\
+									\
+static bool RBPREFIX ## _insert(RBCTYPE * _container, RBNTYPE * _node)	\
+{									\
+	struct rb_root *root = &_container->_rb_root;			\
+	struct rb_node **new = &root->rb_node;				\
+	struct rb_node *parent = NULL;					\
+									\
+	/* Figure out where to put new node */				\
+	while (*new) {							\
+		RBNTYPE *this = rb_entry(*new, RBNTYPE, _rb_node);	\
+									\
+		parent = *new;						\
+		if (_node->_key < this->_key)				\
+			new = &((*new)->rb_left);			\
+		else if (_node->_key > this->_key)			\
+			new = &((*new)->rb_right);			\
+		else							\
+			return false;					\
+	}								\
+									\
+	/* Add new node and rebalance tree. */				\
+	rb_link_node(&_node->_rb_node, parent, new);			\
+	rb_insert_color(&_node->_rb_node, root);			\
+									\
+	return true;							\
+}									\
+									\
+static RBNTYPE *							\
+RBPREFIX ## _search(RBCTYPE * _container, unsigned long _key)		\
+{									\
+	struct rb_root *root = &_container->_rb_root;			\
+	struct rb_node *node = root->rb_node;				\
+									\
+	while (node) {							\
+		RBNTYPE *_node = rb_entry(node, RBNTYPE, _rb_node);	\
+									\
+		if (_key < _node->_key)					\
+			node = node->rb_left;				\
+		else if (_key > _node->_key)				\
+			node = node->rb_right;				\
+		else							\
+			return _node;					\
+	}								\
+									\
+	return NULL;							\
+}									\
+									\
+static void RBPREFIX ## _remove(RBCTYPE *_container, RBNTYPE *_node)	\
+{									\
+	rb_erase(&_node->_rb_node, &_container->_rb_root);		\
+	RB_CLEAR_NODE(&_node->_rb_node);				\
+}									\
+									\
+static bool RBPREFIX ## _empty(const RBCTYPE *_container)		\
+{									\
+	return RB_EMPTY_ROOT(&_container->_rb_root);			\
+}									\
+
+#ifdef CONFIG_REMOTE_MAPPING
+extern int mm_remote_map(struct mm_struct *req_mm,
+			 unsigned long req_hva, unsigned long map_hva);
+extern int mm_remote_unmap(unsigned long map_hva);
+extern void mm_remote_reset(void);
+extern void rmap_walk_remote(struct page *page, struct rmap_walk_control *rwc);
+#else /* CONFIG_REMOTE_MAPPING */
+static inline int mm_remote_map(struct mm_struct *req_mm,
+				unsigned long req_hva, unsigned long map_hva)
+{
+	return -EINVAL;
+}
+static inline int mm_remote_unmap(unsigned long map_hva)
+{
+	return -EINVAL;
+}
+static inline void mm_remote_reset(void)
+{
+}
+static inline void rmap_walk_remote(struct page *page,
+				    struct rmap_walk_control *rwc)
+{
+}
+#endif /* CONFIG_REMOTE_MAPPING */
+
+#endif /* _REMOTE_MAPPING_H */
diff --git a/include/uapi/linux/remote_mapping.h b/include/uapi/linux/remote_mapping.h
new file mode 100644
index 000000000000..d8b544dd5add
--- /dev/null
+++ b/include/uapi/linux/remote_mapping.h
@@ -0,0 +1,18 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef __UAPI_REMOTE_MAPPING_H__
+#define __UAPI_REMOTE_MAPPING_H__
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+struct remote_map_request {
+	__u32 req_pid;
+	__u64 req_hva;
+	__u64 map_hva;
+};
+
+#define REMOTE_MAP       _IOW('r', 0x01, struct remote_map_request)
+#define REMOTE_UNMAP     _IOW('r', 0x02, unsigned long)
+
+#endif /* __UAPI_REMOTE_MAPPING_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb8a7db..8451dafd3c91 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -758,4 +758,12 @@  config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+config REMOTE_MAPPING
+	bool "Remote memory mapping"
+	depends on MMU && !KSM
+	default n
+	help
+	  Allows a given application to map pages of another application in its own
+	  address space.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..e69a3b15627a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 831be5ff5f4d..40066271c411 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -9,16 +9,16 @@ 
  * High level machine check handler. Handles pages reported by the
  * hardware as being corrupted usually due to a multi-bit ECC memory or cache
  * failure.
- * 
+ *
  * In addition there is a "soft offline" entry point that allows stop using
  * not-yet-corrupted-by-suspicious pages without killing anything.
  *
  * Handles page cache pages in various states.	The tricky part
- * here is that we can access any page asynchronously in respect to 
- * other VM users, because memory failures could happen anytime and 
- * anywhere. This could violate some of their assumptions. This is why 
- * this code has to be extremely careful. Generally it tries to use 
- * normal locking rules, as in get the standard locks, even if that means 
+ * here is that we can access any page asynchronously in respect to
+ * other VM users, because memory failures could happen anytime and
+ * anywhere. This could violate some of their assumptions. This is why
+ * this code has to be extremely careful. Generally it tries to use
+ * normal locking rules, as in get the standard locks, even if that means
  * the error handling takes potentially a long time.
  *
  * It can be very tempting to add handling for obscure cases here.
@@ -28,12 +28,12 @@ 
  *   https://git.kernel.org/cgit/utils/cpu/mce/mce-test.git/
  * - The case actually shows up as a frequent (top 10) page state in
  *   tools/vm/page-types when running a real workload.
- * 
+ *
  * There are several operations here with exponential complexity because
- * of unsuitable VM data structures. For example the operation to map back 
- * from RMAP chains to processes has to walk the complete process list and 
+ * of unsuitable VM data structures. For example the operation to map back
+ * from RMAP chains to processes has to walk the complete process list and
  * has non linear complexity with the number. But since memory corruptions
- * are rare we hope to get away with this. This avoids impacting the core 
+ * are rare we hope to get away with this. This avoids impacting the core
  * VM.
  */
 #include <linux/kernel.h>
@@ -59,6 +59,7 @@ 
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
+#include <linux/remote_mapping.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -467,6 +468,45 @@  static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	page_unlock_anon_vma_read(av);
 }
 
+/*
+ * Collect processes when the error hit a remote mapped page.
+ */
+static void collect_procs_remote(struct page *page, struct list_head *to_kill,
+				struct to_kill **tkc, int force_early)
+{
+	struct page_db *pdb;
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av;
+	pgoff_t pgoff;
+
+	pdb = RemoteMapping(page);
+	av = pdb->req_anon_vma;
+	if (av == NULL)			/* Target has left */
+		return;
+
+	pgoff = page_to_pgoff(page);	/* Offset in target */
+	anon_vma_lock_read(av);
+	read_lock(&tasklist_lock);
+	for_each_process(tsk) {
+		struct anon_vma_chain *vmac;
+		struct task_struct *t = task_early_kill(tsk, force_early);
+
+		if (!t)
+			continue;
+		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
+			pgoff, pgoff) {
+			vma = vmac->vma;
+			if (!page_mapped_in_vma(page, vma))
+				continue;
+			if (vma->vm_mm == t->mm)
+				add_to_kill(t, page, vma, to_kill, tkc);
+		}
+	}
+	read_unlock(&tasklist_lock);
+	anon_vma_unlock_read(av);
+}
+
 /*
  * Collect processes when the error hit a file mapped page.
  */
@@ -519,9 +559,12 @@  static void collect_procs(struct page *page, struct list_head *tokill,
 	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
 	if (!tk)
 		return;
-	if (PageAnon(page))
-		collect_procs_anon(page, tokill, &tk, force_early);
-	else
+	if (PageAnon(page)) {
+		if (PageRemote(page))
+			collect_procs_remote(page, tokill, &tk, force_early);
+		else
+			collect_procs_anon(page, tokill, &tk, force_early);
+	} else
 		collect_procs_file(page, tokill, &tk, force_early);
 	kfree(tk);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..4d18a8115ffc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -47,6 +47,7 @@ 
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/remote_mapping.h>
 
 #include <asm/tlbflush.h>
 
@@ -215,7 +216,7 @@  static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	while (page_vma_mapped_walk(&pvmw)) {
-		if (PageKsm(page))
+		if (PageKsm(page) || PageRemote(page))
 			new = page;
 		else
 			new = page - pvmw.page->index +
@@ -1065,7 +1066,7 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * because that implies that the anon page is no longer mapped
 	 * (and cannot be remapped so long as we hold the page lock).
 	 */
-	if (PageAnon(page) && !PageKsm(page))
+	if (PageAnon(page) && !PageKsm(page) && !PageRemote(page))
 		anon_vma = page_get_anon_vma(page);
 
 	/*
@@ -1104,8 +1105,8 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 		}
 	} else if (page_mapped(page)) {
 		/* Establish migration ptes */
-		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
-				page);
+		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) &&
+				!PageRemote(page) && !anon_vma, page);
 		try_to_unmap(page,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 		page_was_mapped = 1;
diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c
new file mode 100644
index 000000000000..14b0db89c425
--- /dev/null
+++ b/mm/remote_mapping.c
@@ -0,0 +1,1834 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Remote memory mapping.
+ *
+ * Copyright (C) 2017-2019 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/rbtree.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/mm.h>
+#include <linux/pid.h>
+#include <linux/oom.h>
+#include <linux/huge_mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched/mm.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/hashtable.h>
+#include <linux/refcount.h>
+#include <linux/debugfs.h>
+#include <linux/miscdevice.h>
+#include <linux/remote_mapping.h>
+#include <uapi/linux/remote_mapping.h>
+
+#include "internal.h"
+
+#define ASSERT(exp) BUG_ON(!(exp))
+#define BUSY_BIT 0
+#define MAPPED_BIT 1
+
+#define TDB_HASH_BITS 4
+#define CDB_HASH_BITS 2
+
+static int mm_remote_do_unmap(struct mm_struct *map_mm, unsigned long map_hva);
+static int mm_remote_do_unmap_target(struct page_db *pdb);
+static int mm_remote_make_stale(struct page_db *pdb);
+
+static void mm_remote_db_target_release(struct target_db *tdb);
+static void mm_remote_db_client_release(struct client_db *cdb);
+
+static void tdb_release(struct mmu_notifier *mn, struct mm_struct *mm);
+static void cdb_release(struct mmu_notifier *mn, struct mm_struct *mm);
+
+static const struct mmu_notifier_ops tdb_notifier_ops = {
+	.release = tdb_release,
+};
+
+static const struct mmu_notifier_ops cdb_notifier_ops = {
+	.release = cdb_release,
+};
+
+static DEFINE_HASHTABLE(tdb_hash, TDB_HASH_BITS);
+static DEFINE_SPINLOCK(tdb_lock);
+
+static DEFINE_HASHTABLE(cdb_hash, CDB_HASH_BITS);
+static DEFINE_SPINLOCK(cdb_lock);
+
+static struct kmem_cache *pdb_cache;
+static atomic_t pdb_count = ATOMIC_INIT(0);
+static atomic_t map_count = ATOMIC_INIT(0);
+static atomic_t rpg_count = ATOMIC_INIT(0);
+
+static atomic_t stat_empty_pte = ATOMIC_INIT(0);
+static atomic_t stat_mapped_pte = ATOMIC_INIT(0);
+static atomic_t stat_swap_pte = ATOMIC_INIT(0);
+static atomic_t stat_refault = ATOMIC_INIT(0);
+
+static struct dentry *mm_remote_debugfs_dir;
+
+static void target_db_init(struct target_db *tdb)
+{
+	tdb->mn.ops = &tdb_notifier_ops;
+	refcount_set(&tdb->refcnt, 0);
+
+	tdb->client = NULL;
+	INIT_LIST_HEAD(&tdb->pages_list);
+	spin_lock_init(&tdb->lock);
+}
+
+static struct target_db *target_db_alloc(void)
+{
+	struct target_db *tdb;
+
+	tdb = kzalloc(sizeof(*tdb), GFP_KERNEL);
+	if (tdb != NULL)
+		target_db_init(tdb);
+
+	return tdb;
+}
+
+static void target_db_free(struct target_db *tdb)
+{
+	ASSERT(refcount_read(&tdb->refcnt) == 0);
+	ASSERT(list_empty(&tdb->pages_list));
+
+	kfree(tdb);
+}
+
+static void target_db_insert(struct target_db *tdb, struct page_db *pdb)
+{
+	list_add(&pdb->target_link, &tdb->pages_list);
+}
+
+static bool target_db_empty(const struct target_db *tdb)
+{
+	return list_empty(&tdb->pages_list);
+}
+
+static void target_db_remove(struct target_db *tdb, struct page_db *pdb)
+{
+	list_del(&pdb->target_link);
+}
+
+static void target_db_free_delayed(struct rcu_head *rcu)
+{
+	struct target_db *tdb = container_of(rcu, struct target_db, rcu);
+
+	pr_debug("%s: for mm %016lx\n", __func__, (unsigned long)tdb->mm);
+
+	target_db_free(tdb);
+}
+
+static void target_db_put(struct target_db *tdb)
+{
+	if (refcount_dec_and_test(&tdb->refcnt)) {
+		pr_debug("%s: mm %016lx\n", __func__, (unsigned long)tdb->mm);
+
+		spin_lock(&tdb_lock);
+		hash_del(&tdb->db_link);
+		spin_unlock(&tdb_lock);
+
+		mm_remote_db_target_release(tdb);
+
+		ASSERT(target_db_empty(tdb));
+
+		mmu_notifier_call_srcu(&tdb->rcu, target_db_free_delayed);
+	}
+}
+
+static struct target_db *target_db_lookup(const struct mm_struct *mm)
+{
+	struct target_db *tdb;
+
+	spin_lock(&tdb_lock);
+
+	hash_for_each_possible(tdb_hash, tdb, db_link, (unsigned long)mm)
+		if (tdb->mm == mm && refcount_inc_not_zero(&tdb->refcnt))
+			break;
+
+	spin_unlock(&tdb_lock);
+
+	return tdb;
+}
+
+static struct target_db *target_db_lookup_or_add(struct mm_struct *mm)
+{
+	struct target_db *tdb, *allocated;
+	bool found = false;
+	int result;
+
+	allocated = target_db_alloc();	/* may be NULL */
+
+	spin_lock(&tdb_lock);
+
+	hash_for_each_possible(tdb_hash, tdb, db_link, (unsigned long)mm)
+		if (tdb->mm == mm && refcount_inc_not_zero(&tdb->refcnt)) {
+			found = true;
+			break;
+		}
+
+	if (!found && allocated != NULL) {
+		tdb = allocated;
+		allocated = NULL;
+
+		tdb->mm = mm;
+		hash_add(tdb_hash, &tdb->db_link, (unsigned long)mm);
+		refcount_set(&tdb->refcnt, 1);
+	}
+
+	spin_unlock(&tdb_lock);
+
+	if (allocated != NULL)
+		target_db_free(allocated);
+
+	if (found || tdb == NULL)
+		return tdb;
+
+	/*
+	 * register a mmu notifier when adding this entry to the list - at this
+	 * point other threads may already have hold of this tdb
+	 */
+	result = mmu_notifier_register(&tdb->mn, mm);
+	if (IS_ERR_VALUE((long) result)) {
+		pr_err("mmu_notifier_register() failed: %d\n", result);
+
+		target_db_put(tdb);
+		return ERR_PTR((long) result);
+	}
+
+	pr_debug("%s: new entry for mm %016lx\n",
+		__func__, (unsigned long)tdb->mm);
+
+	refcount_inc(&tdb->refcnt);
+	return tdb;
+}
+
+static void client_db_init(struct client_db *cdb)
+{
+	cdb->mm = NULL;
+	INIT_HLIST_NODE(&cdb->db_link);
+
+	cdb->mn.ops = &cdb_notifier_ops;
+	refcount_set(&cdb->refcnt, 0);
+
+	cdb->pseudo = NULL;
+}
+
+static struct client_db *client_db_alloc(void)
+{
+	struct client_db *cdb;
+
+	cdb = kzalloc(sizeof(*cdb), GFP_KERNEL);
+	if (cdb != NULL)
+		client_db_init(cdb);
+
+	return cdb;
+}
+
+static void client_db_free(struct client_db *cdb)
+{
+	ASSERT(refcount_read(&cdb->refcnt) == 0);
+
+	kfree(cdb);
+}
+
+static void client_db_free_delayed(struct rcu_head *rcu)
+{
+	struct client_db *cdb = container_of(rcu, struct client_db, rcu);
+
+	pr_debug("%s: mm %016lx\n", __func__, (unsigned long)cdb->mm);
+
+	client_db_free(cdb);
+}
+
+static void client_db_put(struct client_db *cdb)
+{
+	if (refcount_dec_and_test(&cdb->refcnt)) {
+		pr_debug("%s: mm %016lx\n", __func__, (unsigned long)cdb->mm);
+
+		spin_lock(&cdb_lock);
+		hash_del(&cdb->db_link);
+		spin_unlock(&cdb_lock);
+
+		mm_remote_db_client_release(cdb);
+
+		mmu_notifier_call_srcu(&cdb->rcu, client_db_free_delayed);
+	}
+}
+
+static struct client_db *client_db_lookup(const struct mm_struct *mm)
+{
+	struct client_db *cdb;
+
+	spin_lock(&cdb_lock);
+
+	hash_for_each_possible(cdb_hash, cdb, db_link, (unsigned long)mm)
+		if (cdb->mm == mm && refcount_inc_not_zero(&cdb->refcnt))
+			break;
+
+	spin_unlock(&cdb_lock);
+
+	return cdb;
+}
+
+// TODO: each mapping request by the direct kernel interface calls this
+// function to find its mm association. Temporarily allocating a struct
+// client_db for each mapping attempt may pose a performance problem.
+static struct client_db *client_db_lookup_or_add(struct mm_struct *mm)
+{
+	struct client_db *cdb, *allocated;
+	bool found = false;
+	int result;
+
+	allocated = client_db_alloc();	/* may be NULL */
+
+	spin_lock(&cdb_lock);
+
+	hash_for_each_possible(cdb_hash, cdb, db_link, (unsigned long)mm)
+		if (cdb->mm == mm && refcount_inc_not_zero(&cdb->refcnt)) {
+			found = true;
+			break;
+		}
+
+	if (!found && allocated != NULL) {
+		cdb = allocated;
+		allocated = NULL;
+
+		cdb->mm = mm;
+		hash_add(cdb_hash, &cdb->db_link, (unsigned long)mm);
+		refcount_set(&cdb->refcnt, 1);
+	}
+
+	spin_unlock(&cdb_lock);
+
+	if (allocated != NULL)
+		client_db_free(allocated);
+
+	if (found || cdb == NULL)
+		return cdb;
+
+	/*
+	 * register a mmu notifier when adding this entry to the list - at this
+	 * point other threads may already have hold of this cdb
+	 */
+	result = mmu_notifier_register(&cdb->mn, mm);
+	if (IS_ERR_VALUE((long)result)) {
+		pr_err("mmu_notifier_register() failed: %d\n", result);
+
+		client_db_put(cdb);
+		return ERR_PTR((long)result);
+	}
+
+	pr_debug("%s: new entry for mm %016lx\n",
+		__func__, (unsigned long)cdb->mm);
+
+	refcount_inc(&cdb->refcnt);
+	return cdb;
+}
+
+KEYED_RB_TREE(client_db_hva, struct file_db, rb_root,
+	struct page_db, file_link, map_hva)
+
+static void file_db_init(struct file_db *fdb)
+{
+	fdb->cdb = NULL;
+
+	spin_lock_init(&fdb->lock);
+	fdb->rb_root = RB_ROOT;
+}
+
+static struct file_db *file_db_alloc(void)
+{
+	struct file_db *fdb;
+
+	fdb = kmalloc(sizeof(*fdb), GFP_KERNEL);
+	if (fdb != NULL)
+		file_db_init(fdb);
+
+	return fdb;
+}
+
+static void file_db_free(struct file_db *fdb)
+{
+	ASSERT(client_db_hva_empty(fdb));
+
+	kfree(fdb);
+}
+
+static struct file_db *client_db_pseudo_file(struct client_db *cdb)
+{
+	struct file_db *allocated;
+
+	if (cdb->pseudo == NULL) {
+		allocated = file_db_alloc();
+		if (cmpxchg(&cdb->pseudo, NULL, allocated))
+			file_db_free(allocated);
+	}
+
+	return cdb->pseudo;
+}
+
+static struct page_db *page_db_alloc(void)
+{
+	struct page_db *result;
+
+	result = kmem_cache_alloc(pdb_cache, GFP_KERNEL);
+	if (result == NULL)
+		return NULL;
+
+	memset(result, 0, sizeof(*result));
+
+	atomic_inc(&pdb_count);
+
+	return result;
+}
+
+static void page_db_free(struct page_db *pdb)
+{
+	kmem_cache_free(pdb_cache, pdb);
+
+	BUG_ON(atomic_add_negative(-1, &pdb_count));
+}
+
+static void page_db_put(struct page_db *pdb)
+{
+	if (refcount_dec_and_test(&pdb->refcnt)) {
+
+		/* this case is possible if both target and client are
+		 * OOM-killed in quick succession and the release functions
+		 * can't get to the remote mapped page
+		 */
+		if (pdb->map_anon_vma)
+			put_anon_vma(pdb->map_anon_vma);
+		if (pdb->req_anon_vma)
+			put_anon_vma(pdb->req_anon_vma);
+
+		page_db_free(pdb);
+	}
+}
+
+static void page_db_release(struct page_db *pdb)
+{
+	clear_bit(BUSY_BIT, (unsigned long *)&pdb->flags);
+	/* see comments of wake_up_bit(), set_bit() is atomic */
+	smp_mb__after_atomic();
+	wake_up_bit(&pdb->flags, BUSY_BIT);
+}
+
+/* Reserve a mapping entry indexed by map_hva in the file database. */
+static struct page_db *
+page_db_reserve(struct file_db *fdb, struct mm_struct *req_mm,
+	unsigned long req_hva, unsigned long map_hva)
+{
+	struct page_db *pdb;
+
+	pdb = page_db_alloc();
+	if (unlikely(pdb == NULL))
+		return ERR_PTR(-ENOMEM);
+
+	/* fill pdb */
+	pdb->target = req_mm;
+	pdb->req_hva = req_hva;
+	pdb->map_hva = map_hva;
+	refcount_set(&pdb->refcnt, 1);
+	__set_bit(BUSY_BIT, (unsigned long *)&pdb->flags);
+
+	/* insert mapping entry into the client if not already there */
+	spin_lock(&fdb->lock);
+
+	if (likely(client_db_hva_insert(fdb, pdb)))
+		refcount_inc(&pdb->refcnt);
+	else {
+		page_db_free(pdb);
+		pdb = ERR_PTR(-EALREADY);
+	}
+
+	spin_unlock(&fdb->lock);
+
+	return pdb;
+}
+
+/* Reverse of page_db_reserve(), to be called in case of error. */
+static void
+page_db_unreserve(struct file_db *fdb, struct page_db *pdb)
+{
+	spin_lock(&fdb->lock);
+
+	client_db_hva_remove(fdb, pdb);
+	page_db_put(pdb);
+
+	spin_unlock(&fdb->lock);
+
+	page_db_release(pdb);
+	page_db_put(pdb);
+}
+
+/* Marks as mapped & drops reference. */
+static void
+page_db_got_mapped(struct page_db *pdb)
+{
+	__set_bit(MAPPED_BIT, (unsigned long *)&pdb->flags);
+
+	page_db_release(pdb);
+	page_db_put(pdb);
+}
+
+/* Gets exclusive access for unmapping. */
+static struct page_db *
+page_db_begin_unmap(struct file_db *fdb, unsigned long map_hva)
+{
+	struct page_db *pdb;
+	int result;
+
+	spin_lock(&fdb->lock);
+
+	pdb = client_db_hva_search(fdb, map_hva);
+	if (likely(pdb != NULL))
+		refcount_inc(&pdb->refcnt);
+
+	spin_unlock(&fdb->lock);
+
+	if (pdb == NULL)
+		return NULL;
+
+retry:
+	result = wait_on_bit((unsigned long *)&pdb->flags, BUSY_BIT,
+			     TASK_KILLABLE);
+	/* non-zero if interrupted by a signal */
+	if (unlikely(result != 0))
+		return ERR_PTR(-EINTR);
+
+	/* try to set the bit & retry the wait if we lost the race */
+	if (test_and_set_bit(BUSY_BIT, (unsigned long *)&pdb->flags))
+		goto retry;
+
+	return pdb;
+}
+
+/* Marks as unmapped, removes from tree & drops reference. */
+static void
+page_db_end_unmap(struct file_db *fdb, struct page_db *pdb)
+{
+	__clear_bit(MAPPED_BIT, (unsigned long *)&pdb->flags);
+
+	spin_lock(&fdb->lock);
+
+	client_db_hva_remove(fdb, pdb);
+	page_db_put(pdb);
+
+	spin_unlock(&fdb->lock);
+
+	page_db_release(pdb);
+	page_db_put(pdb);
+}
+
+static int
+page_db_add_target(struct page_db *pdb, struct mm_struct *target,
+		   struct mm_struct *client)
+{
+	struct target_db *tdb;
+	int result = 0;
+
+	/*
+	 * returns a valid pointer or an error value, never NULL
+	 * also gets reference to entry
+	 */
+	tdb = target_db_lookup_or_add(target);
+	if (IS_ERR_VALUE(tdb))
+		return PTR_ERR(tdb);
+
+	/* target-side locking */
+	spin_lock(&tdb->lock);
+
+	/* check that target is not introspected by someone else */
+	if (tdb->client != NULL && tdb->client != client)
+		result = -EINVAL;
+	else {
+		tdb->client = client;
+		target_db_insert(tdb, pdb);
+	}
+
+	spin_unlock(&tdb->lock);
+
+	target_db_put(tdb);
+
+	return result;
+}
+
+static int
+page_db_remove_target(struct page_db *pdb)
+{
+	struct target_db *tdb;
+	int result = 0;
+
+	/* find target entry in the database */
+	tdb = target_db_lookup(pdb->target);
+	if (tdb == NULL)
+		return -ENOENT;
+
+	/* target-side locking */
+	spin_lock(&tdb->lock);
+
+	/* remove mapping from target */
+	target_db_remove(tdb, pdb);
+
+	/* clear the client if no more mappings */
+	if (target_db_empty(tdb)) {
+		tdb->client = NULL;
+		pr_debug("%s: all mappings gone for target mm %016lx\n",
+			__func__, (unsigned long)pdb->target);
+	}
+
+	spin_unlock(&tdb->lock);
+
+	target_db_put(tdb);
+
+	return result;
+}
+
+/* Last resort if the client memory was unmapped before ioctl(REMOTE_UNMAP) */
+static bool not_mapped_in_client(struct page_db *pdb)
+{
+	int numpages;
+	struct page *req_page;
+	bool result = false;
+
+	numpages = __get_user_pages_fast(pdb->map_hva, 1, 0, &req_page);
+	if (numpages == 0)
+		return true;
+
+	/* page was munmapped & replaced by a normal page */
+	if (!PageRemote(req_page))
+		result = true;
+
+	put_page(req_page);
+	return result;
+}
+
+/*
+ * Clear all the links to a target at once.
+ */
+static void mm_remote_db_cleanup_target(struct client_db *cdb,
+					struct target_db *tdb)
+{
+	struct page_db *pdb, *npdb;
+
+	/* if we ended up here the target must be introspected */
+	ASSERT(tdb->client != NULL);
+	tdb->client = NULL;
+
+	/*
+	 * walk the tree & clear links to target - this function is serialized
+	 * with respect to the main loop in mm_remote_db_client_release() so
+	 * there will be no race on pdb->target
+	 */
+	list_for_each_entry_safe(pdb, npdb, &tdb->pages_list, target_link) {
+		if (mm_is_oom_victim(cdb->mm) || not_mapped_in_client(pdb))
+			mm_remote_do_unmap_target(pdb);
+
+		list_del(&pdb->target_link);
+		pdb->target = NULL;
+	}
+}
+
+/*
+ * A client file is closing. No race with operations of file is possible.
+ */
+static void mm_remote_db_file_release(struct file_db *fdb)
+{
+	struct client_db *cdb = fdb->cdb;
+	struct page_db *pdb, *npdb;
+	struct target_db *tdb;
+
+	if (!client_db_hva_empty(fdb))
+		pr_debug("%s: client file %016lx has mappings\n",
+			__func__, (unsigned long)fdb);
+
+	/* iterate the tree of mappings */
+	rbtree_postorder_for_each_entry_safe(pdb, npdb, &fdb->rb_root, file_link) {
+		/* pdb->target may have been cleared above, so cache it first */
+		struct mm_struct *req_mm = pdb->target;
+
+		/* see comments in function above */
+		if (req_mm == NULL)
+			goto just_free;
+
+		/* pin target to avoid race with mm_remote_db_target_release() */
+		if (mmget_not_zero(req_mm)) {
+
+			/* pin entry for target - maybe it has been released */
+			tdb = target_db_lookup(req_mm);
+			if (tdb != NULL) {
+				/* see comments of this function */
+				mm_remote_db_cleanup_target(cdb, tdb);
+
+				/* unpin entry for target */
+				target_db_put(tdb);
+			}
+
+			mmput(req_mm);
+		}
+
+	just_free:
+		/* invalidate links to client */
+		RB_CLEAR_NODE(&pdb->file_link);
+
+		if (!mm_is_oom_victim(cdb->mm))
+			mm_remote_do_unmap(cdb->mm, pdb->map_hva);
+
+		page_db_put(pdb);
+	}
+
+	/* clear root of tree */
+	fdb->rb_root = RB_ROOT;
+}
+
+/*
+ * The client is closing. This means the normal mapping/unmapping logic
+ * does not work anymore. No more locking needed.
+ */
+static void mm_remote_db_client_release(struct client_db *cdb)
+{
+	struct file_db *fdb = cdb->pseudo;
+
+	if (fdb == NULL)
+		return;
+
+	pr_debug("%s: client %016lx has special file\n",
+		__func__, (unsigned long) cdb);
+
+	mm_remote_db_file_release(fdb);
+	file_db_free(fdb);
+}
+
+/*
+ * Called when a target exits and the page must be marked as stale and the
+ * target-side anon-vma released.
+ * This function will not race with mm_remote_remap(), since a reference to the
+ * target MM is taken before mapping being done.
+ * This function may race with mm_remote_do_unmap(), so a check must be
+ * done under page lock to make sure the page is still remote mapped.
+ * After this is run, the pages are still remote mapped pages, but the rmap
+ * only points to the client.
+ */
+static int mm_remote_make_stale(struct page_db *pdb)
+{
+	struct mm_struct *req_mm = pdb->target;
+	struct vm_area_struct *req_vma;
+	struct page *req_page;
+	int result = 0;
+
+	/* this allows faulting to happen */
+	down_read(&req_mm->mmap_sem);
+
+	/* find VMA containing address */
+	req_vma = find_vma(req_mm, pdb->req_hva);
+	if (unlikely(req_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no remote VMA found for stalling\n");
+		goto out_unlock;
+	}
+
+	/* should be available & unevictable */
+	req_page = follow_page(req_vma, pdb->req_hva, FOLL_MIGRATION | FOLL_GET);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		pr_err("follow_page() failed: %d\n", result);
+		goto out_unlock;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out_unlock;
+	}
+
+	/* access to RMAP components of PDB can only be done under page lock */
+	lock_page(req_page);
+
+	if (likely(PageRemote(req_page))) {
+		ASSERT(pdb->req_anon_vma == req_vma->anon_vma);
+		/* just release target anon_vma - the page will be temporarily
+		 * left with increased mapcount & refcount, which will be
+		 * decremented when the page is unmapped from the target mm
+		 */
+		put_anon_vma(pdb->req_anon_vma);
+		pdb->req_anon_vma = NULL;
+	}
+
+	unlock_page(req_page);
+
+	put_page(req_page);	/* follow_page(... FOLL_GET) */
+
+out_unlock:
+	up_read(&req_mm->mmap_sem);
+
+	return result;
+}
+
+static int mm_remote_make_stale_client(struct mm_struct *map_mm,
+				       struct page_db *pdb)
+{
+	struct vm_area_struct *map_vma;
+	struct page *req_page;
+
+	int result = 0;
+
+	down_read(&map_mm->mmap_sem);
+
+	map_vma = find_vma(map_mm, pdb->map_hva);
+	if (unlikely(map_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no client VMA found for stalling\n");
+		goto out_unlock;
+	}
+
+	/* should be available & unevictable */
+	req_page = follow_page(map_vma, pdb->map_hva, FOLL_MIGRATION | FOLL_GET);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		pr_err("follow_page() failed: %d\n", result);
+		goto out_unlock;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out_unlock;
+	}
+
+	/* access to RMAP components of PDB can only be done under page lock */
+	lock_page(req_page);
+
+	if (likely(PageRemote(req_page))) {
+		/* just release target anon_vma - the page will be temporarily
+		 * left with increased mapcount & refcount, which will be
+		 * decremented when the page is unmapped from the target mm
+		 */
+		put_anon_vma(pdb->req_anon_vma);
+		pdb->req_anon_vma = NULL;
+	}
+
+	unlock_page(req_page);
+
+	put_page(req_page);	/* follow_page(... FOLL_GET) */
+
+out_unlock:
+	up_read(&map_mm->mmap_sem);
+
+	return result;
+}
+
+/*
+ * The target MM is closing. This means the pages are unmapped by the default
+ * kernel logic on the target side, but we must mark the entries as stale.
+ * This function won't race with the mapping function since we get here
+ * on target MM teardown and the mapping function won't be able to get a
+ * reference to the target MM.
+ * This function may race with the unmapping function, but
+ * access will be done only on the target-side components.
+ */
+static void mm_remote_db_target_release(struct target_db *tdb)
+{
+	struct mm_struct *map_mm;
+	struct page_db *pdb, *npdb;
+
+	/* no client, nothing to do */
+	if (tdb->client == NULL) {
+		ASSERT(target_db_empty(tdb));
+		return;
+	}
+
+	map_mm = tdb->client;
+	tdb->client = NULL;
+
+	/* if the target is killed by OOM, try to pin the client */
+	if (mm_is_oom_victim(tdb->mm) && !mmget_not_zero(map_mm)) {
+		/* out of luck, just unlink from the list */
+		list_for_each_entry_safe(pdb, npdb, &tdb->pages_list, target_link) {
+			list_del(&pdb->target_link);
+			pdb->target = NULL;
+		}
+
+		return;
+	}
+
+	/*
+	 * all the entries in this tree must be made stale,
+	 * but not removed from the client tree
+	 */
+	list_for_each_entry_safe(pdb, npdb, &tdb->pages_list, target_link) {
+		if (!mm_is_oom_victim(tdb->mm))
+			mm_remote_make_stale(pdb);
+		else
+			mm_remote_make_stale_client(map_mm, pdb);
+
+		list_del(&pdb->target_link);
+		pdb->target = NULL;
+	}
+
+	/* client has been pinned before */
+	if (mm_is_oom_victim(tdb->mm))
+		mmput(map_mm);
+}
+
+static void tdb_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct target_db *tdb = container_of(mn, struct target_db, mn);
+
+	pr_debug("%s: mm %016lx\n", __func__, (unsigned long)mm);
+
+	/* at this point other threads may already have hold of this tdb */
+	target_db_put(tdb);
+}
+
+static void cdb_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct client_db *cdb = container_of(mn, struct client_db, mn);
+
+	pr_debug("%s: mm %016lx\n", __func__, (unsigned long)mm);
+
+	/* at this point other threads may already have hold of this cdb */
+	client_db_put(cdb);
+}
+
+static void mm_remote_page_unevictable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+static void mm_remote_page_evictable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+	else {
+		if (PageUnevictable(page))
+			count_vm_event(UNEVICTABLE_PGSTRANDED);
+	}
+}
+
+void rmap_walk_remote(struct page *page, struct rmap_walk_control *rwc)
+{
+	struct page_db *pdb;
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff_start, pgoff_end;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(!PageRemote(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	pdb = (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+
+	/* iterate over the original anon_vma */
+	anon_vma = pdb->req_anon_vma;
+	if (anon_vma != NULL) {
+		anon_vma_lock_read(anon_vma);
+		pgoff_start = page_to_pgoff(page);
+		pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+					       pgoff_start, pgoff_end) {
+			vma = avc->vma;
+			address = vma_address(page, vma);
+
+			cond_resched();
+
+			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
+				continue;
+
+			if (!rwc->rmap_one(page, vma, address, rwc->arg))
+				break;
+
+			if (rwc->done && rwc->done(page))
+				break;
+		}
+		anon_vma_unlock_read(anon_vma);
+	}
+
+	/* iterate over the client anon_vma */
+	anon_vma = pdb->map_anon_vma;
+	if (anon_vma != NULL) {
+		anon_vma_lock_read(anon_vma);
+		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+					       0, ULONG_MAX) {
+			vma = avc->vma;
+			address = pdb->map_hva;
+
+			cond_resched();
+
+			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
+				continue;
+
+			if (!rwc->rmap_one(page, vma, address, rwc->arg))
+				break;
+
+			if (rwc->done && rwc->done(page))
+				break;
+		}
+		anon_vma_unlock_read(anon_vma);
+	}
+}
+
+static int mm_remote_invalidate_pte(struct vm_area_struct *map_vma,
+	unsigned long map_hva, pmd_t *map_pmd, struct page *map_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+	struct mmu_notifier_range range;
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	swp_entry_t entry;
+	int result = 0;
+
+	mmun_start = map_hva;
+	mmun_end = map_hva + PAGE_SIZE;
+	mmu_notifier_range_init(&range, map_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptep = pte_offset_map_lock(map_mm, map_pmd, map_hva, &ptl);
+
+	/* remove reverse mapping - the caller needs to hold the pte lock */
+	if (likely(map_page != NULL)) {
+		page_remove_rmap(map_page, false);
+
+		/* the zero_page is not anonymous */
+		if (!is_zero_pfn(pte_pfn(*ptep)))
+			dec_mm_counter(map_mm, MM_ANONPAGES);
+
+		/* clear old PTE entry */
+		flush_cache_page(map_vma, map_hva, pte_pfn(*ptep));
+		ptep_clear_flush_notify(map_vma, map_hva, ptep);
+
+		atomic_inc(&stat_mapped_pte);
+	} else {
+		/* fresh PTE or has been cleared before */
+		if (likely(pte_none(*ptep))) {
+			atomic_inc(&stat_empty_pte);
+			goto out_unlock;
+		}
+
+		/* a page was faulted in after follow_page() returned NULL */
+		if (unlikely(pte_present(*ptep))) {
+			atomic_inc(&stat_refault);
+			result = -EAGAIN;
+			goto out_unlock;
+		}
+
+		/* must be swap entry */
+		entry = pte_to_swp_entry(*ptep);
+		/* follow_page(... | FOLL_MIGRATION | ...) */
+		ASSERT(!is_migration_entry(entry));
+		free_swap_and_cache(entry);
+		ptep_clear_flush(map_vma, map_hva, ptep);
+
+		atomic_inc(&stat_swap_pte);
+	}
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return result;
+}
+
+static int mm_remote_install_pte(struct vm_area_struct *map_vma,
+	unsigned long map_hva, pmd_t *map_pmd, struct page *req_page)
+{
+	struct mm_struct *map_mm = map_vma->vm_mm;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
+	int result = 0;
+
+	ptep = pte_offset_map_lock(map_mm, map_pmd, map_hva, &ptl);
+
+	/* a page was faulted in */
+	if (unlikely(pte_present(*ptep))) {
+		atomic_inc(&stat_refault);
+		result = -EAGAIN;
+		goto out_unlock;
+	}
+
+	/* create new PTE based on requested page */
+	pte = mk_pte(req_page, map_vma->vm_page_prot);
+	if (map_vma->vm_flags & VM_WRITE)
+		pte = pte_mkwrite(pte_mkdirty(pte));
+	set_pte_at_notify(map_mm, map_hva, ptep, pte);
+
+	inc_mm_counter(map_mm, MM_ANONPAGES);
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+
+	return result;
+}
+
+static void mm_remote_put_req(struct page *req_page,
+			      struct anon_vma *req_anon_vma)
+{
+	if (req_anon_vma)
+		put_anon_vma(req_anon_vma);
+
+	if (req_page)
+		put_page(req_page);
+}
+
+static int mm_remote_get_req(struct mm_struct *req_mm, unsigned long req_hva,
+			     struct page **preq_page,
+			     struct anon_vma **preq_anon_vma)
+{
+	struct page *req_page = NULL;
+	struct anon_vma *req_anon_vma = NULL;
+	long nrpages;
+	int result = 0;
+
+	/* for now we're using both pointers */
+	ASSERT(preq_page != NULL);
+	ASSERT(preq_anon_vma != NULL);
+
+	if (check_stable_address_space(req_mm)) {
+		pr_err("address space of target not stable");
+		return -EINVAL;
+	}
+
+	down_read(&req_mm->mmap_sem);
+
+	/* get host page corresponding to requested address */
+	nrpages = get_user_pages_remote(NULL, req_mm, req_hva, 1,
+		FOLL_WRITE | FOLL_FORCE | FOLL_SPLIT | FOLL_MIGRATION,
+		&req_page, NULL, NULL);
+	if (unlikely(nrpages == 0)) {
+		pr_err("no page for req_hva %016lx\n", req_hva);
+		result = -ENOENT;
+		goto out;
+	} else if (IS_ERR_VALUE(nrpages)) {
+		result = nrpages;
+		if (result == -EBUSY)
+			pr_debug("get_user_pages_remote() failed: %d\n", result);
+		else
+			pr_err("get_user_pages_remote() failed: %d\n", result);
+		goto out;
+	}
+
+	/* limit introspection to anon memory (this also excludes zero-page) */
+	if (!PageAnon(req_page)) {
+		result = -EINVAL;
+		pr_err("page at req_hva %016lx not anon\n", req_hva);
+		goto out;
+	}
+
+	/* refuse to remote-map a page that is already remote-mapped */
+	if (PageRemote(req_page)) {
+		result = -EALREADY;
+		pr_err("page at req_hva %016lx already mapped\n", req_hva);
+		goto out;
+	}
+
+	/* take & lock this anon vma */
+	req_anon_vma = page_get_anon_vma(req_page);
+	if (unlikely(req_anon_vma == NULL)) {
+		result = -EINVAL;
+		pr_err("no anon vma for req_hva %016lx\n", req_hva);
+		goto out;
+	}
+
+	/* output these values only if successful */
+	*preq_page = req_page;
+	*preq_anon_vma = req_anon_vma;
+
+out:
+	up_read(&req_mm->mmap_sem);
+
+	if (result)
+		mm_remote_put_req(req_page, req_anon_vma);
+
+	return result;
+}
+
+static int mm_remote_remap(struct mm_struct *map_mm, unsigned long map_hva,
+			   struct page *req_page, struct anon_vma *req_anon_vma,
+			   struct page_db *pdb)
+{
+	struct vm_area_struct *map_vma;
+	pmd_t *map_pmd;
+	struct page *map_page = NULL;
+	int result = 0;
+
+	/* this allows faulting to happen */
+	down_read(&map_mm->mmap_sem);
+
+	/* find VMA containing address */
+	map_vma = find_vma(map_mm, map_hva);
+	if (unlikely(map_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no local VMA found for remapping\n");
+		goto out_unlock;
+	}
+
+	if (unlikely(!vma_is_anonymous(map_vma))) {
+		result = -EINVAL;
+		pr_err("local VMA is not anonymous\n");
+		goto out_unlock;
+	}
+	ASSERT(map_vma->anon_vma != NULL);
+
+retry:
+	/*
+	 * get reference to local page corresponding to target address;
+	 * the result may be NULL in case of swap entry or mapping not present
+	 */
+	map_page = follow_page(map_vma, map_hva,
+			       FOLL_SPLIT | FOLL_MIGRATION | FOLL_GET);
+	if (IS_ERR_VALUE(map_page)) {
+		result = PTR_ERR(map_page);
+		pr_debug("%s: follow_page() failed: %d\n", __func__, result);
+		goto out_unlock;
+	}
+
+	/* in case of THP, the huge page must be split before the PMD exists */
+	map_pmd = mm_find_pmd(map_mm, map_hva);
+	if (unlikely(!map_pmd)) {
+		/* follow_page(... | FOLL_GET) */
+		if (map_page != NULL)
+			put_page(map_page);
+		result = -EFAULT;
+		pr_err("local PMD not found");
+		goto out_unlock;
+	}
+
+	/* unmap map_page from current page tables */
+	if (map_page != NULL)
+		lock_page(map_page);
+
+	/* the only possible error is -EAGAIN when map_page == NULL */
+	result = mm_remote_invalidate_pte(map_vma, map_hva, map_pmd, map_page);
+	if (IS_ERR_VALUE((long)result))
+		goto retry;
+
+	if (map_page != NULL)
+		unlock_page(map_page);
+
+	/* we're done with this page */
+	if (map_page != NULL) {
+		/* reference acquired in follow_page(... | FOLL_GET) */
+		put_page(map_page);
+		free_page_and_swap_cache(map_page);
+	}
+
+	/* map req_page at the same address - page is already PageRemote() */
+	lock_page(req_page);
+
+	/* the only possible error is -EAGAIN when PTE != pte_none() */
+	result = mm_remote_install_pte(map_vma, map_hva, map_pmd, req_page);
+	if (IS_ERR_VALUE((long)result)) {
+		unlock_page(req_page);
+		goto retry;
+	}
+
+	/* increment its reference to outlive OOM */
+	get_anon_vma(map_vma->anon_vma);
+	pdb->map_anon_vma = map_vma->anon_vma;
+
+	/* will only increment the mapcount of this page */
+	page_add_anon_rmap(req_page, map_vma, map_hva, false);
+
+	unlock_page(req_page);
+
+	/* local accounting */
+	atomic_inc(&map_count);
+
+out_unlock:
+	up_read(&map_mm->mmap_sem);
+
+	return result;
+}
+
+static int mm_remote_promote_page(struct page *req_page,
+				  struct anon_vma *req_anon_vma,
+				  struct page_db *pdb)
+{
+	int result = 0;
+
+	lock_page(req_page);
+
+	/*
+	 * maybe some other thread mapping the same page in another file
+	 * reached here before us
+	 */
+	if (PageRemote(req_page)) {
+		result = -EALREADY;
+		goto out_unlock;
+	}
+
+	/* make this page remote, mapped only under the target */
+	pdb->req_anon_vma = req_anon_vma;
+	req_page->mapping = PageMapping(pdb);
+
+	mm_remote_page_unevictable(req_page);
+	atomic_inc(&rpg_count);
+
+out_unlock:
+	unlock_page(req_page);
+
+	return result;
+}
+
+static void mm_remote_revert_promote(struct page *req_page)
+{
+	struct page_db *pdb;
+
+	/* the page must have been made remote by this thread */
+	ASSERT(PageRemote(req_page));
+
+	lock_page(req_page);
+
+	pdb = RemoteMapping(req_page);
+
+	/* revert the mapping back to anon page mapped under target */
+	req_page->mapping = (void *)pdb->req_anon_vma + PAGE_MAPPING_ANON;
+	pdb->req_anon_vma = NULL;
+
+	mm_remote_page_evictable(req_page);
+	BUG_ON(atomic_add_negative(-1, &rpg_count));
+
+	unlock_page(req_page);
+}
+
+static int mm_remote_do_map(struct mm_struct *req_mm, unsigned long req_hva,
+			    struct mm_struct *map_mm, unsigned long map_hva,
+			    struct page_db *pdb)
+{
+	struct page *req_page;
+	struct anon_vma *req_anon_vma;
+	int result;
+
+	result = mm_remote_get_req(req_mm, req_hva, &req_page, &req_anon_vma);
+	if (IS_ERR_VALUE((long)result))
+		return result;
+
+	result = mm_remote_promote_page(req_page, req_anon_vma, pdb);
+	if (IS_ERR_VALUE((long)result))
+		goto out_put;
+
+	result = mm_remote_remap(map_mm, map_hva, req_page, req_anon_vma, pdb);
+	if (IS_ERR_VALUE((long)result))
+		goto out_revert;
+
+	return 0;
+
+out_revert:
+	mm_remote_revert_promote(req_page);
+out_put:
+	mm_remote_put_req(req_page, req_anon_vma);
+
+	return result;
+}
+
+static int mm_remote_map_file(struct file_db *fdb, struct mm_struct *req_mm,
+			      unsigned long req_hva, unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct page_db *pdb;
+	int result = 0;
+
+	/* tries to add the entry in the tree */
+	pdb = page_db_reserve(fdb, req_mm, req_hva, map_hva);
+	if (IS_ERR_VALUE(pdb))
+		return PTR_ERR(pdb);
+
+	/* do the actual memory mapping */
+	result = mm_remote_do_map(req_mm, req_hva, map_mm, map_hva, pdb);
+	if (IS_ERR_VALUE((long)result))
+		goto out_pdb;
+
+	/* add mapping to target database */
+	result = page_db_add_target(pdb, req_mm, map_mm);
+	if (IS_ERR_VALUE((long)result)) {
+		mm_remote_do_unmap(map_mm, map_hva);
+		goto out_pdb;
+	}
+
+	/* marks as mapped & drops reference */
+	page_db_got_mapped(pdb);
+
+	return 0;
+
+out_pdb:
+	/* removes the entry from the tree & drops reference */
+	page_db_unreserve(fdb, pdb);
+
+	return result;
+}
+
+int mm_remote_map(struct mm_struct *req_mm,
+		  unsigned long req_hva, unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct client_db *cdb;
+	struct file_db *fdb;
+	int result = 0;
+
+	pr_debug("%s: req_mm %016lx, req_hva %016lx, map_hva %016lx\n",
+		__func__, (unsigned long)req_mm, req_hva, map_hva);
+
+	cdb = client_db_lookup_or_add(map_mm);
+	if (IS_ERR_OR_NULL(cdb))
+		return (cdb == NULL) ? -ENOMEM : PTR_ERR(cdb);
+
+	fdb = client_db_pseudo_file(cdb);
+	if (fdb == NULL) {
+		result = -ENOMEM;
+		goto out_cdb;
+	}
+
+	/* try to pin the target MM so it won't go away */
+	if (!mmget_not_zero(req_mm)) {
+		result = -EINVAL;
+		goto out_cdb;
+	}
+
+	result = mm_remote_map_file(fdb, req_mm, req_hva, map_hva);
+	mmput(req_mm);
+
+out_cdb:
+	client_db_put(cdb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(mm_remote_map);
+
+static int mm_remote_do_unmap(struct mm_struct *map_mm, unsigned long map_hva)
+{
+	struct vm_area_struct *map_vma;
+	pmd_t *map_pmd;
+	struct page *req_page = NULL;
+	struct anon_vma *req_anon_vma = NULL;
+	struct page_db *pdb;
+	int result = 0;
+
+	/* this allows faulting to happen */
+	down_read(&map_mm->mmap_sem);
+
+	/* find destination VMA for mapping */
+	map_vma = find_vma(map_mm, map_hva);
+	if (unlikely(map_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no local VMA found for unmapping\n");
+		goto out;
+	}
+
+	map_pmd = mm_find_pmd(map_mm, map_hva);
+	if (unlikely(!map_pmd)) {
+		result = -EFAULT;
+		pr_err("local PMD not found");
+		goto out;
+	}
+
+	/* get page mapped to destination address - we know it is there */
+	req_page = follow_page(map_vma, map_hva, FOLL_GET | FOLL_MIGRATION);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		req_page = NULL;
+		pr_err("follow_page() failed: %d\n", result);
+		goto out;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out;
+	}
+
+	ASSERT(PageRemote(req_page));
+	pdb = RemoteMapping(req_page);
+
+	lock_page(req_page);
+
+	/* also calls page_remove_rmap() */
+	mm_remote_invalidate_pte(map_vma, map_hva, map_pmd, req_page);
+
+	req_anon_vma = pdb->req_anon_vma;
+	pdb->req_anon_vma = NULL;
+
+	/* restore original rmap */
+	req_page->mapping = (void *)req_anon_vma + PAGE_MAPPING_ANON;
+	mm_remote_page_evictable(req_page);
+	BUG_ON(atomic_add_negative(-1, &rpg_count));
+
+	/* refcount was increased in mm_remote_remap() */
+	put_anon_vma(pdb->map_anon_vma);
+	pdb->map_anon_vma = NULL;
+
+	unlock_page(req_page);
+
+	/* follow_page(..., FOLL_GET...) */
+	put_page(req_page);
+
+	BUG_ON(atomic_add_negative(-1, &map_count));
+
+	/* reference count was incremented in mm_remote_get_req() */
+	mm_remote_put_req(req_page, req_anon_vma);
+
+out:
+	up_read(&map_mm->mmap_sem);
+
+	return result;
+}
+
+/*
+ * In case the client's memory is reaped by the OOM killer, the remote pages'
+ * reference count + mapcount is dropped and they belong just to the target.
+ */
+static int mm_remote_do_unmap_target(struct page_db *pdb)
+{
+	struct mm_struct *req_mm = pdb->target;
+	struct vm_area_struct *req_vma;
+	struct page *req_page = NULL;
+	struct anon_vma *req_anon_vma = NULL;
+	int result = 0;
+
+	down_read(&req_mm->mmap_sem);
+
+	req_vma = find_vma(req_mm, pdb->req_hva);
+	if (unlikely(req_vma == NULL)) {
+		result = -ENOENT;
+		pr_err("no source VMA found for unmapping\n");
+		goto out;
+	}
+
+	/* page is unevictable - should be mapped */
+	req_page = follow_page(req_vma, pdb->req_hva, FOLL_GET | FOLL_MIGRATION);
+	if (IS_ERR_VALUE(req_page)) {
+		result = PTR_ERR(req_page);
+		req_page = NULL;
+		pr_err("follow_page() failed: %d\n", result);
+		goto out;
+	} else if (unlikely(req_page == NULL)) {
+		result = -ENOENT;
+		pr_err("follow_page() returned no page\n");
+		goto out;
+	}
+
+	ASSERT(PageRemote(req_page));
+	ASSERT(pdb == RemoteMapping(req_page));
+
+	/*
+	 * page_remove_rmap() must have been called when the page was unmapped
+	 * from the client, now we must have a higher refcount from
+	 * follow_page(...FOLL_GET...)
+	 */
+
+	lock_page(req_page);
+
+	req_anon_vma = pdb->req_anon_vma;
+	pdb->req_anon_vma = NULL;
+
+	/* restore original rmap */
+	req_page->mapping = (void *)req_anon_vma + PAGE_MAPPING_ANON;
+	mm_remote_page_evictable(req_page);
+	BUG_ON(atomic_add_negative(-1, &rpg_count));
+
+	/* refcount was increased in mm_remote_remap() */
+	put_anon_vma(pdb->map_anon_vma);
+	pdb->map_anon_vma = NULL;
+
+	unlock_page(req_page);
+
+	BUG_ON(atomic_add_negative(-1, &map_count));
+
+	/* client doesn't map this page anymore, a single refcount to drop */
+	mm_remote_put_req(req_page, req_anon_vma);
+
+out:
+	up_read(&req_mm->mmap_sem);
+
+	return result;
+}
+
+static int mm_remote_unmap_file(struct file_db *fdb, unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct mm_struct *req_mm;
+	struct page_db *pdb;
+	int result;
+
+	/* take exclusive access to this pdb */
+	pdb = page_db_begin_unmap(fdb, map_hva);
+	if (IS_ERR_OR_NULL(pdb))
+		return (pdb == NULL) ? -ENOENT : PTR_ERR(pdb);
+
+	/* test if another thread unmapped this address before us */
+	if (!test_bit(MAPPED_BIT, (unsigned long *)&pdb->flags)) {
+		result = -EALREADY;
+		goto just_release;
+	}
+
+	/* also disconnect from target - can fail if target exited */
+	result = page_db_remove_target(pdb);
+	if (IS_ERR_VALUE((long)result))
+		pr_debug("%s: page_db_remove_target() failed: %d\n",
+			__func__, result);
+
+	/* the unmapping is done on local mm only */
+	result = mm_remote_do_unmap(map_mm, map_hva);
+	if (IS_ERR_VALUE((long)result)) {
+		pr_debug("%s: mm_remote_do_unmap() failed: %d, trying target\n",
+			__func__, result);
+
+		req_mm = pdb->target;
+		if (mmget_not_zero(req_mm)) {
+			result = mm_remote_do_unmap_target(pdb);
+
+			mmput(req_mm);
+		}
+	}
+
+just_release:
+	/* marks as unmapped & drops reference */
+	page_db_end_unmap(fdb, pdb);
+
+	return result;
+}
+
+int mm_remote_unmap(unsigned long map_hva)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct client_db *cdb;
+	struct file_db *fdb;
+	int result;
+
+	pr_debug("%s: map_hva %016lx\n", __func__, map_hva);
+
+	cdb = client_db_lookup_or_add(map_mm);
+	if (IS_ERR_OR_NULL(cdb))
+		return (cdb == NULL) ? -ENOMEM : PTR_ERR(cdb);
+
+	fdb = client_db_pseudo_file(cdb);
+	if (fdb == NULL) {
+		result = -ENOMEM;
+		goto out_cdb;
+	}
+
+	result = mm_remote_unmap_file(fdb, map_hva);
+
+out_cdb:
+	client_db_put(cdb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(mm_remote_unmap);
+
+/* called on behalf of the client */
+void mm_remote_reset(void)
+{
+	struct mm_struct *map_mm = current->mm;
+	struct client_db *cdb;
+
+	pr_debug("%s\n", __func__);
+
+	/* also gets reference to entry */
+	cdb = client_db_lookup(map_mm);
+	if (cdb == NULL)
+		return;
+
+	/* no locking here, we have exclusive access */
+	mm_remote_db_client_release(cdb);
+
+	client_db_put(cdb);
+}
+EXPORT_SYMBOL_GPL(mm_remote_reset);
+
+static int remmap_dev_open(struct inode *inodep, struct file *filp)
+{
+	struct file_db *fdb;
+	struct client_db *cdb;
+	int result = 0;
+
+	fdb = file_db_alloc();
+	if (fdb == NULL)
+		return -ENOMEM;
+
+	/* we need the mm to exist at file closing time */
+	mmget(current->mm);
+
+	cdb = client_db_lookup_or_add(current->mm);
+	if (IS_ERR_OR_NULL(cdb)) {
+		result = (cdb == NULL) ? -ENOMEM : PTR_ERR(cdb);
+		goto out_err;
+	}
+
+	fdb->cdb = cdb;
+	filp->private_data = fdb;
+
+	/* by pinning the mm we also make sure the cdb does not get released */
+	client_db_put(cdb);
+
+	return 0;
+
+out_err:
+	mmput(current->mm);
+	file_db_free(fdb);
+
+	return result;
+}
+
+static long remmap_dev_ioctl(struct file *filp, unsigned int ioctl,
+			     unsigned long arg)
+{
+	void __user *argp = (void __user *) arg;
+	struct file_db *fdb = filp->private_data;
+	struct client_db *cdb = fdb->cdb;
+	long result = 0;
+
+	if (current->mm != cdb->mm) {
+		pr_err("ioctl request by different process\n");
+		return -EINVAL;
+	}
+
+	switch (ioctl) {
+	case REMOTE_MAP: {
+		struct remote_map_request req;
+		struct task_struct *req_task;
+		struct mm_struct *req_mm;
+
+		result = -EFAULT;
+		if (copy_from_user(&req, argp, sizeof(req)))
+			break;
+
+		result = -EINVAL;
+		if (!access_ok(req.map_hva, PAGE_SIZE))
+			break;
+		if (req.req_hva & ~PAGE_MASK)
+			break;
+		if (req.map_hva & ~PAGE_MASK)
+			break;
+
+		result = -ESRCH;
+		req_task = find_get_task_by_vpid(req.req_pid);
+		if (req_task == NULL)
+			break;
+
+		result = -EINVAL;
+		req_mm = get_task_mm(req_task);
+		put_task_struct(req_task);
+		if (req_mm == NULL)
+			break;
+
+		result = mm_remote_map_file(fdb, req_mm, req.req_hva, req.map_hva);
+		mmput(req_mm);
+
+		break;
+	}
+
+	case REMOTE_UNMAP: {
+		unsigned long map_hva = (unsigned long) arg;
+
+		result = -EINVAL;
+		if (!access_ok(map_hva, PAGE_SIZE))
+			break;
+		if (map_hva & ~PAGE_MASK)
+			break;
+
+		result = mm_remote_unmap_file(fdb, map_hva);
+
+		break;
+	}
+
+	default:
+		pr_err("ioctl %d not implemented\n", ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static int remmap_dev_release(struct inode *inodep, struct file *filp)
+{
+	struct file_db *fdb = filp->private_data;
+	struct client_db *cdb = fdb->cdb;
+	struct mm_struct *mm = cdb->mm;
+
+	mm_remote_db_file_release(fdb);
+	file_db_free(fdb);
+
+	/*
+	 * we may have reached here by killing the client process,
+	 * current->mm is not accessible anymore
+	 */
+	mmput(mm);
+
+	return 0;
+}
+
+static const struct file_operations remmap_ops = {
+	.open = remmap_dev_open,
+	.unlocked_ioctl = remmap_dev_ioctl,
+	.compat_ioctl = remmap_dev_ioctl,
+	.release = remmap_dev_release,
+};
+
+static struct miscdevice remmap_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "remote-map",
+	.fops = &remmap_ops,
+};
+
+builtin_misc_device(remmap_dev);
+
+#ifdef CONFIG_DEBUG_FS
+static void __init mm_remote_debugfs_init(void)
+{
+	mm_remote_debugfs_dir = debugfs_create_dir("remote_mapping", NULL);
+	if (mm_remote_debugfs_dir == NULL)
+		return;
+
+	debugfs_create_atomic_t("map_count", 0444, mm_remote_debugfs_dir,
+				&map_count);
+	debugfs_create_atomic_t("pdb_count", 0444, mm_remote_debugfs_dir,
+				&pdb_count);
+	debugfs_create_atomic_t("rpg_count", 0444, mm_remote_debugfs_dir,
+				&rpg_count);
+
+	debugfs_create_atomic_t("stat_empty_pte", 0444, mm_remote_debugfs_dir,
+				&stat_empty_pte);
+	debugfs_create_atomic_t("stat_mapped_pte", 0444, mm_remote_debugfs_dir,
+				&stat_mapped_pte);
+	debugfs_create_atomic_t("stat_swap_pte", 0444, mm_remote_debugfs_dir,
+				&stat_swap_pte);
+	debugfs_create_atomic_t("stat_refault", 0444, mm_remote_debugfs_dir,
+				&stat_refault);
+}
+#else /* CONFIG_DEBUG_FS */
+static void __init mm_remote_debugfs_init(void)
+{
+}
+#endif /* CONFIG_DEBUG_FS */
+
+static int __init mm_remote_init(void)
+{
+	pdb_cache = KMEM_CACHE(page_db, SLAB_PANIC | SLAB_ACCOUNT);
+	if (!pdb_cache)
+		return -ENOMEM;
+
+	mm_remote_debugfs_init();
+
+	return 0;
+}
+device_initcall(mm_remote_init);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..352570d9ad22 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -65,6 +65,7 @@ 
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/remote_mapping.h>
 
 #include <asm/tlbflush.h>
 
@@ -856,7 +857,7 @@  int page_referenced(struct page *page,
 	if (!page_rmapping(page))
 		return 0;
 
-	if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
+	if (!is_locked && (!PageAnon(page) || PageKsm(page) || PageRemote(page))) {
 		we_locked = trylock_page(page);
 		if (!we_locked)
 			return 1;
@@ -1021,7 +1022,7 @@  void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
  * __page_set_anon_rmap - set up new anonymous rmap
  * @page:	Page or Hugepage to add to rmap
  * @vma:	VM area to add page to.
- * @address:	User virtual address of the mapping	
+ * @address:	User virtual address of the mapping
  * @exclusive:	the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
@@ -1125,7 +1126,8 @@  void do_page_add_anon_rmap(struct page *page,
 			__inc_node_page_state(page, NR_ANON_THPS);
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
-	if (unlikely(PageKsm(page)))
+
+	if (unlikely(PageKsm(page) || PageRemote(page)))
 		return;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1897,6 +1899,8 @@  void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
 {
 	if (unlikely(PageKsm(page)))
 		rmap_walk_ksm(page, rwc);
+	else if (unlikely(PageRemote(page)))
+		rmap_walk_remote(page, rwc);
 	else if (PageAnon(page))
 		rmap_walk_anon(page, rwc, false);
 	else
@@ -1906,8 +1910,9 @@  void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
 /* Like rmap_walk, but caller holds relevant rmap lock */
 void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc)
 {
-	/* no ksm support for now */
+	/* no ksm/remote support for now */
 	VM_BUG_ON_PAGE(PageKsm(page), page);
+	VM_BUG_ON_PAGE(PageRemote(page), page);
 	if (PageAnon(page))
 		rmap_walk_anon(page, rwc, true);
 	else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..63e4dfb477de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4277,7 +4277,8 @@  int page_evictable(struct page *page)
 
 	/* Prevent address_space of inode and swap cache from being freed */
 	rcu_read_lock();
-	ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
+	ret = !mapping_unevictable(page_mapping(page)) &&
+		!PageMlocked(page) && !PageRemote(page);
 	rcu_read_unlock();
 	return ret;
 }