
[RFC,0/1] memfd: Support mapping to zero page on reading

Message ID: 20211222123400.1659635-1-liangpeng10@huawei.com

Message

Peng Liang Dec. 22, 2021, 12:33 p.m. UTC
Hi all,

Recently we have been working on implementing CRIU [1] for QEMU based
on Steven's work [2].  It uses memfd to allocate guest memory so that
the memory can be restored (inherited) in the new QEMU process.
However, reading from a memfd allocates a new page, whereas reading
from anonymous memory maps the zero page.  For QEMU this means that
memfd may cause all memory to be allocated during migration, because
QEMU reads every page while migrating.  That may lead to OOM when
memory over-commitment is enabled, as it usually is in public clouds.

In this patch I try to add support for mapping to the zero page when
reading a memfd.  On a read, memfd maps the zero page instead of
allocating a new page, and then COWs it when a write occurs.
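
For comparison, anonymous memory already does exactly this in its
fault path.  Below is an abridged, from-memory sketch of the
zero-page branch of do_anonymous_page() in mm/memory.c, which this
RFC wants the memfd fault path to mirror; treat it as illustrative
rather than an exact quote:

	/*
	 * Abridged sketch of the existing anonymous read-fault path:
	 * a read fault installs a special, read-only PTE pointing at
	 * the zero page, so no page is allocated or charged; a later
	 * write fault COWs it into a real page.
	 */
	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
	    !mm_forbids_zeropage(vma->vm_mm)) {
		pte_t entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
						    vma->vm_page_prot));
		/* the real code takes the pte lock before this point */
		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
		update_mmu_cache(vma, vmf->address, vmf->pte);
	}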

For now it's just a demo for discussion.  There is lots of work still
to do, e.g.:
1. THP is not supported;
2. shared reading and writing are not supported; it works only for
   inheritance.  For example:
     task1                        | task2
       1) read from addr          |
                                  |   2) write to addr
       3) read from addr again    |
   then 3) will read 0 instead of the data task2 wrote in 2), as the
   demo program below illustrates.
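
A hypothetical userspace demonstration of point 2 (MFD_ZEROPAGE is
the flag this RFC proposes; it does not exist upstream and its value
below is assumed, so this only misbehaves as described on a patched
kernel; error checking omitted for brevity):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#ifndef MFD_ZEROPAGE
	#define MFD_ZEROPAGE 0x0008U	/* assumed value, illustration only */
	#endif

	int main(void)
	{
		int fd = memfd_create("guest-mem", MFD_ZEROPAGE);
		char *addr;
		char first;

		ftruncate(fd, 4096);
		addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);

		first = addr[0];	/* 1) read: maps the zero page */
		if (fork() == 0) {
			addr[0] = 'x';	/* 2) write to addr */
			_exit(0);
		}
		wait(NULL);
		/* 3) read again: prints 0, not 'x', under the demo patch */
		printf("1) read %d, 3) read %d\n", first, addr[0]);
		return 0;
	}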

Would something similar be welcome in Linux?

Thanks,
Peng

[1] https://criu.org/Checkpoint/Restore
[2] https://patchwork.kernel.org/project/qemu-devel/cover/1628286241-217457-1-git-send-email-steven.sistare@oracle.com/

Peng Liang (1):
  memfd: Support mapping to zero page on reading memfd

 include/linux/fs.h         |  2 ++
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 |  8 ++++++--
 mm/memory.c                | 37 ++++++++++++++++++++++++++++++++++---
 mm/shmem.c                 | 10 ++++++++--
 5 files changed, 51 insertions(+), 7 deletions(-)

Comments

Hugh Dickins Jan. 12, 2022, 2:30 a.m. UTC | #1
On Wed, 22 Dec 2021, Peng Liang wrote:

> [...]
> 
> Would something similar be welcome in Linux?

David has made good suggestions on better avoiding the need for
such a change, for the use case you have in mind.

And I don't care for the particular RFC patch that you posted.

But I have to say that use of ZERO_PAGE for shmem/memfd/tmpfs read-fault
might (potentially) be very welcome.  Not as some MFD_ZEROPAGE special
case, but as how it would always work.  Deleting the shmem_recalc_inode()
cruft, which is there to correct accounting for the unmodified read-only
pages, after page reclaim has got around to freeing them later.

It does require more work than you gave it in 1/1: mainly, as you call
out above, there's a need to note in the mapping's XArray when ZERO_PAGE
has been used at an offset, and do an rmap walk to unmap those ptes when
a writable page is substituted - see __xip_unmap() in Linux 3.19's
mm/filemap_xip.c for such an rmap walk.
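
In today's tree the equivalent of that walk is wrapped up in
unmap_mapping_range(), which iterates mapping->i_mmap internally; a
minimal sketch of the substitution step (illustrative only, and the
even_cows choice itself needs the careful analysis below):

	/*
	 * Sketch: after installing the new writable page in the
	 * XArray at 'index', zap every existing pte for that offset.
	 * Stale ZERO_PAGE mappings then refault and find the real
	 * page.  Whether private mappings should be zapped too
	 * (even_cows) is part of what makes this delicate.
	 */
	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
			    PAGE_SIZE, /* even_cows */ 1);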

Though when this came up before (in the "no-fault mmap" MAP_NOSIGBUS
thread last year - which then got forgotten), Linus was wary of that
unmapping, and it was dropped for a simple MAP_PRIVATE implementation.

And I've never scoped out what is needed to protect the page from
writing in all circumstances: in principle, it ought to be easy by
giving shmem_vm_ops a page_mkwrite; but that's likely to come with
a performance penalty, which may not be justified for this case.
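
For concreteness, the hook would be shaped roughly like this (an
illustrative sketch, not a worked-out implementation):

	/*
	 * Sketch of "giving shmem_vm_ops a page_mkwrite": every first
	 * write to a clean shared page then takes a minor fault, which
	 * is where a real page could be substituted for ZERO_PAGE --
	 * and also where the performance penalty comes from.
	 */
	static vm_fault_t shmem_page_mkwrite(struct vm_fault *vmf)
	{
		lock_page(vmf->page);
		/*
		 * ... allocate a real page, update the mapping's
		 * XArray, unmap any remaining zero-page ptes ...
		 */
		return VM_FAULT_LOCKED;	/* page is returned locked */
	}

	/* wired up as .page_mkwrite = shmem_page_mkwrite in shmem_vm_ops */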

I didn't check what you did for write protection: maybe what you
did was enough, but one has to be very careful about that.

Making this change to ZERO_PAGE has never quite justified the effort
so far: temporarily allocated pages have worked well enough in most
circumstances.

Hugh

Yang Shi Jan. 12, 2022, 3:33 a.m. UTC | #2
On Tue, Jan 11, 2022 at 6:30 PM Hugh Dickins <hughd@google.com> wrote:
>
> [...]
>
> But I have to say that use of ZERO_PAGE for shmem/memfd/tmpfs read-fault
> might (potentially) be very welcome.  Not as some MFD_ZEROPAGE special
> case, but as how it would always work.  Deleting the shmem_recalc_inode()
> cruft, which is there to correct accounting for the unmodified read-only
> pages, after page reclaim has got around to freeing them later.

I'm wondering if we could use ZERO_PAGE for shmem_getpage() too so
that we have a simpler return value?  Currently shmem_getpage()
returns:
#1. errno and NULL *pagep
#2. 0 and valid *pagep
#3. 0 and NULL *pagep if SGP_READ

Using ZERO_PAGE should be able to consolidate #2 and #3 so that we
know there must be a valid *pagep if 0 is returned.  This should make
read-fault use ZERO_PAGE automatically.
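
Caller-side, the consolidation would look something like this sketch
(assuming shmem_getpage() keeps its current signature):

	/* Today, a caller passing SGP_READ must special-case #3: */
	error = shmem_getpage(inode, index, &page, SGP_READ);
	if (error)
		return error;		/* case #1 */
	if (!page)			/* case #3: hole under SGP_READ */
		page = ZERO_PAGE(0);	/* caller supplies zeros itself */

	/*
	 * With the consolidation, a hole would come back as ZERO_PAGE
	 * from shmem_getpage() itself, so 0 would always imply a valid
	 * *pagep and the NULL check above disappears.
	 */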

Matthew Wilcox Jan. 12, 2022, 4:32 a.m. UTC | #3
On Tue, Jan 11, 2022 at 06:30:31PM -0800, Hugh Dickins wrote:
> But I have to say that use of ZERO_PAGE for shmem/memfd/tmpfs read-fault
> might (potentially) be very welcome.  Not as some MFD_ZEROPAGE special
> case, but as how it would always work.  Deleting the shmem_recalc_inode()
> cruft, which is there to correct accounting for the unmodified read-only
> pages, after page reclaim has got around to freeing them later.
> 
> It does require more work than you gave it in 1/1: mainly, as you call
> out above, there's a need to note in the mapping's XArray when ZERO_PAGE
> has been used at an offset, and do an rmap walk to unmap those ptes when
> a writable page is substituted - see __xip_unmap() in Linux 3.19's
> mm/filemap_xip.c for such an rmap walk.

I think putting a pointer to the zero page in the XArray would introduce
some unwelcome complexity, but the XArray has a special XA_ZERO_ENTRY
which might be usable for such a thing.  It would need some careful
analysis and testing, of course, but it might also let us remove
the special cases in the DAX code for DAX_ZERO_PAGE.
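
As a sketch of what that might look like (using the XArray advanced
API, since the normal API converts zero entries to NULL on load;
allocation/retry handling omitted, and whether this coexists safely
with shadow entries is part of the analysis needed):

	/*
	 * Sketch: record "this offset is backed by ZERO_PAGE" in the
	 * mapping's XArray as the special XA_ZERO_ENTRY.
	 */
	XA_STATE(xas, &mapping->i_pages, index);

	xas_lock_irq(&xas);
	if (!xas_load(&xas))			/* a hole, nothing stored */
		xas_store(&xas, XA_ZERO_ENTRY);	/* mark it zero-mapped */
	xas_unlock_irq(&xas);

	/*
	 * Later, when a writable page is substituted, finding
	 * XA_ZERO_ENTRY here is what signals that an rmap walk is
	 * needed to unmap the ZERO_PAGE ptes at this offset.
	 */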

I agree with you that temporarily allocating pages has worked "well
enough", but maybe some workloads would benefit; even for files on block
device filesystems, reading a hole and never writing to it may be common
enough that this is an optimisation we've been missing for many years.
Hugh Dickins Jan. 12, 2022, 5:02 a.m. UTC | #4
On Tue, 11 Jan 2022, Yang Shi wrote:
> On Tue, Jan 11, 2022 at 6:30 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > But I have to say that use of ZERO_PAGE for shmem/memfd/tmpfs read-fault
> > might (potentially) be very welcome.  Not as some MFD_ZEROPAGE special
> > case, but as how it would always work.  Deleting the shmem_recalc_inode()
> > cruft, which is there to correct accounting for the unmodified read-only
> > pages, after page reclaim has got around to freeing them later.
> 
> I'm wondering if we could use ZERO_PAGE for shmem_getpage() too so
> that we have a simpler return value? Currently shmem_getpage() returns:
> #1. errno and NULL *pagep
> #2. 0 and valid *pagep
> #3. 0 and NULL *pagep if SGP_READ
> 
> Using ZERO_PAGE should be able to consolidate #2 and #3 so that we
> know there must be a valid *pagep if 0 is returned.

At an earlier stage of mm/shmem.c's life, shmem_getpage() did return
ZERO_PAGE rather than NULL for that case; but I found it works out
better the way it is now (though I'm not a fan of *pagep generally).

So I've no zest for messing with that now - though it's possible that
if we did extend the use of ZERO_PAGE, I'd look again and decide that
your change is then the best.

One reason for NULL rather than ZERO_PAGE was actually to help avoid
the cache-dirtying get_page/put_page on the thing; but that appears
to have gone missing at some point.  I have a patch in my tree which
fixes that, but held back from sending it in because it also used
iov_iter_zero() instead of copy_page_to_iter().  Obviously a good
improvement... except that whenever I timed it I found the opposite.
So, set aside on a shelf, to look into some other time.
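
For reference, the substitution in question is roughly this, in the
hole branch of shmem's read path (abridged sketch):

	if (page) {
		ret = copy_page_to_iter(page, offset, nr, to);
		put_page(page);
	} else {
		/* hole: the shelved patch used this instead of the
		 * cache-dirtying ZERO_PAGE copy -- the "obviously good
		 * improvement" that timed worse */
		ret = iov_iter_zero(nr, to);
	}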

> This should make read-fault use ZERO_PAGE automatically.

But I don't want to make read-fault use ZERO_PAGE automatically:
I want to be rather careful about that change!

Hugh