diff mbox

[XEN,v8,24/29] tools/libs/call: linux: touch newly allocated pages after madvise lockdown

Message ID 1452864188-2417-25-git-send-email-ian.campbell@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ian Campbell Jan. 15, 2016, 1:23 p.m. UTC
This avoids a potential issue with a fork after allocation but before
madvise.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
v7: New, replacing "tools/libs/call: linux: avoid forking between mmap
    and madvise".
---
 tools/libs/call/linux.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Wei Liu Jan. 19, 2016, 1:24 p.m. UTC | #1
On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
> This avoids a potential issue with a fork after allocation but before
> madvise.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
> v7: New, replacing "tools/libs/call: linux: avoid forking between mmap
>     and madvise".
> ---
>  tools/libs/call/linux.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> index 3641e41..651f380 100644
> --- a/tools/libs/call/linux.c
> +++ b/tools/libs/call/linux.c

I didn't notice you only handled this for Linux until now.

I think FreeBSD and NetBSD need similar treatment, too? But then current
BSD* code doesn't even support DONTFORK in madvise.

Adding Roger for more input.

The changes in this patch look fine to me.

Wei.

> @@ -88,7 +88,7 @@ void *osdep_alloc_pages(xencall_handle *xcall, size_t npages)
>  {
>      size_t size = npages * PAGE_SIZE;
>      void *p;
> -    int rc, saved_errno;
> +    int rc, i, saved_errno;
>  
>      /* Address returned by mmap is page aligned. */
>      p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0);
> @@ -107,6 +107,18 @@ void *osdep_alloc_pages(xencall_handle *xcall, size_t npages)
>          goto out;
>      }
>  
> +    /*
> +     * Touch each page in turn to force them to be un-CoWed, in case a
> +     * fork happened in another thread at an inopportune moment
> +     * above. The madvise() will prevent any subsequent fork calls from
> +     * causing the same problem.
> +     */
> +    for ( i = 0; i < npages ; i++ )
> +    {
> +        char *c = (char *)p + (i*PAGE_SIZE);
> +        *c = 0;
> +    }
> +
>      return p;
>  
>  out:
> -- 
> 2.1.4
>
Ian Campbell Jan. 19, 2016, 1:40 p.m. UTC | #2
On Tue, 2016-01-19 at 13:24 +0000, Wei Liu wrote:
> On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
> > This avoids a potential issue with a fork after allocation but before
> > madvise.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > ---
> > v7: New, replacing "tools/libs/call: linux: avoid forking between mmap
> >     and madvise".
> > ---
> >  tools/libs/call/linux.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> > index 3641e41..651f380 100644
> > --- a/tools/libs/call/linux.c
> > +++ b/tools/libs/call/linux.c
> 
> I didn't notice you only handled this for Linux until now.
> 
> I think FreeBSD and NetBSD need similar treatment, too? But then current
> BSD* code doesn't even support DONTFORK in madvise.

I think any updates to the *BSD side should come as separate improvements.

If they are necessary at all, the Linux stuff comes from Linux making use
of particular behaviours on mlock()'s memory (specifically, IIRC, doing
things such as THP and NUMA balancing which can cause page faults on the
mlock'd pages despite them being locked) which are (possibly) allowed by
POSIX, but *BSD may not take the same interpretation.

> Adding Roger for more input.
> 
> The changes in this patch look fine to me.

May I take that as an Ack?
Wei Liu Jan. 19, 2016, 2:26 p.m. UTC | #3
On Tue, Jan 19, 2016 at 01:40:55PM +0000, Ian Campbell wrote:
> On Tue, 2016-01-19 at 13:24 +0000, Wei Liu wrote:
> > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
> > > This avoids a potential issue with a fork after allocation but before
> > > madvise.
> > > 
> > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > ---
> > > v7: New, replacing "tools/libs/call: linux: avoid forking between mmap
> > >     and madvise".
> > > ---
> > >  tools/libs/call/linux.c | 14 +++++++++++++-
> > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> > > index 3641e41..651f380 100644
> > > --- a/tools/libs/call/linux.c
> > > +++ b/tools/libs/call/linux.c
> > 
> > I didn't notice you only handled this for Linux until now.
> > 
> > I think FreeBSD and NetBSD need similar treatment, too? But then current
> > BSD* code doesn't even support DONTFORK in madvise.
> 
> I think any updates to the *BSD side should come as separate improvements.
> 
> If they are necessary at all, the Linux stuff comes from Linux making use
> of particular behaviours on mlock()'s memory (specifically, IIRC, doing
> things such as THP and NUMA balancing which can cause page faults on the
> mlock'd pages despite them being locked) which are (possibly) allowed by
> POSIX, but *BSD may not take the same interpretation.
> 
> > Adding Roger for more input.
> > 
> > The changes in this patch look fine to me.
> 
> May I take that as an Ack?
> 

Yes.
Roger Pau Monné Jan. 19, 2016, 2:54 p.m. UTC | #4
El 19/01/16 a les 14.24, Wei Liu ha escrit:
> On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
>> This avoids a potential issue with a fork after allocation but before
>> madvise.
>>
>> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
>> ---
>> v7: New, replacing "tools/libs/call: linux: avoid forking between mmap
>>     and madvise".
>> ---
>>  tools/libs/call/linux.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
>> index 3641e41..651f380 100644
>> --- a/tools/libs/call/linux.c
>> +++ b/tools/libs/call/linux.c
> 
> I didn't notice you only handled this for Linux until now.
> 
> I think FreeBSD and NetBSD need similar treatment, too? But then current
> BSD* code doesn't even support DONTFORK in madvise.
> 
> Adding Roger for more input.

Hm, right, thanks for noticing this. I don't think FreeBSD needs a
similar treatment (pre-faulting), because mlock will remove any CoW when
making the pages wired.

Also, AFAICT we don't need to call madvise or minherit(2) because
mlock(2) already takes care of preventing the memory region from being
copied to the child on fork:

"Locked mappings are not inherited by the child process after a
fork(2)." [0]

So I think we are safe on the FreeBSD side.

Roger.

[0] https://www.freebsd.org/cgi/man.cgi?query=mlock
Wei Liu Jan. 19, 2016, 2:58 p.m. UTC | #5
On Tue, Jan 19, 2016 at 03:54:54PM +0100, Roger Pau Monné wrote:
> El 19/01/16 a les 14.24, Wei Liu ha escrit:
> > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
> >> This avoids a potential issue with a fork after allocation but before
> >> madvise.
> >>
> >> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> >> ---
> >> v7: New, replacing "tools/libs/call: linux: avoid forking between mmap
> >>     and madvise".
> >> ---
> >>  tools/libs/call/linux.c | 14 +++++++++++++-
> >>  1 file changed, 13 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> >> index 3641e41..651f380 100644
> >> --- a/tools/libs/call/linux.c
> >> +++ b/tools/libs/call/linux.c
> > 
> > I didn't notice you only handled this for Linux until now.
> > 
> > I think FreeBSD and NetBSD need similar treatment, too? But then current
> > BSD* code doesn't even support DONTFORK in madvise.
> > 
> > Adding Roger for more input.
> 
> Hm, right, thanks for noticing this. I don't think FreeBSD needs a
> similar treatment (pre-faulting), because mlock will remove any CoW when
> making the pages wired.
> 
> Also, AFAICT we don't need to call madvise or minherit(2) because
> mlock(2) already takes care of preventing the memory region from being
> copied to the child on fork:
> 
> "Locked mappings are not inherited by the child process after a
> fork(2)." [0]
> 
> So I think we are safe on the FreeBSD side.
> 

But what if the process forks between mmap and mlock? I think that
warrants touching the area like we do for Linux here.

Wei.

> Roger.
> 
> [0] https://www.freebsd.org/cgi/man.cgi?query=mlock
>
Ian Campbell Jan. 19, 2016, 3:03 p.m. UTC | #6
On Tue, 2016-01-19 at 14:58 +0000, Wei Liu wrote:
> On Tue, Jan 19, 2016 at 03:54:54PM +0100, Roger Pau Monné wrote:
> > El 19/01/16 a les 14.24, Wei Liu ha escrit:
> > > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
> > > > This avoids a potential issue with a fork after allocation but
> > > > before
> > > > madvise.
> > > > 
> > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > > ---
> > > > v7: New, replacing "tools/libs/call: linux: avoid forking between
> > > > mmap
> > > >     and madvise".
> > > > ---
> > > >  tools/libs/call/linux.c | 14 +++++++++++++-
> > > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> > > > index 3641e41..651f380 100644
> > > > --- a/tools/libs/call/linux.c
> > > > +++ b/tools/libs/call/linux.c
> > > 
> > > I didn't notice you only handled this for Linux until now.
> > > 
> > > I think FreeBSD and NetBSD need similar treatment, too? But then
> > > current
> > > BSD* code doesn't even support DONTFORK in madvise.
> > > 
> > > Adding Roger for more input.
> > 
> > Hm, right, thanks for noticing this. I don't think FreeBSD needs a
> > similar treatment (pre-faulting), because mlock will remove any CoW
> > when
> > making the pages wired.
> > 
> > Also, AFAICT we don't need to call madvise or minherit(2) because
> > mlock(2) already takes care of preventing the memory region from being
> > copied to the child on fork:
> > 
> > "Locked mappings are not inherited by the child process after a
> > fork(2)." [0]
> > 
> > So I think we are safe on the FreeBSD side.
> > 
> 
> But what if the process forks between mmap and mlock? I think that
> warrants touching the area like we do for Linux here.

mlock guarantees the memory is populated, I think, which is equivalent to
touching it.

On Linux we use madvise not mlock, which doesn't make the same claims.

> 
Ian.
Wei Liu Jan. 19, 2016, 3:49 p.m. UTC | #7
On Tue, Jan 19, 2016 at 03:03:31PM +0000, Ian Campbell wrote:
> On Tue, 2016-01-19 at 14:58 +0000, Wei Liu wrote:
> > On Tue, Jan 19, 2016 at 03:54:54PM +0100, Roger Pau Monné wrote:
> > > El 19/01/16 a les 14.24, Wei Liu ha escrit:
> > > > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote:
> > > > > This avoids a potential issue with a fork after allocation but
> > > > > before
> > > > > madvise.
> > > > > 
> > > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > > > > ---
> > > > > v7: New, replacing "tools/libs/call: linux: avoid forking between
> > > > > mmap
> > > > >     and madvise".
> > > > > ---
> > > > >  tools/libs/call/linux.c | 14 +++++++++++++-
> > > > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> > > > > index 3641e41..651f380 100644
> > > > > --- a/tools/libs/call/linux.c
> > > > > +++ b/tools/libs/call/linux.c
> > > > 
> > > > I didn't notice you only handled this for Linux until now.
> > > > 
> > > > I think FreeBSD and NetBSD need similar treatment, too? But then
> > > > current
> > > > BSD* code doesn't even support DONTFORK in madvise.
> > > > 
> > > > Adding Roger for more input.
> > > 
> > > Hm, right, thanks for noticing this. I don't think FreeBSD needs a
> > > similar treatment (pre-faulting), because mlock will remove any CoW
> > > when
> > > making the pages wired.
> > > 
> > > Also, AFAICT we don't need to call madvise or minherit(2) because
> > > mlock(2) already takes care of preventing the memory region from being
> > > copied to the child on fork:
> > > 
> > > "Locked mappings are not inherited by the child process after a
> > > fork(2)." [0]
> > > 
> > > So I think we are safe on the FreeBSD side.
> > > 
> > 
> > But what if the process forks between mmap and mlock? I think that
> > warrants touching the area like we do for Linux here.
> 
> mlock guarantees the memory is populated, I think, which is equivalent to
> touching it.
> 
> On Linux we use madvise not mlock, which doesn't make the same claims.
> 

I see. I wonder why we didn't use mlock(2) in Linux too in the first
place.

Wei.

> > 
> Ian.
Ian Campbell Jan. 19, 2016, 3:59 p.m. UTC | #8
On Tue, 2016-01-19 at 15:49 +0000, Wei Liu wrote:

> I see. I wonder why we didn't use mlock(2) in Linux too in the first
> place.

We did, but mlock(2) on Linux doesn't guarantee that there will be no page
faults, it only guarantees that there will be no page faults which require
I/O to satisfy (~= it won't swap the page out).

Some text calls faults which require I/O "major" vs "minor" page faults
which don't need I/O to satisfy.

But Linux does things which can result in a supposedly locked PTE being not
present or not-writable, which results minor page faults. The main cause of
that is CoW due to fork (which affects both the parent and child). Other
ones could include being in the middle of moving pages between NUMA nodes
(current/source page is R/O) and THP (likewise content moving around).

BSD says (https://www.freebsd.org/cgi/man.cgi?query=mlock):

     After an mlock() system call, the indicated pages will cause neither a
     non-resident page nor address-translation fault until they	are unlocked.

which I think rules out both major and minor faults (i.e. faults of any
kind, which is what we need to pass the memory through to Xen).

Using madvise() instead of mlock() on Linux is really just a hack, in that
it gets us closer to the semantics we need, but isn't really the same.
Really we ought to provide a /dev/xen/foo from which memory can be
allocated with the right properties (much like /dev/xen/gntalloc does for
grantable memory).

Ian.
diff mbox

Patch

diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
index 3641e41..651f380 100644
--- a/tools/libs/call/linux.c
+++ b/tools/libs/call/linux.c
@@ -88,7 +88,7 @@  void *osdep_alloc_pages(xencall_handle *xcall, size_t npages)
 {
     size_t size = npages * PAGE_SIZE;
     void *p;
-    int rc, saved_errno;
+    int rc, i, saved_errno;
 
     /* Address returned by mmap is page aligned. */
     p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0);
@@ -107,6 +107,18 @@  void *osdep_alloc_pages(xencall_handle *xcall, size_t npages)
         goto out;
     }
 
+    /*
+     * Touch each page in turn to force them to be un-CoWed, in case a
+     * fork happened in another thread at an inopportune moment
+     * above. The madvise() will prevent any subsequent fork calls from
+     * causing the same problem.
+     */
+    for ( i = 0; i < npages ; i++ )
+    {
+        char *c = (char *)p + (i*PAGE_SIZE);
+        *c = 0;
+    }
+
     return p;
 
 out: