Message ID | 20130321061838.GA28319@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > core/umem.c seems to get the arguments to get_user_pages > in the reverse order: it sets writeable flag and > breaks COW for MAP_SHARED if and only if hardware needs to > write the page. > > This breaks memory overcommit for users such as KVM: > each time we try to register a page to send it to remote, this > breaks COW. It seems that for applications that only have > REMOTE_READ permission, there is no reason to break COW at all. I proposed a similar (but not exactly the same, see below) patch a while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread, especially https://lkml.org/lkml/2012/2/6/265 I think this change will break the case where userspace tries to register an MR with read-only permission, but intends locally through the CPU to write to the memory. If the memory registration is done while the memory is mapped read-only but has VM_MAYWRITE, then userspace gets into trouble when COW happens. In the case you're describing (although I'm not sure where in KVM we're talking about using RDMA), what happens if you register memory with only REMOTE_READ and then COW is triggered because of a local write? (I'm assuming you don't want remote access to continue to get the old contents of the page) I have to confess that I still haven't had a chance to implement the proposed FOLL_FOLLOW solution to all of this. > If the page that is COW has lots of copies, this makes the user process > quickly exceed the cgroups memory limit. This makes RDMA mostly useless > for virtualization, thus the stable tag. The actual problem description here is a bit too terse for me to understand. How do we end up with lots of copies of a COW page? Why is RDMA registering the memory any more special than having everyone who maps this page actually writing to it and triggering COW? > ret = get_user_pages(current, current->mm, cur_base, > min_t(unsigned long, npages, > PAGE_SIZE / sizeof (struct page *)), > - 1, !umem->writable, page_list, vma_list); > + !umem->writable, 1, page_list, vma_list); The first two parameters in this line being changed are "write" and "force". I think if we do change this, then we need to pass umem->writable (as opposed to !umem->writable) for the "write" parameter. Not sure whether "force" makes sense or not. - R. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote: > On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > core/umem.c seems to get the arguments to get_user_pages > > in the reverse order: it sets writeable flag and > > breaks COW for MAP_SHARED if and only if hardware needs to > > write the page. > > > > This breaks memory overcommit for users such as KVM: > > each time we try to register a page to send it to remote, this > > breaks COW. It seems that for applications that only have > > REMOTE_READ permission, there is no reason to break COW at all. > > I proposed a similar (but not exactly the same, see below) patch a > while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread, > especially https://lkml.org/lkml/2012/2/6/265 > > I think this change will break the case where userspace tries to > register an MR with read-only permission, but intends locally through > the CPU to write to the memory. Shouldn't it set LOCAL_WRITE then? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>> I think this change will break the case where userspace tries to >> register an MR with read-only permission, but intends locally through >> the CPU to write to the memory. > Shouldn't it set LOCAL_WRITE then? We're talking about the permissions for the register MR operation, right? (That's what the kernel RDMA driver code that does get_user_pages() sees) In that case, no, I don't see any reason for LOCAL_WRITE, since the only RDMA operations that will access this memory are remote reads. The writing (that triggers COW) is coming from normal process access triggering a page fault, etc. This is a pretty standard way of using RDMA... For example, I allocate some memory and register it for RDMA read (and pass the R_Key to the remote system) with only REMOTE_READ permission. Then I fill in the memory with the results of some computation and the remote system does an RDMA read to get those results. - R. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 12:15:33AM -0700, Roland Dreier wrote: > >> I think this change will break the case where userspace tries to > >> register an MR with read-only permission, but intends locally through > >> the CPU to write to the memory. > > > Shouldn't it set LOCAL_WRITE then? > > We're talking about the permissions for the register MR operation, > right? (That's what the kernel RDMA driver code that does > get_user_pages() sees) > > In that case, no, I don't see any reason for LOCAL_WRITE, since the > only RDMA operations that will access this memory are remote reads. What is the meaning of LOCAL_WRITE then? There are no local RDMA writes as far as I can see. > The writing (that triggers COW) is coming from normal process access > triggering a page fault, etc. This is a pretty standard way of using > RDMA... For example, I allocate some memory and register it for RDMA > read (and pass the R_Key to the remote system) with only REMOTE_READ > permission. Then I fill in the memory with the results of some > computation and the remote system does an RDMA read to get those > results. > > - R. OK then what we need is a new flag saying "I really do not intend to write into this memory please do not break COW or do anything else just in case I do".
On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote: >> In that case, no, I don't see any reason for LOCAL_WRITE, since the >> only RDMA operations that will access this memory are remote reads. > > What is the meaning of LOCAL_WRITE then? There are no local > RDMA writes as far as I can see. Umm, it means you're giving the local adapter permission to write to that memory. So you can use it as a receive buffer or as the target for remote data from an RDMA read operation. > OK then what we need is a new flag saying "I really do not > intend to write into this memory please do not break > COW or do anything else just in case I do". Isn't that a shared read-only mapping? - R. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote: > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the > >> only RDMA operations that will access this memory are remote reads. > > > > What is the meaning of LOCAL_WRITE then? There are no local > > RDMA writes as far as I can see. > > Umm, it means you're giving the local adapter permission to write to > that memory. So you can use it as a receive buffer or as the target > for remote data from an RDMA read operation. Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ. I don't see why do you need to give adapter permission to use memory as a receive buffer given that you must pass the address in the post receive verb, but maybe that's what the IB spec says so that's what the applications assumed? > > OK then what we need is a new flag saying "I really do not > > intend to write into this memory please do not break > > COW or do anything else just in case I do". > > Isn't that a shared read-only mapping? > > - R. Nothing to do with how the page is mapped. We can and do write the page before registration. BTW umem.c passes in force so it breaks COW even for read-only mappings, right?
Yes, I'd be happy to try the patch. Got meetings all day...... but will dive in soon. On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote: > core/umem.c seems to get the arguments to get_user_pages > in the reverse order: it sets writeable flag and > breaks COW for MAP_SHARED if and only if hardware needs to > write the page. > > This breaks memory overcommit for users such as KVM: > each time we try to register a page to send it to remote, this > breaks COW. It seems that for applications that only have > REMOTE_READ permission, there is no reason to break COW at all. > > If the page that is COW has lots of copies, this makes the user process > quickly exceed the cgroups memory limit. This makes RDMA mostly useless > for virtualization, thus the stable tag. > > Reported-by: "Michael R. Hines" <mrhines@linux.vnet.ibm.com> > Cc: stable@vger.kernel.org > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > --- > > Note: compile-tested only, I don't have RDMA hardware at the moment. > Michael, could you please try this patch (also fixing your > usespace code not to request write access) and report? > > Note2: grep for get_user_pages in infiniband drivers turns up > lots of users who set write to 1 unconditionally. > These might be bugs too, should be checked. > > drivers/infiniband/core/umem.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c > index a841123..5929598 100644 > --- a/drivers/infiniband/core/umem.c > +++ b/drivers/infiniband/core/umem.c > @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, > ret = get_user_pages(current, current->mm, cur_base, > min_t(unsigned long, npages, > PAGE_SIZE / sizeof (struct page *)), > - 1, !umem->writable, page_list, vma_list); > + !umem->writable, 1, page_list, vma_list); > > if (ret < 0) > goto out; -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 08:23:48AM -0400, Michael R. Hines wrote: > Yes, I'd be happy to try the patch. > > Got meetings all day...... but will dive in soon. The patch is unlikely to be the final version. In particular you need to change !umem->writable to umem->writable. > On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote: > >core/umem.c seems to get the arguments to get_user_pages > >in the reverse order: it sets writeable flag and > >breaks COW for MAP_SHARED if and only if hardware needs to > >write the page. > > > >This breaks memory overcommit for users such as KVM: > >each time we try to register a page to send it to remote, this > >breaks COW. It seems that for applications that only have > >REMOTE_READ permission, there is no reason to break COW at all. > > > >If the page that is COW has lots of copies, this makes the user process > >quickly exceed the cgroups memory limit. This makes RDMA mostly useless > >for virtualization, thus the stable tag. > > > >Reported-by: "Michael R. Hines" <mrhines@linux.vnet.ibm.com> > >Cc: stable@vger.kernel.org > >Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > >--- > > > >Note: compile-tested only, I don't have RDMA hardware at the moment. > >Michael, could you please try this patch (also fixing your > >usespace code not to request write access) and report? > > > >Note2: grep for get_user_pages in infiniband drivers turns up > >lots of users who set write to 1 unconditionally. > >These might be bugs too, should be checked. > > > > drivers/infiniband/core/umem.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > >diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c > >index a841123..5929598 100644 > >--- a/drivers/infiniband/core/umem.c > >+++ b/drivers/infiniband/core/umem.c > >@@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, > > ret = get_user_pages(current, current->mm, cur_base, > > min_t(unsigned long, npages, > > PAGE_SIZE / sizeof (struct page *)), > >- 1, !umem->writable, page_list, vma_list); > >+ !umem->writable, 1, page_list, vma_list); > > > > if (ret < 0) > > goto out; -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote: > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote: > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the > > >> only RDMA operations that will access this memory are remote reads. > > > > > > What is the meaning of LOCAL_WRITE then? There are no local > > > RDMA writes as far as I can see. > > > > Umm, it means you're giving the local adapter permission to write to > > that memory. So you can use it as a receive buffer or as the target > > for remote data from an RDMA read operation. > > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ. > I don't see why do you need to give adapter permission The access flags have to do with what independent access remote nodes get. There are four major cases: access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes read the memory. access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under the direct control of the application, can read this memory. Remote nodes are barred. access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor will let remote nodes write the memory access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that memory. Only the adaptor, under the direct control of the application, can write the memory. The fact LOCAL_READ/REMOTE_READ exists makes it possible to do what you want - it guarentees the adaptor will never write to this memory under any circumstances, so you can leave the page COW'd. If LOCAL_WRITE was implied then you'd have to COW everything.. Would it be better to drive the COW break decision off the region's MM flags? Ie if the memory is mapped read only into the process then you can keep the COW at the RDMA layer, otherwise you should break it. That seems more natural than a new flag? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote: > On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote: > > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote: > > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the > > > >> only RDMA operations that will access this memory are remote reads. > > > > > > > > What is the meaning of LOCAL_WRITE then? There are no local > > > > RDMA writes as far as I can see. > > > > > > Umm, it means you're giving the local adapter permission to write to > > > that memory. So you can use it as a receive buffer or as the target > > > for remote data from an RDMA read operation. > > > > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ. > > I don't see why do you need to give adapter permission > > The access flags have to do with what independent access remote nodes > get. There are four major cases: > > access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes > read the memory. > > access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under > the direct control of the application, can read this memory. Remote > nodes are barred. > > access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor > will let remote nodes write the memory > > access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that > memory. Only the adaptor, under the direct control of the application, > can write the memory. > > The fact LOCAL_READ/REMOTE_READ exists makes it possible to do what > you want - it guarentees the adaptor will never write to this memory > under any circumstances, so you can leave the page COW'd. If > LOCAL_WRITE was implied then you'd have to COW everything.. > > Would it be better to drive the COW break decision off the region's MM > flags? Ie if the memory is mapped read only into the process then you > can keep the COW at the RDMA layer, otherwise you should break > it. That seems more natural than a new flag? > > Jason No because application does this: init page ... after a lot of time .. register send unregister so it can not be read only.
On Thu, Mar 21, 2013 at 07:15:25PM +0200, Michael S. Tsirkin wrote: > No because application does this: > init page > > ... > > after a lot of time > > .. > > register > send > unregister > > so it can not be read only. mprotect(READONLY) register send unregister mprotect(WRITABLE) ? With something like GIFT the app already has to give up writing to the pages while they are moving, so changing the protection seems in line with that? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 11:21:50AM -0600, Jason Gunthorpe wrote: > On Thu, Mar 21, 2013 at 07:15:25PM +0200, Michael S. Tsirkin wrote: > > > No because application does this: > > init page > > > > ... > > > > after a lot of time > > > > .. > > > > register > > send > > unregister > > > > so it can not be read only. > > mprotect(READONLY) > register > send > unregister > mprotect(WRITABLE) > > ? > With something like GIFT the app already has to give up writing to the > pages while they are moving, so changing the protection seems in line > with that? > > Jason It doesn't actually, and our app would sometimes write to these pages. It simply does not care which version does the remote get in this case since we track writes and resend later. Also this is per-page, MRs have byte granularity so easier to use.
On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote: > It doesn't actually, and our app would sometimes write to these pages. > It simply does not care which version does the remote get in this case > since we track writes and resend later. Heh, somehow I thought you might say that :) A new flag seems like the only way then - maybe: IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share a coherent view of registered memory. Memory writes from the CPU after ibv_reg_mr completes may not be reflected in the memory viewed by the adaptor. Can only be combined with read only access permissions. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 11:57:32AM -0600, Jason Gunthorpe wrote: > On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote: > > > It doesn't actually, and our app would sometimes write to these pages. > > It simply does not care which version does the remote get in this case > > since we track writes and resend later. > > Heh, somehow I thought you might say that :) > > A new flag seems like the only way then - maybe: > IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share > a coherent view of registered memory. Memory writes from the CPU > after ibv_reg_mr completes may not be reflected in the memory > viewed by the adaptor. > > Can only be combined with read only access permissions. > > Jason I kind of like _GIFT for name, gifts are nice :) But yes that's exactly the semantics we need.
On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote: > On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote: > > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote: > > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the > > > >> only RDMA operations that will access this memory are remote reads. > > > > > > > > What is the meaning of LOCAL_WRITE then? There are no local > > > > RDMA writes as far as I can see. > > > > > > Umm, it means you're giving the local adapter permission to write to > > > that memory. So you can use it as a receive buffer or as the target > > > for remote data from an RDMA read operation. > > > > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ. > > I don't see why do you need to give adapter permission > > The access flags have to do with what independent access remote nodes > get. There are four major cases: > > access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes > read the memory. > > access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under > the direct control of the application, can read this memory. Remote > nodes are barred. > > access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor > will let remote nodes write the memory > > access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that > memory. Only the adaptor, under the direct control of the application, > can write the memory. This is the one I find redundant. Since the write will be done by the adaptor under direct control by the application, why does it make sense to declare this beforehand? If you don't want to allow local write access to memory, just do not post any receive WRs with this address. If you posted and regret it, reset the QP to cancel. But IB spec specified LOCAL_WRITE in this redundant way so I guess applications expect it to have the semantics defined there, I just didn't remember what they are. No way around it then, need another flag. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote: > This is the one I find redundant. Since the write will be done by > the adaptor under direct control by the application, why does it > make sense to declare this beforehand? If you don't want to allow > local write access to memory, just do not post any receive WRs with > this address. If you posted and regret it, reset the QP to cancel. This is to support your COW scenario - the app declares before hand to the kernel that it will write to the memory and the kernel ensures pages are dedicated to the app at registration time. Or the app says it will only read and the kernel could leave them shared. The adaptor enforces the access control to prevent a naughty app from writing to shared memory - think about mmap'ing libc.so and then using RDMA to write to the shared pages. It is necessary to ensure that is impossible. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote: > On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote: > > > This is the one I find redundant. Since the write will be done by > > the adaptor under direct control by the application, why does it > > make sense to declare this beforehand? If you don't want to allow > > local write access to memory, just do not post any receive WRs with > > this address. If you posted and regret it, reset the QP to cancel. > > This is to support your COW scenario - the app declares before hand to > the kernel that it will write to the memory and the kernel ensures > pages are dedicated to the app at registration time. Or the app says > it will only read and the kernel could leave them shared. Someone here is confused. LOCAL_WRITE/absence of it does not address COW, it breaks COW anyway. Are you now saying we should change rdma so without LOCAL_WRITE it will not break COW? > The adaptor enforces the access control to prevent a naughty app from > writing to shared memory - think about mmap'ing libc.so and then using > RDMA to write to the shared pages. It is necessary to ensure that is > impossible. > > Jason That's why it's redundant: we can't trust an application to tell us 'this page is writeable', we must get this info from kernel. And so there's apparently no need for application to tell adaptor about LOCAL_WRITE.
On Thu, Mar 21, 2013 at 09:15:41PM +0200, Michael S. Tsirkin wrote: > On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote: > > On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote: > > > > > This is the one I find redundant. Since the write will be done by > > > the adaptor under direct control by the application, why does it > > > make sense to declare this beforehand? If you don't want to allow > > > local write access to memory, just do not post any receive WRs with > > > this address. If you posted and regret it, reset the QP to cancel. > > > > This is to support your COW scenario - the app declares before hand to > > the kernel that it will write to the memory and the kernel ensures > > pages are dedicated to the app at registration time. Or the app says > > it will only read and the kernel could leave them shared. > > Someone here is confused. LOCAL_WRITE/absence of it does not address > COW, it breaks COW anyway. Are you now saying we should change rdma so > without LOCAL_WRITE it will not break COW? I am talking about 'from a spec' perspective - not what Linux does today. The absence of LOCAL_WRITE is part of the specification to support shared pages. Pages can only be kept shared if all the ACCESS WRITE bits are clear - today Linux always breaks the COW, but if you patch in the ability to keep things shared then it must only happen when *all* the ACCESS WRITE bits are clear. > > The adaptor enforces the access control to prevent a naughty app from > > writing to shared memory - think about mmap'ing libc.so and then using > > RDMA to write to the shared pages. It is necessary to ensure that is > > impossible. > That's why it's redundant: we can't trust an application to tell us > 'this page is writeable', we must get this info from kernel. And so > there's apparently no need for application to tell adaptor about > LOCAL_WRITE. The API design gives user space maximum flexibility, if it wants to create an enforced no-write MR in otherwise writable pages by skipping LOCAL_WRITE then it can do so. The kernel's role in this should be to deny ibv_reg_mr with WRITE bits set if the pages are not writable by the app - I don't know if it does this today, it isn't critically important as long as the pages are unshared. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index a841123..5929598 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, ret = get_user_pages(current, current->mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), - 1, !umem->writable, page_list, vma_list); + !umem->writable, 1, page_list, vma_list); if (ret < 0) goto out;
core/umem.c seems to get the arguments to get_user_pages in the reverse order: it sets writeable flag and breaks COW for MAP_SHARED if and only if hardware needs to write the page. This breaks memory overcommit for users such as KVM: each time we try to register a page to send it to remote, this breaks COW. It seems that for applications that only have REMOTE_READ permission, there is no reason to break COW at all. If the page that is COW has lots of copies, this makes the user process quickly exceed the cgroups memory limit. This makes RDMA mostly useless for virtualization, thus the stable tag. Reported-by: "Michael R. Hines" <mrhines@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> --- Note: compile-tested only, I don't have RDMA hardware at the moment. Michael, could you please try this patch (also fixing your usespace code not to request write access) and report? Note2: grep for get_user_pages in infiniband drivers turns up lots of users who set write to 1 unconditionally. These might be bugs too, should be checked. drivers/infiniband/core/umem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)