
CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access

Message ID AM3PR05MB0935AABF569F15EA846B8E72DC000@AM3PR05MB0935.eurprd05.prod.outlook.com (mailing list archive)
State Rejected

Commit Message

Shachar Raindel March 18, 2015, 5:39 p.m. UTC
Hi,

It was found that the Linux kernel's InfiniBand/RDMA subsystem did not properly sanitize input parameters while registering memory regions from user space via the (u)verbs API. A local user with access to a /dev/infiniband/uverbsX device could use this flaw to crash the system or, potentially, escalate their privileges on the system.

The issue has been assigned CVE-2014-8159.

The issue exists in the InfiniBand/RDMA/iWARP drivers since Linux Kernel version 2.6.13.

Mellanox OFED 2.4-1.0.4 fixes the issue. Available from:
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers 

RedHat errata: https://access.redhat.com/security/cve/CVE-2014-8159
Canonical errata: http://people.canonical.com/~ubuntu-security/cve/2014/CVE-2014-8159.html
Novell (Suse) bug tracking: https://bugzilla.novell.com/show_bug.cgi?id=914742


The following patch fixes the issue:

--------------- 8< ------------------------------

From d4d68430d4a12c569e28b4f4468284ea22111186 Mon Sep 17 00:00:00 2001
From: Shachar Raindel <raindel@mellanox.com>
Date: Sun, 04 Jan 2015 18:30:32 +0200
Subject: [PATCH] IB/core: Prevent integer overflow in ib_umem_get address arithmetic

Properly verify that the resulting page aligned end address is larger
than both the start address and the length of the memory area
requested.

Both the start and length arguments for ib_umem_get are controlled by
the user. A misbehaving user can provide values which will cause an
integer overflow when calculating the page aligned end address.

This overflow can also cause miscalculation of the number of pages
mapped, and additional logic issues.

Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index aec7a6a..8c014b5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -99,6 +99,14 @@
 	if (dmasync)
 		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
 
+	/*
+	 * If the combination of the addr and size requested for this memory
+	 * region causes an integer overflow, return error.
+	 */
+	if ((PAGE_ALIGN(addr + size) <= size) ||
+	    (PAGE_ALIGN(addr + size) <= addr))
+		return ERR_PTR(-EINVAL);
+
 	if (!can_do_mlock())
 		return ERR_PTR(-EPERM);

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Roland Dreier April 1, 2015, 5:28 p.m. UTC | #1
On Wed, Mar 18, 2015 at 10:39 AM, Shachar Raindel <raindel@mellanox.com> wrote:
> Date: Sun, 04 Jan 2015 18:30:32 +0200
> Subject: [PATCH] IB/core: Prevent integer overflow in ib_umem_get address arithmetic

Just so we're clear, this bug has been known since January 4, and it's
getting sent upstream now?

I assume we want it in 4.0 and -stable?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shachar Raindel April 2, 2015, 7:52 a.m. UTC | #2
> -----Original Message-----
> From: Roland Dreier [mailto:roland@purestorage.com]
> Sent: Wednesday, April 01, 2015 8:28 PM
> To: Shachar Raindel
> Cc: oss-security@lists.openwall.com; <linux-rdma@vger.kernel.org>
> (linux-rdma@vger.kernel.org); stable@vger.kernel.org
> Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected
> physical memory access
>
> On Wed, Mar 18, 2015 at 10:39 AM, Shachar Raindel <raindel@mellanox.com>
> wrote:
> > Date: Sun, 04 Jan 2015 18:30:32 +0200
> > Subject: [PATCH] IB/core: Prevent integer overflow in ib_umem_get
> address arithmetic
>
> Just so we're clear, this bug has been known since January 4, and it's
> getting sent upstream now?
>

This bug has been known since January, yes. To reduce the window of
vulnerability, we privately reported it to major distributions first,
aligning everyone on the same release date. Such synchronization takes
time. Given that the bug was not, to the best of our knowledge,
exploited in the wild, we believed that 3 months of embargo is not
exaggerated. This is a common practice in the security industry, called
"responsible disclosure."

Following the kernel  security bugs policy [1], we reported it to
the kernel security contacts few days before making the issue public.
Few days after issue became public, we published a clear report to all
of the relevant mailing lists.

> I assume we want it in 4.0 and -stable?

Yes. Should be applied ASAP to all kernels that contain uverbs.
We put some effort to make sure the patch will apply cleanly on all
historical versions of the kernel, so backporting should be simple.

Thanks,
--Shachar

[1] https://www.kernel.org/doc/Documentation/SecurityBugs
Yann Droneaud April 2, 2015, 10:04 a.m. UTC | #3
Hi,

Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
> Hi,
> 
> It was found that the Linux kernel's InfiniBand/RDMA subsystem did not
> properly sanitize input parameters while registering memory regions
> from user space via the (u)verbs API. A local user with access to
> a /dev/infiniband/uverbsX device could use this flaw to crash the
> system or, potentially, escalate their privileges on the system.
> 
> The issue has been assigned CVE-2014-8159.
> 
> The issue exists in the InfiniBand/RDMA/iWARP drivers since Linux
> Kernel version 2.6.13.
> 
> Mellanox OFED 2.4-1.0.4 fixes the issue. Available from:
> http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers 
> 
> RedHat errata: https://access.redhat.com/security/cve/CVE-2014-8159
> Canonical errata: http://people.canonical.com/~ubuntu-security/cve/2014/CVE-2014-8159.html
> Novell (Suse) bug tracking: https://bugzilla.novell.com/show_bug.cgi?id=914742
> 
> 
> The following patch fixes the issue:
> 
> --------------- 8< ------------------------------
> 
> From d4d68430d4a12c569e28b4f4468284ea22111186 Mon Sep 17 00:00:00 2001
> From: Shachar Raindel <raindel@mellanox.com>
> Date: Sun, 04 Jan 2015 18:30:32 +0200
> Subject: [PATCH] IB/core: Prevent integer overflow in ib_umem_get address arithmetic
> 
> Properly verify that the resulting page aligned end address is larger
> than both the start address and the length of the memory area
> requested.
> 
> Both the start and length arguments for ib_umem_get are controlled by
> the user. A misbehaving user can provide values which will cause an
> integer overflow when calculating the page aligned end address.
> 
> This overflow can cause also miscalculation of the number of pages
> mapped, and additional logic issues.
> 
> Signed-off-by: Shachar Raindel <raindel@mellanox.com>
> Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index aec7a6a..8c014b5 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -99,6 +99,14 @@
>  	if (dmasync)
>  		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
>  
> +	/*
> +	 * If the combination of the addr and size requested for this memory
> +	 * region causes an integer overflow, return error.
> +	 */
> +	if ((PAGE_ALIGN(addr + size) <= size) ||
> +	    (PAGE_ALIGN(addr + size) <= addr))
> +		return ERR_PTR(-EINVAL);
> +

Can access_ok() be used here ?

         if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
                        addr, size))
                  return ERR_PTR(-EINVAL);


>  	if (!can_do_mlock())
>  		return ERR_PTR(-EPERM);
> 

Regards.
Shachar Raindel April 2, 2015, 10:52 a.m. UTC | #4
Hi,

> -----Original Message-----
> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> Sent: Thursday, April 02, 2015 1:05 PM
> To: Shachar Raindel
> Cc: oss-security@lists.openwall.com; <linux-rdma@vger.kernel.org>
> (linux-rdma@vger.kernel.org); linux-kernel@vger.kernel.org;
> stable@vger.kernel.org
> Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected
> physical memory access
>
> Hi,
>
> Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
> > Hi,
> >

<snipped long e-mail>

> > +	/*
> > +	 * If the combination of the addr and size requested for this
> memory
> > +	 * region causes an integer overflow, return error.
> > +	 */
> > +	if ((PAGE_ALIGN(addr + size) <= size) ||
> > +	    (PAGE_ALIGN(addr + size) <= addr))
> > +		return ERR_PTR(-EINVAL);
> > +
>
> Can access_ok() be used here ?
>
>          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
>                         addr, size))
>                   return ERR_PTR(-EINVAL);
>

No, this will break the current ODP semantics.

ODP allows the user to register memory that is not accessible yet.
This is a critical design feature, as it allows avoiding holding
a registration cache. Adding this check will break the behavior,
forcing memory to be all accessible when registering an ODP MR.

Thanks,
--Shachar
Yann Droneaud April 2, 2015, 1:30 p.m. UTC | #5
Hi,

Le jeudi 02 avril 2015 à 10:52 +0000, Shachar Raindel a écrit :
> > -----Original Message-----
> > From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> > Sent: Thursday, April 02, 2015 1:05 PM
> > Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :

> > > +	/*
> > > +	 * If the combination of the addr and size requested for this
> > memory
> > > +	 * region causes an integer overflow, return error.
> > > +	 */
> > > +	if ((PAGE_ALIGN(addr + size) <= size) ||
> > > +	    (PAGE_ALIGN(addr + size) <= addr))
> > > +		return ERR_PTR(-EINVAL);
> > > +
> > 
> > Can access_ok() be used here ?
> > 
> >          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
> >                         addr, size))
> >                   return ERR_PTR(-EINVAL);
> > 
> 
> No, this will break the current ODP semantics.
> 
> ODP allows the user to register memory that is not accessible yet.
> This is a critical design feature, as it allows avoiding holding
> a registration cache. Adding this check will break the behavior,
> forcing memory to be all accessible when registering an ODP MR.
> 

Where's the check for the range being in userspace memory space,
especially for the ODP case?

For the non-ODP case (e.g. plain old behavior), does get_user_pages()
ensure the requested pages fit in the userspace region on all
architectures? I think so.

In the ODP case, I'm not sure such a check is ever done?
(Aside, does it take special measures to protect shared mappings from
being read and/or *written*?)

Regards.
Haggai Eran April 2, 2015, 3:18 p.m. UTC | #6
On 02/04/2015 16:30, Yann Droneaud wrote:
> Hi,
> 
> Le jeudi 02 avril 2015 à 10:52 +0000, Shachar Raindel a écrit :
>>> -----Original Message-----
>>> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
>>> Sent: Thursday, April 02, 2015 1:05 PM
>>> Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
> 
>>>> +	/*
>>>> +	 * If the combination of the addr and size requested for this
>>> memory
>>>> +	 * region causes an integer overflow, return error.
>>>> +	 */
>>>> +	if ((PAGE_ALIGN(addr + size) <= size) ||
>>>> +	    (PAGE_ALIGN(addr + size) <= addr))
>>>> +		return ERR_PTR(-EINVAL);
>>>> +
>>>
>>> Can access_ok() be used here ?
>>>
>>>          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
>>>                         addr, size))
>>>                   return ERR_PTR(-EINVAL);
>>>
>>
>> No, this will break the current ODP semantics.
>>
>> ODP allows the user to register memory that is not accessible yet.
>> This is a critical design feature, as it allows avoiding holding
>> a registration cache. Adding this check will break the behavior,
>> forcing memory to be all accessible when registering an ODP MR.
>>
> 
> Where's the check for the range being in userspace memory space,
> especially for the ODP case ?
> 
> For non ODP case (eg. plain old behavior), does get_user_pages()
> ensure the requested pages fit in userspace region on all 
> architectures ? I think so.

Yes, get_user_pages will return a smaller number of pages than requested
if it encounters an unmapped region (or a region without write
permissions for write requests). If this happens, the loop in
ib_umem_get calls get_user_pages again with the next set of pages, and
this time, if the first page still cannot be mapped, an error is returned.

> 
> In ODP case, I'm not sure such check is ever done ?

In ODP, we also call get_user_pages, but only when a page fault occurs
(see ib_umem_odp_map_dma_pages()). This allows the user to pre-register
a memory region that contains unmapped virtual space, and then mmap
different files into that area without needing to re-register.

> (Aside, does it take special mesure to protect shared mapping from
> being read and/or *written* ?)

I'm not sure I understand the question. Shared mappings that the process
is allowed to read or write are also allowed for the HCA (specifically,
to local and remote operations the same process performs using the HCA),
provided the application has registered their virtual address space as a
memory region.

Regards,
Haggai
Roland Dreier April 2, 2015, 4:32 p.m. UTC | #7
On Thu, Apr 2, 2015 at 12:52 AM, Shachar Raindel <raindel@mellanox.com> wrote:
> This is a common practice in the security industry, called
> "responsible disclosure."
>
> Following the kernel  security bugs policy [1], we reported it to
> the kernel security contacts few days before making the issue public.
> Few days after issue became public, we published a clear report to all
> of the relevant mailing lists.

Isn't the point of responsible disclosure to delay disclosure until a
fix is in place?  What's the point of sending a notification to the
kernel security team if you're going to disclose publicly before the
upstream kernel is fixed?

 - R.
Yann Droneaud April 2, 2015, 4:35 p.m. UTC | #8
Hi Haggai,

Le jeudi 02 avril 2015 à 18:18 +0300, Haggai Eran a écrit :
> On 02/04/2015 16:30, Yann Droneaud wrote:
> > Hi,
> > 
> > Le jeudi 02 avril 2015 à 10:52 +0000, Shachar Raindel a écrit :
> >>> -----Original Message-----
> >>> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> >>> Sent: Thursday, April 02, 2015 1:05 PM
> >>> Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
> > 
> >>>> +	/*
> >>>> +	 * If the combination of the addr and size requested for this
> >>> memory
> >>>> +	 * region causes an integer overflow, return error.
> >>>> +	 */
> >>>> +	if ((PAGE_ALIGN(addr + size) <= size) ||
> >>>> +	    (PAGE_ALIGN(addr + size) <= addr))
> >>>> +		return ERR_PTR(-EINVAL);
> >>>> +
> >>>
> >>> Can access_ok() be used here ?
> >>>
> >>>          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
> >>>                         addr, size))
> >>>                   return ERR_PTR(-EINVAL);
> >>>
> >>
> >> No, this will break the current ODP semantics.
> >>
> >> ODP allows the user to register memory that is not accessible yet.
> >> This is a critical design feature, as it allows avoiding holding
> >> a registration cache. Adding this check will break the behavior,
> >> forcing memory to be all accessible when registering an ODP MR.
> >>
> > 
> > Where's the check for the range being in userspace memory space,
> > especially for the ODP case ?
> > 
> > For non ODP case (eg. plain old behavior), does get_user_pages()
> > ensure the requested pages fit in userspace region on all 
> > architectures ? I think so.
> 
> Yes, get_user_pages will return a smaller amount of pages than requested
> if it encounters an unmapped region (or a region without write
> permissions for write requests). If this happens, the loop in
> ib_umem_get calls get_user_pages again with the next set of pages, and
> this time if it the first page still cannot be mapped an error is returned.
> 
> > 
> > In ODP case, I'm not sure such check is ever done ?
> 
> In ODP, we also call get_user_pages, but only when a page fault occurs
> (see ib_umem_odp_map_dma_pages()). This allows the user to pre-register
> a memory region that contains unmapped virtual space, and then mmap
> different files into that area without needing to re-register.
> 

OK, thanks for the description.

> > (Aside, does it take special mesure to protect shared mapping from
> > being read and/or *written* ?)
> 
> I'm not sure I understand the question. Shared mappings that the process
> is allowed to read or write are also allowed for the HCA (specifically,
> to local and remote operations the same process performs using the HCA),
> provided the application has registered their virtual address space as a
> memory region.
> 

I was referring to the description of get_user_pages():

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c?id=v4.0-rc6#n765

 * @force:	whether to force access even when user mapping is currently
 *		protected (but never forces write access to shared mapping).

But since ib_umem_odp_map_dma_pages() uses get_user_pages() with the
force argument set to 0, it's OK.

Another related question: as a large memory range could be registered
by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
what prevents the kernel from mapping a file as the result of mmap(0, ...)
in this region, making it available remotely through IBV_WR_RDMA_READ /
IBV_WR_RDMA_WRITE ?

Again, thanks for the information I was missing to understand how ODP is
checking the memory ranges.

Regards.
Shachar Raindel April 2, 2015, 4:39 p.m. UTC | #9
> -----Original Message-----
> From: Roland Dreier [mailto:roland@purestorage.com]
> Sent: Thursday, April 02, 2015 7:33 PM
> To: Shachar Raindel
> Cc: oss-security@lists.openwall.com; <linux-rdma@vger.kernel.org>
> (linux-rdma@vger.kernel.org); stable@vger.kernel.org
> Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected
> physical memory access
>
> On Thu, Apr 2, 2015 at 12:52 AM, Shachar Raindel <raindel@mellanox.com>
> wrote:
> > This is a common practice in the security industry, called
> > "responsible disclosure."
> >
> > Following the kernel  security bugs policy [1], we reported it to
> > the kernel security contacts few days before making the issue public.
> > Few days after issue became public, we published a clear report to all
> > of the relevant mailing lists.
>
> Isn't the point of responsible disclosure to delay disclosure until a
> fix is in place?  What's the point of sending a notification to the
> kernel security team if you're going to disclose publicly before the
> upstream kernel is fixed?
>

We delayed the disclosure until most major Linux vendors had released a
fix for the issue, give or take some synchronization slack.

The Linux security contact list only guarantees secrecy for 7 days. We
therefore contacted them only close to the date at which fixes were going
to be released, to match their expectations for the period of time between
contact and public disclosure.

Thanks,
--Shachar
Shachar Raindel April 2, 2015, 4:44 p.m. UTC | #10
> -----Original Message-----
> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> Sent: Thursday, April 02, 2015 7:35 PM
> To: Haggai Eran
> Cc: Shachar Raindel; Sagi Grimberg; oss-security@lists.openwall.com;
> <linux-rdma@vger.kernel.org> (linux-rdma@vger.kernel.org); linux-
> kernel@vger.kernel.org; stable@vger.kernel.org
> Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected
> physical memory access
>
> Hi Haggai,
>
> Le jeudi 02 avril 2015 à 18:18 +0300, Haggai Eran a écrit :
> > On 02/04/2015 16:30, Yann Droneaud wrote:
> > > Hi,
> > >
> > > Le jeudi 02 avril 2015 à 10:52 +0000, Shachar Raindel a écrit :
> > >>> -----Original Message-----
> > >>> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> > >>> Sent: Thursday, April 02, 2015 1:05 PM
> > >>> Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
> > >
> > >>>> +	/*
> > >>>> +	 * If the combination of the addr and size requested for this
> > >>> memory
> > >>>> +	 * region causes an integer overflow, return error.
> > >>>> +	 */
> > >>>> +	if ((PAGE_ALIGN(addr + size) <= size) ||
> > >>>> +	    (PAGE_ALIGN(addr + size) <= addr))
> > >>>> +		return ERR_PTR(-EINVAL);
> > >>>> +
> > >>>
> > >>> Can access_ok() be used here ?
> > >>>
> > >>>          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
> > >>>                         addr, size))
> > >>>                   return ERR_PTR(-EINVAL);
> > >>>
> > >>
> > >> No, this will break the current ODP semantics.
> > >>
> > >> ODP allows the user to register memory that is not accessible yet.
> > >> This is a critical design feature, as it allows avoiding holding
> > >> a registration cache. Adding this check will break the behavior,
> > >> forcing memory to be all accessible when registering an ODP MR.
> > >>
> > >
> > > Where's the check for the range being in userspace memory space,
> > > especially for the ODP case ?
> > >
> > > For non ODP case (eg. plain old behavior), does get_user_pages()
> > > ensure the requested pages fit in userspace region on all
> > > architectures ? I think so.
> >
> > Yes, get_user_pages will return a smaller amount of pages than
> requested
> > if it encounters an unmapped region (or a region without write
> > permissions for write requests). If this happens, the loop in
> > ib_umem_get calls get_user_pages again with the next set of pages, and
> > this time if it the first page still cannot be mapped an error is
> returned.
> >
> > >
> > > In ODP case, I'm not sure such check is ever done ?
> >
> > In ODP, we also call get_user_pages, but only when a page fault occurs
> > (see ib_umem_odp_map_dma_pages()). This allows the user to pre-
> register
> > a memory region that contains unmapped virtual space, and then mmap
> > different files into that area without needing to re-register.
> >
>
> OK, thanks for the description.
>
> > > (Aside, does it take special mesure to protect shared mapping from
> > > being read and/or *written* ?)
> >
> > I'm not sure I understand the question. Shared mappings that the
> process
> > is allowed to read or write are also allowed for the HCA
> (specifically,
> > to local and remote operations the same process performs using the
> HCA),
> > provided the application has registered their virtual address space as
> a
> > memory region.
> >
>
> I was refering to description of get_user_pages():
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/mm/g
> up.c?id=v4.0-rc6#n765
>
>  * @force:	whether to force access even when user mapping is currently
>  *		protected (but never forces write access to shared mapping).
>
> But since ib_umem_odp_map_dma_pages() use get_user_pages() with force
> argument set to 0, it's OK.
>
> Another related question: as the large memory range could be registered
> by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
> what's prevent the kernel to map a file as the result of mmap(0, ...)
> in this  region, making it available remotely through IBV_WR_RDMA_READ /
> IBV_WR_RDMA_WRITE ?
>


This is not a bug. This is a feature.

Exposing a file through RDMA, using ODP, can be done exactly like this.
Given that the application explicitly requested this behavior, I don't
see why it is a problem. Actually, some of our tests use such flows.
The mmu notifiers mechanism allows us to do this safely. When the page is
written back to disk, it is removed from the ODP mapping. When it is
accessed by the HCA, it is brought back to RAM.


Thanks,
--Shachar
Haggai Eran April 2, 2015, 6:12 p.m. UTC | #11
On Thursday, April 2, 2015 7:44 PM, Shachar Raindel wrote:
>> -----Original Message-----
>> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
>> Sent: Thursday, April 02, 2015 7:35 PM
>> To: Haggai Eran
>> Cc: Shachar Raindel; Sagi Grimberg; oss-security@lists.openwall.com;
>> <linux-rdma@vger.kernel.org> (linux-rdma@vger.kernel.org); linux-
>> kernel@vger.kernel.org; stable@vger.kernel.org
>> Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected
>> physical memory access
>>
>> Hi Haggai,
>>
>> Le jeudi 02 avril 2015 à 18:18 +0300, Haggai Eran a écrit :
>> > On 02/04/2015 16:30, Yann Droneaud wrote:
>> >> Hi,
>> >>
>> >> Le jeudi 02 avril 2015 à 10:52 +0000, Shachar Raindel a écrit :
>> >>>> -----Original Message-----
>> >>>> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
>> >>>> Sent: Thursday, April 02, 2015 1:05 PM
>> >>>> Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
>> >>
>> >>>>> +      /*
>> >>>>> +       * If the combination of the addr and size requested for this
>> >>>> memory
>> >>>>> +       * region causes an integer overflow, return error.
>> >>>>> +       */
>> >>>>> +      if ((PAGE_ALIGN(addr + size) <= size) ||
>> >>>>> +          (PAGE_ALIGN(addr + size) <= addr))
>> >>>>> +              return ERR_PTR(-EINVAL);
>> >>>>> +
>> >>>>
>> >>>> Can access_ok() be used here ?
>> >>>>
>> >>>>          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
>> >>>>                         addr, size))
>> >>>>                   return ERR_PTR(-EINVAL);
>> >>>>
>> >>>
>> >>> No, this will break the current ODP semantics.
>> >>>
>> >>> ODP allows the user to register memory that is not accessible yet.
>> >>> This is a critical design feature, as it allows avoiding holding
>> >>> a registration cache. Adding this check will break the behavior,
>> >>> forcing memory to be all accessible when registering an ODP MR.
>> >>>
>> >>
>> >> Where's the check for the range being in userspace memory space,
>> >> especially for the ODP case ?
>> >>
>> >> For non ODP case (eg. plain old behavior), does get_user_pages()
>> >> ensure the requested pages fit in userspace region on all
>> >> architectures ? I think so.
>> >
>> > Yes, get_user_pages will return a smaller amount of pages than
>> requested
>> > if it encounters an unmapped region (or a region without write
>> > permissions for write requests). If this happens, the loop in
>> > ib_umem_get calls get_user_pages again with the next set of pages, and
>> > this time if it the first page still cannot be mapped an error is
>> returned.
>> >
>> >>
>> >> In ODP case, I'm not sure such check is ever done ?
>> >
>> > In ODP, we also call get_user_pages, but only when a page fault occurs
>> > (see ib_umem_odp_map_dma_pages()). This allows the user to pre-
>> register
>> > a memory region that contains unmapped virtual space, and then mmap
>> > different files into that area without needing to re-register.
>> >
>>
>> OK, thanks for the description.
>>
>> ...
>>
>> Another related question: as the large memory range could be registered
>> by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
>> what's prevent the kernel to map a file as the result of mmap(0, ...)
>> in this  region, making it available remotely through IBV_WR_RDMA_READ /
>> IBV_WR_RDMA_WRITE ?
>>
> 
> This is not a bug. This is a feature.
> 
> Exposing a file through RDMA, using ODP, can be done exactly like this.
> Given that the application explicitly requested this behavior, I don't
> see why it is a problem. Actually, some of our tests use such flows.
> The mmu notifiers mechanism allow us to do this safely. When the page is
> written back to disk, it is removed from the ODP mapping. When it is
> accessed by the HCA, it is brought back to RAM.
> 

I want to add that we would like to see users registering a very large
memory region (perhaps the entire process address space) for local
access, and then enabling remote access only to specific regions using
memory windows. However, this isn't supported yet by our driver. Still,
there are valid cases where you would want the results of an mmap(0,...)
call to be remotely accessible, when there is enough trust between the
local process and the remote process. It may help a middleware
communication library register a large portion of the address space in
advance, and still work with random pointers given to it by another
application module.

Regards,
Haggai
Yann Droneaud April 2, 2015, 8:40 p.m. UTC | #12
Hi,

Le jeudi 02 avril 2015 à 16:44 +0000, Shachar Raindel a écrit :
> > -----Original Message-----
> > From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> > Sent: Thursday, April 02, 2015 7:35 PM

> > Another related question: as the large memory range could be registered
> > by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
> > what's prevent the kernel to map a file as the result of mmap(0, ...)
> > in this  region, making it available remotely through IBV_WR_RDMA_READ /
> > IBV_WR_RDMA_WRITE ?
> > 
> 
> This is not a bug. This is a feature.
> 
> Exposing a file through RDMA, using ODP, can be done exactly like this.
> Given that the application explicitly requested this behavior, I don't
> see why it is a problem. 

If the application cannot choose what will end up in the region it has
registered, it's an issue!

What might happen if one library in a program calls mmap(0, size, ...) to
load a file storing a secret (a private key), and that file ends up
being mapped in a registered but otherwise free region (afaict, the
kernel is allowed to do it)?
What might happen if one library in a program calls mmap(0,
size, ..., MAP_ANONYMOUS, ...) to allocate memory, calls mlock(), then
writes a secret (a passphrase) in this location, and that area ends up
in the memory region registered for on-demand paging?

The application hasn't chosen to disclose these confidential pieces of
information, but they are available for reading/writing by a remote peer
through RDMA, given it knows the rkey of the memory region (which is a
32-bit value).

I hope I'm missing something, because I'm not confident such
behavior is a feature.


> Actually, some of our tests use such flows.
> The mmu notifiers mechanism allow us to do this safely. When the page is
> written back to disk, it is removed from the ODP mapping. When it is
> accessed by the HCA, it is brought back to RAM.
> 

I'm not disputing the benefits of On Demand Paging, or why it's a
very good feature for exposing files through RDMA.

I'm trying to understand how the application can choose what is exposed
through RDMA if it registers a very large memory region for later use
(but does not actually explicitly map anything there yet): what are the
consequences?

   void *start = sbrk(0);
   size_t size = ULONG_MAX - (unsigned long)start;

   ibv_reg_mr(pd, start, size, IB_ACCESS_ON_DEMAND)


Regards.
Haggai Eran April 3, 2015, 8:39 a.m. UTC | #13
On Thursday, April 2, 2015 11:40 PM, Yann Droneaud <ydroneaud@opteya.com> wrote:
> Le jeudi 02 avril 2015 à 16:44 +0000, Shachar Raindel a écrit :
>> > -----Original Message-----
>> > From: Yann Droneaud [mailto:ydroneaud@opteya.com]
>> > Sent: Thursday, April 02, 2015 7:35 PM
> 
>> > Another related question: as the large memory range could be registered
>> > by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
>> > what's prevent the kernel to map a file as the result of mmap(0, ...)
>> > in this  region, making it available remotely through IBV_WR_RDMA_READ /
>> > IBV_WR_RDMA_WRITE ?
>> >
>>
>> This is not a bug. This is a feature.
>>
>> Exposing a file through RDMA, using ODP, can be done exactly like this.
>> Given that the application explicitly requested this behavior, I don't
>> see why it is a problem.
> 
> If the application cannot choose what will end up in the region it has
> registered, it's an issue !
> 
> What might happen if one library in a program call mmap(0, size, ...) to
> load a file storing a secret (a private key), and that file ends up
> being mapped in an registered but otherwise free region (afaict, the
> kernel is allowed to do it) ?
> What might happen if one library in a program call call mmap(0,
> size, ..., MAP_ANONYMOUS,...) to allocate memory, call mlock(), then
> write in this location a secret (a passphrase), and that area ends up
> in the memory region registered for on demand paging ?
> 
> The application haven't choose to disclose these confidential piece of
> information, but they are available for reading/writing by remote
> through RDMA given it knows the rkey of the memory region (which is a
> 32bits value).
> 
> I hope I'm missing something, because I'm not feeling confident such
> behavor is a feature.

What we are aiming for is the possibility to register the entire process' address 
space for RDMA operations (if the process chooses to use this feature).
This is similar to multiple threads accessing the same address space. I'm sure 
you wouldn't be complaining about the ability of one thread to access the secret 
passphrase mmapped by another thread in your example.

> I'm trying to understand how the application can choose what is exposed
> through RDMA if it registers a very large memory region for later use
> (but do not actually explicitly map something there yet): what's the
> consequences ?
> 
>    void *start = sbrk(0);
>    size_t size = ULONG_MAX - (unsigned long)start;
> 
>    ibv_reg_mr(pd, start, size, IB_ACCESS_ON_DEMAND)

The consequences are exactly as you wrote. Just as giving a non-ODP rkey 
to a remote node allows the node to access the registered memory behind that 
rkey, giving an ODP rkey to a remote node allows that node to access the 
virtual address space behind that rkey.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shachar Raindel April 3, 2015, 11:49 a.m. UTC | #14
Hi Dominique,

> -----Original Message-----
> From: Dominique Martinet [mailto:dominique.martinet@cea.fr]
> Sent: Thursday, April 02, 2015 8:44 PM
> To: Shachar Raindel
> Subject: Re: [oss-security] RE: CVE-2014-8159 kernel: infiniband:
> uverbs: unprotected physical memory access
> 
> Hi,
> 
> Shachar Raindel wrote on Thu, Apr 02, 2015:
> > From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> > > Another related question: as the large memory range could be registered
> > > by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
> > > what's prevent the kernel to map a file as the result of mmap(0, ...)
> > > in this  region, making it available remotely through IBV_WR_RDMA_READ /
> > > IBV_WR_RDMA_WRITE ?
> > >
> >
> > This is not a bug. This is a feature.
> >
> > Exposing a file through RDMA, using ODP, can be done exactly like this.
> > Given that the application explicitly requested this behavior, I don't
> > see why it is a problem. Actually, some of our tests use such flows.
> > The mmu notifiers mechanism allow us to do this safely. When the page is
> > written back to disk, it is removed from the ODP mapping. When it is
> > accessed by the HCA, it is brought back to RAM.
> 
> Forgive the private reply, but I've actually tried that a while ago and


No problem, better than e-mailing the entire world on RDMA-specific topics.
As this discussion is relevant to Linux RDMA users, I'm adding back
the linux-rdma mailing list.

> couldn't get it to work - ibv_reg_mr would return EINVAL on an address
> obtained by mmap.

Were you mmapping a normal disk file, or was the mmap targeting an MMIO of
another hardware device? mmap of a normal disk file should work also with
normal memory registration, assuming you are providing the proper length.

mmap of the MMIO area of another hardware device (i.e. interfacing an FPGA,
NVRAM, or similar things) requires some code changes on both sides. The
current kernel code on the InfiniBand side uses get_user_pages, which
does not support MMIO pages. The proposed PeerDirect patches [1] allow a
peer device to declare ownership of virtual address ranges, and enable such
registration. However, these patches have not yet been merged upstream.

> 
> Conceptually as well I'm not sure how it's supposed to work, mmap should
> only actually issue reads when memory access issues page faults (give or
> take suitable readahead logic), but I don't think direct memory access
> from the IB/RDMA adapter would issue such page faults ?

You are correct. RDMA adapters without ODP support do not issue page faults.
Instead, during memory registration, the ib_umem code calls get_user_pages,
which ensures all relevant pages are in memory, and pins them as needed.

> Likewise on writes, would need the kernel to notice memory has been
> written and pages are dirty/needs flushing.
> 

Similarly, when deregistering a writable memory region, the kernel driver
marks all pages as dirty before unpinning them. You can see the code doing
this in [2].

> So, what am I missing? I'd wager it's that "ODP" you're talking about,
> do you have any documentation I could skim through ?
> 

Liran Liss gave a presentation about ODP at OFA [3]. The technology is
available for Connect-IB devices using the most recent firmware and kernel
versions 3.19 and above.

Thanks,
--Shachar

[1] http://www.spinics.net/lists/linux-rdma/msg21770.html
[2] http://lxr.free-electrons.com/source/drivers/infiniband/core/umem.c#L62
[3] https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Tuesday/pdf/09.30_2014_OFA_Workshop_ODP_update_final.pdf and https://www.youtube.com/watch?v=KbrlsXQbHCw
Yann Droneaud April 3, 2015, 11:49 a.m. UTC | #15
Hi,

Le vendredi 03 avril 2015 à 08:39 +0000, Haggai Eran a écrit :
> On Thursday, April 2, 2015 11:40 PM, Yann Droneaud <ydroneaud@opteya.com> wrote:
> > Le jeudi 02 avril 2015 à 16:44 +0000, Shachar Raindel a écrit :
> >> > -----Original Message-----
> >> > From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> >> > Sent: Thursday, April 02, 2015 7:35 PM
> > 
> >> > Another related question: as the large memory range could be registered
> >> > by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
> >> > what's prevent the kernel to map a file as the result of mmap(0, ...)
> >> > in this  region, making it available remotely through IBV_WR_RDMA_READ /
> >> > IBV_WR_RDMA_WRITE ?
> >> >
> >>
> >> This is not a bug. This is a feature.
> >>
> >> Exposing a file through RDMA, using ODP, can be done exactly like this.
> >> Given that the application explicitly requested this behavior, I don't
> >> see why it is a problem.
> > 
> > If the application cannot choose what will end up in the region it has
> > registered, it's an issue !
> > 
> > What might happen if one library in a program call mmap(0, size, ...) to
> > load a file storing a secret (a private key), and that file ends up
> > being mapped in an registered but otherwise free region (afaict, the
> > kernel is allowed to do it) ?
> > What might happen if one library in a program call call mmap(0,
> > size, ..., MAP_ANONYMOUS,...) to allocate memory, call mlock(), then
> > write in this location a secret (a passphrase), and that area ends up
> > in the memory region registered for on demand paging ?
> > 
> > The application haven't choose to disclose these confidential piece of
> > information, but they are available for reading/writing by remote
> > through RDMA given it knows the rkey of the memory region (which is a
> > 32bits value).
> > 
> > I hope I'm missing something, because I'm not feeling confident such
> > behavor is a feature.
> 
> What we are aiming for is the possibility to register the entire process' address 
> space for RDMA operations (if the process chooses to use this feature).
> This is similar to multiple threads accessing the same address space. I'm sure 
> you wouldn't be complaining about the ability of one thread to access the secret 
> passphrase mmapped by another thread in your example.
> 
> > I'm trying to understand how the application can choose what is exposed
> > through RDMA if it registers a very large memory region for later use
> > (but do not actually explicitly map something there yet): what's the
> > consequences ?
> > 
> >    void *start = sbrk(0);
> >    size_t size = ULONG_MAX - (unsigned long)start;
> > 
> >    ibv_reg_mr(pd, start, size, IB_ACCESS_ON_DEMAND)
> 
> The consequences are exactly as you wrote. Just as giving a non-ODP rkey 
> to a remote node allows the node to access the registered memory behind that 
> rkey, giving an ODP rkey to a remote node allows that node to access the 
> virtual address space behind that rkey.
> 

There's a difference: it's impossible to give a valid non-ODP rkey that
points to a memory region not already mapped (backed by a file, for
example), so the application *chooses* the content of the memory before
making it accessible remotely.

As I understand the last explanation regarding ODP, at creation time
an ODP rkey can point to a free, unused, unallocated memory portion.
At this point the kernel can happily map anything the application
(and its libraries) wants to map, at an (almost) *random* address that
could fall in (or partially in) the ODP memory region.

And I have a problem with such random behavior. Allowing this seems
dangerous and should be done with care.

I believe the application must keep control of what ends up in its
ODP registered memory region.

Especially for a multi-threaded program: imagine one thread creating a
large memory region for its future purposes, then sending the rkey to a
remote peer and waiting for some work to be done.
In the meantime another thread calls mmap(0, ...) to map a file at a
kernel-chosen address, and that address happens to fall in the memory
region registered by the other thread:

1) the first thread is deprived of a portion of memory it was
willing to use;
2) the data used by the second thread is accessible to the remote
peer(s) without that being expected.

Speculatively registering memory seems dangerous for any use case I
could think of.

Regards.
Dominique Martinet April 3, 2015, 12:43 p.m. UTC | #16
Hi,

Shachar Raindel wrote on Fri, Apr 03, 2015 at 11:49:13AM +0000:
> > couldn't get it to work - ibv_reg_mr would return EINVAL on an address
> > obtained by mmap.
> 
> Were you mmaping a normal disk file, or was the mmap targeting an MMIO of
> another hardware device? mmap of a normal disk file should work also with
> normal memory registration, assuming you are providing the proper length.

On a proper file.
It was a year or two ago, I actually tried again now and it does seem
to work alright even on older kernels... I wonder what I did wrong back
then!

> mmap of the MMIO area of another hardware device (i.e. interfacing an FPGA,
> NVRAM, or similar things) requires some code changes on both sides. The
> current kernel code in the infiniband side is using get_user_pages, which
> does not support MMIO pages. The proposed PeerDirect patches [1] allow a peer
> device to declare ownership of virtual address ranges, and enable such
> registration. However, these patches have not yet been merged upstream.

Interesting, I don't need this right now but it's good to know.

> > Conceptually as well I'm not sure how it's supposed to work, mmap should
> > only actually issue reads when memory access issues page faults (give or
> > take suitable readahead logic), but I don't think direct memory access
> > from the IB/RDMA adapter would issue such page faults ?
> 
> You are correct. RDMA adapters without ODP support do not issue page faults.
> Instead, during memory registration, the ib_umem code calls get_user_pages,
> which ensures all relevant pages are in memory, and pins them as needed.
> 
> > Likewise on writes, would need the kernel to notice memory has been
> > written and pages are dirty/needs flushing.
> > 
> 
> Similarly, when deregistering a writable memory region, the kernel driver
> marks all pages as dirty before unpinning them. You can see the code doing
> this in [2].

Ok this makes sense, thanks for clearing it up.

> Liran Liss gave a presentation about ODP at OFA [3]. The technology is
> available for ConnectIB devices using the most recent firmware and kernel
> versions above 3.19.

I don't have any recent kernel around, but I'll give it a shot next
week.
(I'm working on a userspace file server, nfs-ganesha, so being able to
mmap a file and use it directly for I/O is very promising for me!)


> [1] http://www.spinics.net/lists/linux-rdma/msg21770.html
> [2] http://lxr.free-electrons.com/source/drivers/infiniband/core/umem.c#L62
> [3] https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Tuesday/pdf/09.30_2014_OFA_Workshop_ODP_update_final.pdf and https://www.youtube.com/watch?v=KbrlsXQbHCw


Thanks for the detailed reply, I'll dig a bit further and come back
straight to the list if need to.
Yann Droneaud April 13, 2015, 1:29 p.m. UTC | #17
Hi,

Le jeudi 02 avril 2015 à 18:12 +0000, Haggai Eran a écrit :
> On Thursday, April 2, 2015 7:44 PM, Shachar Raindel wrote:
> >> -----Original Message-----
> >> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> >> Le jeudi 02 avril 2015 à 18:18 +0300, Haggai Eran a écrit :
> >> > On 02/04/2015 16:30, Yann Droneaud wrote:
> >> >> Le jeudi 02 avril 2015 à 10:52 +0000, Shachar Raindel a écrit :
> >> >>>> -----Original Message-----
> >> >>>> From: Yann Droneaud [mailto:ydroneaud@opteya.com]
> >> >>>> Sent: Thursday, April 02, 2015 1:05 PM
> >> >>>> Le mercredi 18 mars 2015 à 17:39 +0000, Shachar Raindel a écrit :
> >> >>
> >> >>>>> +      /*
> >> >>>>> +       * If the combination of the addr and size requested for this
> >> >>>> memory
> >> >>>>> +       * region causes an integer overflow, return error.
> >> >>>>> +       */
> >> >>>>> +      if ((PAGE_ALIGN(addr + size) <= size) ||
> >> >>>>> +          (PAGE_ALIGN(addr + size) <= addr))
> >> >>>>> +              return ERR_PTR(-EINVAL);
> >> >>>>> +
> >> >>>>
> >> >>>> Can access_ok() be used here ?
> >> >>>>
> >> >>>>          if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
> >> >>>>                         addr, size))
> >> >>>>                   return ERR_PTR(-EINVAL);
> >> >>>>
> >> >>>
> >> >>> No, this will break the current ODP semantics.
> >> >>>
> >> >>> ODP allows the user to register memory that is not accessible yet.
> >> >>> This is a critical design feature, as it allows avoiding holding
> >> >>> a registration cache. Adding this check will break the behavior,
> >> >>> forcing memory to be all accessible when registering an ODP MR.
> >> >>>
> >> >>
> >> >> Where's the check for the range being in userspace memory space,
> >> >> especially for the ODP case ?
> >> >>
> >> >> For non ODP case (eg. plain old behavior), does get_user_pages()
> >> >> ensure the requested pages fit in userspace region on all
> >> >> architectures ? I think so.
> >> >
> >> > Yes, get_user_pages will return a smaller amount of pages than
> >> requested
> >> > if it encounters an unmapped region (or a region without write
> >> > permissions for write requests). If this happens, the loop in
> >> > ib_umem_get calls get_user_pages again with the next set of pages, and
> >> > this time if it the first page still cannot be mapped an error is
> >> returned.
> >> >
> >> >>
> >> >> In ODP case, I'm not sure such check is ever done ?
> >> >
> >> > In ODP, we also call get_user_pages, but only when a page fault occurs
> >> > (see ib_umem_odp_map_dma_pages()). This allows the user to pre-
> >> register
> >> > a memory region that contains unmapped virtual space, and then mmap
> >> > different files into that area without needing to re-register.
> >> >
> >>
> >> OK, thanks for the description.
> >>
> >> ...
> >>
> >> Another related question: as the large memory range could be registered
> >> by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
> >> what's prevent the kernel to map a file as the result of mmap(0, ...)
> >> in this  region, making it available remotely through IBV_WR_RDMA_READ /
> >> IBV_WR_RDMA_WRITE ?
> >>
> > 
> > This is not a bug. This is a feature.
> > 
> > Exposing a file through RDMA, using ODP, can be done exactly like this.
> > Given that the application explicitly requested this behavior, I don't
> > see why it is a problem. Actually, some of our tests use such flows.
> > The mmu notifiers mechanism allow us to do this safely. When the page is
> > written back to disk, it is removed from the ODP mapping. When it is
> > accessed by the HCA, it is brought back to RAM.
> > 
> 
> I want to add that we would like to see users registering a very large
> memory region (perhaps the entire process address space) for local
> access, and then enabling remote access only to specific regions using
> memory windows. However, this isn't supported yet by our driver.

In such a scheme, the registration must still be handled "manually":
one has to create a memory window to get an rkey to be exchanged with
a peer, so why would one want to register such a large memory region
(the whole process space)?

I guess creating the memory window is faster than registering a memory
region. I'd rather say this is not an excuse to register a larger
memory region (up to the whole process space, current and future), as it
sounds like a surprising optimisation: letting the HCA know about too
many pages just to be sure it already knows some when the process wants
to use them. It seems it would become difficult to handle if there are
too many processes.

I would prefer memory region creation to become costless (through ODP :).

>  Still, there are valid cases where you would still want the results
> of an mmap(0,...) call to be remotely accessible, in cases where there
> is enough trust between the local process and the remote process.

mmap(0, ..., fd) lets the kernel choose where to put the file in the
process virtual memory space, so it may, or may not, perhaps only
partially, end up in an ODP pre-registered memory region covering a
range that is unallocated/unused so far.

I don't think one wants that to happen.

>  It may help a middleware communication library register a large
> portion of the address space in advance, and still work with random
> pointers given to it by another application module.
> 

But as said at the beginning of your message, the middleware would have
to bind a memory window before posting work requests / exposing an rkey
for the "random pointers".

So I fail to understand how ODP could be used when it comes to
registering a memory region not yet backed by anything.

Regards.
Haggai Eran April 14, 2015, 8:11 a.m. UTC | #18
On 13/04/2015 16:29, Yann Droneaud wrote:
> Le jeudi 02 avril 2015 à 18:12 +0000, Haggai Eran a écrit :
...
>>
>> I want to add that we would like to see users registering a very large
>> memory region (perhaps the entire process address space) for local
>> access, and then enabling remote access only to specific regions using
>> memory windows. However, this isn't supported yet by our driver.
> 
> In such scheme, the registration must still be handled "manually":
> one has to create a memory window to get a rkey to be exchanged with
> a peer, so why one would want to register such a large memory region
> (the whole process space) ?
> 
> I guess creating the memory window is faster than registering memory
> region. 
Right.

It takes time to create and fill the hardware's page tables. Using
memory windows allows you to reuse the work done previously, while still
having a more granular control over the RDMA permissions. The larger MR
can be created with only local permissions, and the memory window can
add specific remote permissions to a smaller range. The memory window
utilizes the page tables created for the memory region.

> I'd rather say this is not an excuse to register a larger
> memory region (up to the whole process space, current and future) as it 
> sounds like a surprising optimisation: let the HCA known too many
> pages just to be sure it already knows some when the process want to 
> use them. It seems it would become difficult to handle if there's too
> many processes.
Are you worried about pinning too many pages? That is an issue we want
to solve with ODP :)

> 
> I would prefer creating memory region becoming costless (through ODP :).
I agree :)

> 
>>  Still, there are valid cases where you would still want the results
>> of an mmap(0,...) call to be remotely accessible, in cases where there
>> is enough trust between the local process and the remote process.
> 
> mmap(0, ...., fd) let the kernel choose where to put the file in 
> process virtual memory space, so it may, may not, partially, end up in 
> an ODP pre registered memory region for a range unallocated/unused yet.
> 
> I don't think one want such to happen.
I think that in some cases the benefit of allowing this outweighs the
risks. This is why it is an opt-in feature.

> 
>>  It may help a middleware communication library register a large
>> portion of the address space in advance, and still work with random
>> pointers given to it by another application module.
>>
> 
> But as said in the beginnig of your message, the middleware would have
> bind a memory window before posting work request / exposing rkey for
> the "random pointers".
> 
> So I fail to understand how could be used ODP when it comes to 
> registering a memory region not yet backed by something.

In this scenario, the middleware would first register the full address
space as an ODP memory region with local permissions only. When it wants
to provide remote access to some buffer, it would bind a memory window
over the ODP MR. This is possible with multiple processes because it
uses the virtual memory system without pinning. It won't cause random
mmap regions to be mapped for RDMA without the specific intent of the
application.

However, we currently don't have support for memory windows over ODP
MRs. Even if we did, there is some performance penalty due to binding
and invalidating memory windows. Some applications will still need full
process address space access for RDMA.

Regards,
Haggai

Patch

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index aec7a6a..8c014b5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -99,6 +99,14 @@ 
 	if (dmasync)
 		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
 
+	/*
+	 * If the combination of the addr and size requested for this memory
+	 * region causes an integer overflow, return error.
+	 */
+	if ((PAGE_ALIGN(addr + size) <= size) ||
+	    (PAGE_ALIGN(addr + size) <= addr))
+		return ERR_PTR(-EINVAL);
+
 	if (!can_do_mlock())
 		return ERR_PTR(-EPERM);