diff mbox

[3/4] IB/uverbs: ex_query_device: check request's comp_mask

Message ID 063956366559d6919693aabb324bab83d676bc28.1421931555.git.ydroneaud@opteya.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Yann Droneaud Jan. 22, 2015, 1:28 p.m. UTC
This patch ensures the extended QUERY_DEVICE uverbs request's
comp_mask has only known values. If userspace returns unknown
features, -EINVAL will be returned, allowing to probe/discover
which features are currently supported by the kernel.

Moreover, it also ensure the requested features set in comp_mask
are sequentially set, not skipping intermediate features, eg. the
"highest" requested feature also request all the "lower" ones.
This way each new feature will have to be stacked on top of the
existing ones: this is especially important for the request and
response data structures where fields are added after the
current ones when expanded to support a new feature.

Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Eli Cohen <eli@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
---
 drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Haggai Eran Jan. 25, 2015, 3:23 p.m. UTC | #1
On 22/01/2015 15:28, Yann Droneaud wrote:
> This patch ensures the extended QUERY_DEVICE uverbs request's
> comp_mask has only known values. If userspace returns unknown
> features, -EINVAL will be returned, allowing to probe/discover
> which features are currently supported by the kernel.

This probing process will be much more cumbersome than it needs to be
because userspace will have to call QUERY_DEVICE repetitively with
different comp_mask fields until it finds out the exact set of supported
bits.

> Moreover, it also ensure the requested features set in comp_mask
> are sequentially set, not skipping intermediate features, eg. the
> "highest" requested feature also request all the "lower" ones.
> This way each new feature will have to be stacked on top of the
> existing ones: this is especially important for the request and
> response data structures where fields are added after the
> current ones when expanded to support a new feature.

I think it is perfectly acceptable that not all drivers will implement
all available features, and so you shouldn't enforce this requirement.
Also, it makes the comp_mask nothing more than a wasteful version number
between 0 and 63.

In the specific case of QUERY_DEVICE you might argue that there isn't
any need for input comp_mask, only for output, and then you may enforce
the input comp_mask will always be zero. However, you will in any case
need to be able to extended the size of the response in the future.

> 
> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
> Cc: Sagi Grimberg <sagig@mellanox.com>
> Cc: Shachar Raindel <raindel@mellanox.com>
> Cc: Eli Cohen <eli@mellanox.com>
> Cc: Haggai Eran <haggaie@mellanox.com>
> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
> ---
>  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> index 8668b328b7e6..80a1c90f1dbf 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
>  	if (err)
>  		return err;
>  
> +	if (cmd.comp_mask & (cmd.comp_mask + 1))
> +		return -EINVAL;
> +
> +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
> +		return -EINVAL;
> +
>  	if (cmd.reserved)
>  		return -EINVAL;
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Jan. 26, 2015, 11:17 a.m. UTC | #2
Hi,

(adding linux-api@ for comments:

 We're introducing a new "uverb" in the InfiniBand subsystem:
 extended QUERY_DEVICE in v3.19:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6#n209
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6#n224

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5a77abf9a97a
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=860f10a799c8

 As you may not already know, InfiniBand subsystem use a write() 
 syscall[1] to issue ioctl()-like operations. Many operations (aka   
 verbs) are available [2], for each there's a query data structures
 and for some there's a response data structure [3]. As a result to a 
 write() operation, kernel could allocate resources on the task behalf 
 and/or write data back to userspace provided buffers whose pointers 
 were part of buffer passed to write().

 I've expressed concern on the way the new extended QUERY_DEVICE
 was trying to be itself expandable: by silently ignoring shorter
 buffer, returning truncated data, the interface seems awkward.

"Re: [PATCH v2 06/17] IB/core: Add support for extended query device caps"
http://mid.gmane.org/1418216676.11111.45.camel@opteya.com

"Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps"
http://mid.gmane.org/1418733236.2779.26.camel@opteya.com
http://mid.gmane.org/1418824811.3334.3.camel@dworkin
http://mid.gmane.org/1421844612.13543.40.camel@opteya.com

 I recognize the author's intention to provide a way for userspace
 to retrieve easily the highest supported information as something
 desirable.

 But I believe we must be more strict on the request content and
 fail for any unrecognized, unsupported, incorrect bits to make
 safer to extend the interface latter.

 I've submitted a patchset[4] to address these issues.
 But, while I'm convinced it the way to go, I'm not able to find
 how it could be made to satisfy the original author expectations.
 
 I hope linux-api@ readers could provide some insight regarding
 the way we should implement such interface

[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599

[2] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n81

[3] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6

[4] http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com

)

Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit :
> On 22/01/2015 15:28, Yann Droneaud wrote:
> > This patch ensures the extended QUERY_DEVICE uverbs request's
> > comp_mask has only known values. If userspace returns unknown
> > features, -EINVAL will be returned, allowing to probe/discover
> > which features are currently supported by the kernel.
> 
> This probing process will be much more cumbersome than it needs to be
> because userspace will have to call QUERY_DEVICE repetitively with
> different comp_mask fields until it finds out the exact set of supported
> bits.
> 

O(log2(N))

Or you had to add a different interface, dedicated to retrieve the exact
supported feature mask.

> > Moreover, it also ensure the requested features set in comp_mask
> > are sequentially set, not skipping intermediate features, eg. the
> > "highest" requested feature also request all the "lower" ones.
> > This way each new feature will have to be stacked on top of the
> > existing ones: this is especially important for the request and
> > response data structures where fields are added after the
> > current ones when expanded to support a new feature.
> 
> I think it is perfectly acceptable that not all drivers will implement
> all available features, and so you shouldn't enforce this requirement.

With regard to QUERY_DEVICE: the data structure layout depends on the
comp_mask value. So either you propose a way to express multipart data
structure (see CMSG or NETLINK), or we have to ensure the data structure
is ever-growing, with each new chunck stacked over the existing ones:
that's the purpose of :

	if (cmd.comp_mask & (cmd.comp_mask + 1))
		return -EINVAL;

> Also, it makes the comp_mask nothing more than a wasteful version number
> between 0 and 63.

That's what I've already explained earlier in "Re: [PATCH v3 06/17]
IB/core: Add support for extended query device caps":

http://mid.gmane.org/1421844612.13543.40.camel@opteya.com

> 
> In the specific case of QUERY_DEVICE you might argue that there isn't
> any need for input comp_mask, only for output, and then you may enforce
> the input comp_mask will always be zero.

The extended QUERY_DEVICE uverbs as currently merged is using comp_mask
from input to choose to report on-demand-paging related value. So it
seems it's needed.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297

> However, you will in any case need to be able to extended the size of the response in the future.
> 

That's already the case for on demand paging.

> > 
> > Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
> > Cc: Sagi Grimberg <sagig@mellanox.com>
> > Cc: Shachar Raindel <raindel@mellanox.com>
> > Cc: Eli Cohen <eli@mellanox.com>
> > Cc: Haggai Eran <haggaie@mellanox.com>
> > Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
> > ---
> >  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> > index 8668b328b7e6..80a1c90f1dbf 100644
> > --- a/drivers/infiniband/core/uverbs_cmd.c
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
> >  	if (err)
> >  		return err;
> >  
> > +	if (cmd.comp_mask & (cmd.comp_mask + 1))
> > +		return -EINVAL;
> > +
> > +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
> > +		return -EINVAL;
> > +

If we keep the checks on output buffer size from patch 1, returning
-ENOSPC in case of size mismatch, and if we are sure that no bit in
input comp_mask will ever trigger a call to a kernel function that can
make the uverb fail, the latter check on known value could be dropped,
allowing the userspace to set the highest mask (-1): userspace
will use -ENOSPC to probe the expected size of the response buffer
to match the highest supported comp_mask. But it's going to hurt
userspace application not ready to receive -ENOSPC on newer kernel
with extended QUERY_DEVICE ABI ... Oops.

So in this end, the safest way to ensure userspace is doing the correct
thing so that we have backward and forward compatibility is to check
for known values in comp_mask, check for response buffer size and ensure
that data structure chunk are stacked.

The tighter are the checks now, the easier the interface could be
extended latter.

> >  	if (cmd.reserved)
> >  		return -EINVAL;
> >  
> > 
> 

Regards.
Haggai Eran Jan. 27, 2015, 6:50 a.m. UTC | #3
On 26/01/2015 13:17, Yann Droneaud wrote:
> ...
> Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit :
>> On 22/01/2015 15:28, Yann Droneaud wrote:
>>> This patch ensures the extended QUERY_DEVICE uverbs request's
>>> comp_mask has only known values. If userspace returns unknown
>>> features, -EINVAL will be returned, allowing to probe/discover
>>> which features are currently supported by the kernel.
>>
>> This probing process will be much more cumbersome than it needs to be
>> because userspace will have to call QUERY_DEVICE repetitively with
>> different comp_mask fields until it finds out the exact set of supported
>> bits.
>>
> 
> O(log2(N))

I don't think user space developers will be happy having to do trial and 
error to determine what features the kernel driver supports. It might be 
even more then O(log2(N)). If my understanding of comp_mask bits usage is 
correct it would O(N). But it's not the time complexity I'm worried about,
it's the fact that it requires user-space developers to go through hoops in
order to get information that can be much more easily exported.

> Or you had to add a different interface, dedicated to retrieve the exact
> supported feature mask.
> 
>>> Moreover, it also ensure the requested features set in comp_mask
>>> are sequentially set, not skipping intermediate features, eg. the
>>> "highest" requested feature also request all the "lower" ones.
>>> This way each new feature will have to be stacked on top of the
>>> existing ones: this is especially important for the request and
>>> response data structures where fields are added after the
>>> current ones when expanded to support a new feature.
>>
>> I think it is perfectly acceptable that not all drivers will implement
>> all available features, and so you shouldn't enforce this requirement.
> 
> With regard to QUERY_DEVICE: the data structure layout depends on the
> comp_mask value. So either you propose a way to express multipart data
> structure (see CMSG or NETLINK), or we have to ensure the data structure
> is ever-growing, with each new chunck stacked over the existing ones:
> that's the purpose of :
> 
> 	if (cmd.comp_mask & (cmd.comp_mask + 1))
> 		return -EINVAL;
> 
>> Also, it makes the comp_mask nothing more than a wasteful version number
>> between 0 and 63.
> 
> That's what I've already explained earlier in "Re: [PATCH v3 06/17]
> IB/core: Add support for extended query device caps":
> 
> http://mid.gmane.org/1421844612.13543.40.camel@opteya.com

Yes, you wrote there:
> Regarding comp_mask (not for this particular verb):
> 
> It's not clear whether request's comp_mask describe the request or the
> response, as such I'm puzzled.
> 
> How would the kernel and the userspace be able to parse the request and
> the response if they ignore unknown bits ?
> 
> How would they be able to skip the unrecognized chunk of the request and
> response buffer ?
> 
> How one bit in a comp_mask is related to a chunk in the request or
> response ?
> 
> It's likely the kernel or userspace would have to skip the remaining
> comp_mask's bits after encountering an unknown bit as the size of the
> corresponding chunk in request or response would be unknown, making
> impossible to locate the corresponding chunk for the next bit set in
> comp_mask. Having said that, comp_mask is just a complicated way of
> expressing a version, which is turn describe a size (ever growing).

It is my understanding that each comp_mask bit marks a set of fields in 
the command or in the response struct as valid, so the struct format 
remains the same and the kernel and userspace don't need to make 
difficult calculation as to where each field is, but you can still pass 
a high bit set in comp_mask with one of the lower bits cleared.

I couldn't find this explicit detail in the mailing list, but I did found
a presentation that was presented in OFA International Developer 
Workshop 2013 [1], that gave an example of of an verb where each 
comp_mask bit marked a different field as valid.

> 
>>
>> In the specific case of QUERY_DEVICE you might argue that there isn't
>> any need for input comp_mask, only for output, and then you may enforce
>> the input comp_mask will always be zero.
> 
> The extended QUERY_DEVICE uverbs as currently merged is using comp_mask
> from input to choose to report on-demand-paging related value. So it
> seems it's needed.
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297
> 
>> However, you will in any case need to be able to extended the size of the response in the future.
>>
> 
> That's already the case for on demand paging.
> 
>>>
>>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
>>> Cc: Sagi Grimberg <sagig@mellanox.com>
>>> Cc: Shachar Raindel <raindel@mellanox.com>
>>> Cc: Eli Cohen <eli@mellanox.com>
>>> Cc: Haggai Eran <haggaie@mellanox.com>
>>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
>>> ---
>>>  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
>>> index 8668b328b7e6..80a1c90f1dbf 100644
>>> --- a/drivers/infiniband/core/uverbs_cmd.c
>>> +++ b/drivers/infiniband/core/uverbs_cmd.c
>>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
>>>  	if (err)
>>>  		return err;
>>>  
>>> +	if (cmd.comp_mask & (cmd.comp_mask + 1))
>>> +		return -EINVAL;
>>> +
>>> +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
>>> +		return -EINVAL;
>>> +
> 
> If we keep the checks on output buffer size from patch 1, returning
> -ENOSPC in case of size mismatch, and if we are sure that no bit in
> input comp_mask will ever trigger a call to a kernel function that can
> make the uverb fail, the latter check on known value could be dropped,
> allowing the userspace to set the highest mask (-1): userspace
> will use -ENOSPC to probe the expected size of the response buffer
> to match the highest supported comp_mask. But it's going to hurt
> userspace application not ready to receive -ENOSPC on newer kernel
> with extended QUERY_DEVICE ABI ... Oops.
> 
> So in this end, the safest way to ensure userspace is doing the correct
> thing so that we have backward and forward compatibility is to check
> for known values in comp_mask, check for response buffer size and ensure
> that data structure chunk are stacked.
> 
> The tighter are the checks now, the easier the interface could be
> extended latter.

I understand this position, and I generally agree, but I think that 
specifically for a verb like QUERY_DEVICE that only reads information from 
the kernel driver to user-space, there is no harm in the kernel just 
providing all the information it can fit in the response buffer provided 
by user-space.

Let me explain: newer fields are added to the kernel response struct at the 
end, together with a new comp_mask bit.

Older user-space with newer kernels will simply ask only for the buffer 
size they care about. The fact that the struct is truncated doesn't affect
these programs because the truncated struct is a valid struct that was 
presented by the kernel in an older version.

They may or may not receive a comp_mask bit they don't recognize, but that 
only tells them that the kernel supports new features they don't know about.

Newer user-space with older kernel will give a larger buffer then the 
kernel can fill. The kernel only fills in the beginning of the user-space 
buffer, and provides user-space with the comp_mask bits that mark which
fields are valid. So user-space can tell that the end of the buffer isn't 
valid. 

In my implementation I also left the ending uninitialized, but the 
kernel can zero it if you think it is important.

So I hope you agree this scheme is extendible and allows keeping backward 
and forward compatibility. If you can think of another scheme that will be 
more strict with the buffer sizes, but doesn't require user-space to do 
extra work, I'll be happy to hear about it.

Regards,
Haggai

[1] Extending Verbs API, Tzahi Oved
    https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Jan. 28, 2015, 1:19 p.m. UTC | #4
Hi,

Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit :
> On 26/01/2015 13:17, Yann Droneaud wrote:
> > ...
> > Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit :
> >> On 22/01/2015 15:28, Yann Droneaud wrote:
> >>> This patch ensures the extended QUERY_DEVICE uverbs request's
> >>> comp_mask has only known values. If userspace returns unknown
> >>> features, -EINVAL will be returned, allowing to probe/discover
> >>> which features are currently supported by the kernel.
> >>
> >> This probing process will be much more cumbersome than it needs to be
> >> because userspace will have to call QUERY_DEVICE repetitively with
> >> different comp_mask fields until it finds out the exact set of supported
> >> bits.
> >>
> > 
> > O(log2(N))
> 
> I don't think user space developers will be happy having to do trial and 
> error to determine what features the kernel driver supports. It might be 
> even more then O(log2(N)). If my understanding of comp_mask bits usage is 
> correct it would O(N). But it's not the time complexity I'm worried about,
> it's the fact that it requires user-space developers to go through hoops in
> order to get information that can be much more easily exported.
> 

But there's currently *NO* such mean that could give a hint to userspace
developer whether one bit of request's comp_mask is currently effective
in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW
and DESTROY_FLOW: unsupported bits trigger -EINVAL.

In QUERY_DEVICE, userspace developer is allowed to set request's 
comp_mask to -1: it won't hurt it on current kernel, so why not using 
this value or any other random value ? The program will run: it's "Works
For Me".

But the same program (either binary or source code) might fail on
newer kernel where some bits in comp_mask gain a meaning not supported
by the program.

> > Or you had to add a different interface, dedicated to retrieve the exact
> > supported feature mask.
> > 
> >>> Moreover, it also ensure the requested features set in comp_mask
> >>> are sequentially set, not skipping intermediate features, eg. the
> >>> "highest" requested feature also request all the "lower" ones.
> >>> This way each new feature will have to be stacked on top of the
> >>> existing ones: this is especially important for the request and
> >>> response data structures where fields are added after the
> >>> current ones when expanded to support a new feature.
> >>
> >> I think it is perfectly acceptable that not all drivers will implement
> >> all available features, and so you shouldn't enforce this requirement.
> > 
> > With regard to QUERY_DEVICE: the data structure layout depends on the
> > comp_mask value. So either you propose a way to express multipart data
> > structure (see CMSG or NETLINK), or we have to ensure the data structure
> > is ever-growing, with each new chunck stacked over the existing ones:
> > that's the purpose of :
> > 
> > 	if (cmd.comp_mask & (cmd.comp_mask + 1))
> > 		return -EINVAL;
> > 
> >> Also, it makes the comp_mask nothing more than a wasteful version number
> >> between 0 and 63.
> > 
> > That's what I've already explained earlier in "Re: [PATCH v3 06/17]
> > IB/core: Add support for extended query device caps":
> > 
> > http://mid.gmane.org/1421844612.13543.40.camel@opteya.com
> 
> Yes, you wrote there:
> > Regarding comp_mask (not for this particular verb):
> > 
> > It's not clear whether request's comp_mask describe the request or the
> > response, as such I'm puzzled.
> > 
> > How would the kernel and the userspace be able to parse the request and
> > the response if they ignore unknown bits ?
> > 
> > How would they be able to skip the unrecognized chunk of the request and
> > response buffer ?
> > 
> > How one bit in a comp_mask is related to a chunk in the request or
> > response ?
> > 
> > It's likely the kernel or userspace would have to skip the remaining
> > comp_mask's bits after encountering an unknown bit as the size of the
> > corresponding chunk in request or response would be unknown, making
> > impossible to locate the corresponding chunk for the next bit set in
> > comp_mask. Having said that, comp_mask is just a complicated way of
> > expressing a version, which is turn describe a size (ever growing).
> 
> It is my understanding that each comp_mask bit marks a set of fields in 
> the command or in the response struct as valid, so the struct format 
> remains the same and the kernel and userspace don't need to make 
> difficult calculation as to where each field is, but you can still pass 
> a high bit set in comp_mask with one of the lower bits cleared.
> 

How can the struct format remain the same, if some fields are added
to implement newer feature ?

> I couldn't find this explicit detail in the mailing list, but I did found
> a presentation that was presented in OFA International Developer 
> Workshop 2013 [1], that gave an example of of an verb where each 
> comp_mask bit marked a different field as valid.
> 

Thanks for the link to the presentation.

As I read it the presentation:

- in request, comp_mask give hint to the kernel of additional fields in
the request.

- in response, comp_mask give hint to userspace regarding the presence
of additional fields in the response.

But commit 860f10a799c8 ("IB/core: Add flags for on demand paging
support") on top of commit 5a77abf9a97a ("IB/core: Add support for
extended query device caps") is not doing it the expected way
as one bit set in request's comp_mask has direct effect on
the response's fields.

To be conformant with the "Extending Verbs API" presentation,
no check should be done on request's comp_mask, and on-demand paging
information should be returned to userspace in all case, provided 
there's enough room in the response buffer.

Anyway, allowing userspace to set any "hint" in the request's comp_mask
is opening a pandora box: being allowed to set any value, userspace
program will use any random value, as it will work with current kernel.

But the same program used on newer kernel is going to trigger some
behavior unknown to the application or return an error the application
is not ready to handle ... breaking any kind of forward compatibility 
promise.

It's likely the kernel will have to use the size of the request to 
guess the hints in comp_mask are corrects to handle such. In such case, 
relying only on the size of the request might be enough to detect 
extended request, without the need for comp_mask.

Regarding the answer, if the response buffer is smaller than the size
of the extended response, will the kernel keep setting the response's 
comp_mask with all the bits it supports or will it unset the bits for 
the fields it wasn't able to fit in the response buffer ?

It's likely the kernel will have to use the size of the response buffer 
to set the response's comp_mask. 

Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging
support") on top of commit 5a77abf9a97a ("IB/core: Add support for
extended query device caps") make it possible for the kernel to return
truncated response with full comp_mask. Such is going to break silently 
when one will introduce a data structure with an ABI mismatch between
userspace and kernel (for example x86 vs amd64 ... we have some recent
exemples).

> > 
> >>
> >> In the specific case of QUERY_DEVICE you might argue that there isn't
> >> any need for input comp_mask, only for output, and then you may enforce
> >> the input comp_mask will always be zero.
> > 
> > The extended QUERY_DEVICE uverbs as currently merged is using comp_mask
> > from input to choose to report on-demand-paging related value. So it
> > seems it's needed.
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297
> > 
> >> However, you will in any case need to be able to extended the size of the response in the future.
> >>
> > 
> > That's already the case for on demand paging.
> > 
> >>>
> >>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
> >>> Cc: Sagi Grimberg <sagig@mellanox.com>
> >>> Cc: Shachar Raindel <raindel@mellanox.com>
> >>> Cc: Eli Cohen <eli@mellanox.com>
> >>> Cc: Haggai Eran <haggaie@mellanox.com>
> >>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
> >>> ---
> >>>  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
> >>>  1 file changed, 6 insertions(+)
> >>>
> >>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> >>> index 8668b328b7e6..80a1c90f1dbf 100644
> >>> --- a/drivers/infiniband/core/uverbs_cmd.c
> >>> +++ b/drivers/infiniband/core/uverbs_cmd.c
> >>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
> >>>  	if (err)
> >>>  		return err;
> >>>  
> >>> +	if (cmd.comp_mask & (cmd.comp_mask + 1))
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
> >>> +		return -EINVAL;
> >>> +
> > 
> > If we keep the checks on output buffer size from patch 1, returning
> > -ENOSPC in case of size mismatch, and if we are sure that no bit in
> > input comp_mask will ever trigger a call to a kernel function that can
> > make the uverb fail, the latter check on known value could be dropped,
> > allowing the userspace to set the highest mask (-1): userspace
> > will use -ENOSPC to probe the expected size of the response buffer
> > to match the highest supported comp_mask. But it's going to hurt
> > userspace application not ready to receive -ENOSPC on newer kernel
> > with extended QUERY_DEVICE ABI ... Oops.
> > 
> > So in this end, the safest way to ensure userspace is doing the correct
> > thing so that we have backward and forward compatibility is to check
> > for known values in comp_mask, check for response buffer size and ensure
> > that data structure chunk are stacked.
> > 
> > The tighter are the checks now, the easier the interface could be
> > extended latter.
> 
> I understand this position, and I generally agree, but I think that 
> specifically for a verb like QUERY_DEVICE that only reads information from 
> the kernel driver to user-space, there is no harm in the kernel just 
> providing all the information it can fit in the response buffer provided 
> by user-space.
> 

Returning as much as possible information to userspace is OK,
but it's doing it the wrong way.

I'm not specifically discussing QUERY_DEVICE, as I'm concerned with 
every extended uverbs to be added to the kernel.

> Let me explain: newer fields are added to the kernel response struct at the 
> end, together with a new comp_mask bit.
> 
> Older user-space with newer kernels will simply ask only for the buffer 
> size they care about. The fact that the struct is truncated doesn't affect
> these programs because the truncated struct is a valid struct that was 
> presented by the kernel in an older version.
> 

You cannot ensure the userspace being correct when specifying a request.
It's a wrong assumption (perhaps not so wrong in the case of
InfiniBand/RDMA, as most userspace program are using libibverbs,
librdmacm and provider's libraries).

That's why we must not be liberal with the request and check any bit of
it for being something valid, so that erroneous values are catch now,
ensuring userspace is not trying to request things we don't know yet
and it's not aware of it too.

> They may or may not receive a comp_mask bit they don't recognize, but that 
> only tells them that the kernel supports new features they don't know about.
> 
> Newer user-space with older kernel will give a larger buffer then the 
> kernel can fill. The kernel only fills in the beginning of the user-space 
> buffer, and provides user-space with the comp_mask bits that mark which
> fields are valid. So user-space can tell that the end of the buffer isn't 
> valid. 
> 

But older userspace programs with newer kernel might break, as request's
comp_mask was allowed to contains invalid values in older kernel.

> In my implementation I also left the ending uninitialized, but the 
> kernel can zero it if you think it is important.
> 

Why not, but it's a minor point.

> So I hope you agree this scheme is extendible and allows keeping backward 
> and forward compatibility. If you can think of another scheme that will be 
> more strict with the buffer sizes, but doesn't require user-space to do 
> extra work, I'll be happy to hear about it.
> 

I'm sorry about that but I cannot agree with the current scheme.

> [1] Extending Verbs API, Tzahi Oved
>     https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf

Regards.
Haggai Eran Jan. 28, 2015, 3:40 p.m. UTC | #5
On 28/01/2015 15:19, Yann Droneaud wrote:
> Hi,
> 
> Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit :
>> On 26/01/2015 13:17, Yann Droneaud wrote:
>>> ...
>>> Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit :
>>>> On 22/01/2015 15:28, Yann Droneaud wrote:
>>>>> This patch ensures the extended QUERY_DEVICE uverbs request's
>>>>> comp_mask has only known values. If userspace returns unknown
>>>>> features, -EINVAL will be returned, allowing to probe/discover
>>>>> which features are currently supported by the kernel.
>>>>
>>>> This probing process will be much more cumbersome than it needs to be
>>>> because userspace will have to call QUERY_DEVICE repetitively with
>>>> different comp_mask fields until it finds out the exact set of supported
>>>> bits.
>>>>
>>>
>>> O(log2(N))
>>
>> I don't think user space developers will be happy having to do trial and 
>> error to determine what features the kernel driver supports. It might be 
>> even more then O(log2(N)). If my understanding of comp_mask bits usage is 
>> correct it would O(N). But it's not the time complexity I'm worried about,
>> it's the fact that it requires user-space developers to go through hoops in
>> order to get information that can be much more easily exported.
>>
> 
> But there's currently *NO* such mean that could give a hint to userspace
> developer whether one bit of request's comp_mask is currently effective
> in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW
> and DESTROY_FLOW: unsupported bits trigger -EINVAL.
> 
> In QUERY_DEVICE, userspace developer is allowed to set request's 
> comp_mask to -1: it won't hurt it on current kernel, so why not using 
> this value or any other random value ? The program will run: it's "Works
> For Me".
> 
> But the same program (either binary or source code) might fail on
> newer kernel where some bits in comp_mask gain a meaning not supported
> by the program.

Why don't we make the command comp_mask in QUERY_DEVICE zero in both
versions. The behavior of the command with comp_mask = 0 will be to
return the maximum amount of data that fits in the response buffer. The
kernel will return -EINVAL if the input command comp_mask is not zero in
the current version.

If in the future we want to alter the behavior or add more input fields
to QUERY_DEVICE, we would use new bits.

>>> Or you had to add a different interface, dedicated to retrieve the exact
>>> supported feature mask.
>>>
>>>>> Moreover, it also ensure the requested features set in comp_mask
>>>>> are sequentially set, not skipping intermediate features, eg. the
>>>>> "highest" requested feature also request all the "lower" ones.
>>>>> This way each new feature will have to be stacked on top of the
>>>>> existing ones: this is especially important for the request and
>>>>> response data structures where fields are added after the
>>>>> current ones when expanded to support a new feature.
>>>>
>>>> I think it is perfectly acceptable that not all drivers will implement
>>>> all available features, and so you shouldn't enforce this requirement.
>>>
>>> With regard to QUERY_DEVICE: the data structure layout depends on the
>>> comp_mask value. So either you propose a way to express multipart data
>>> structure (see CMSG or NETLINK), or we have to ensure the data structure
>>> is ever-growing, with each new chunck stacked over the existing ones:
>>> that's the purpose of :
>>>
>>> 	if (cmd.comp_mask & (cmd.comp_mask + 1))
>>> 		return -EINVAL;
>>>
>>>> Also, it makes the comp_mask nothing more than a wasteful version number
>>>> between 0 and 63.
>>>
>>> That's what I've already explained earlier in "Re: [PATCH v3 06/17]
>>> IB/core: Add support for extended query device caps":
>>>
>>> http://mid.gmane.org/1421844612.13543.40.camel@opteya.com
>>
>> Yes, you wrote there:
>>> Regarding comp_mask (not for this particular verb):
>>>
>>> It's not clear whether request's comp_mask describe the request or the
>>> response, as such I'm puzzled.
>>>
>>> How would the kernel and the userspace be able to parse the request and
>>> the response if they ignore unknown bits ?
>>>
>>> How would they be able to skip the unrecognized chunk of the request and
>>> response buffer ?
>>>
>>> How one bit in a comp_mask is related to a chunk in the request or
>>> response ?
>>>
>>> It's likely the kernel or userspace would have to skip the remaining
>>> comp_mask's bits after encountering an unknown bit as the size of the
>>> corresponding chunk in request or response would be unknown, making
>>> impossible to locate the corresponding chunk for the next bit set in
>>> comp_mask. Having said that, comp_mask is just a complicated way of
>>> expressing a version, which is turn describe a size (ever growing).
>>
>> It is my understanding that each comp_mask bit marks a set of fields in 
>> the command or in the response struct as valid, so the struct format 
>> remains the same and the kernel and userspace don't need to make 
>> difficult calculation as to where each field is, but you can still pass 
>> a high bit set in comp_mask with one of the lower bits cleared.
>>
> 
> How can the struct format remain the same, if some fields are added
> to implement newer feature ?

I meant that the binary format for an older version is the prefix of the
binary format of the newer version.

>> I couldn't find this explicit detail in the mailing list, but I did found
>> a presentation that was presented in OFA International Developer 
>> Workshop 2013 [1], that gave an example of of an verb where each 
>> comp_mask bit marked a different field as valid.
>>
> 
> Thanks for the link to the presentation.
> 
> As I read it the presentation:
> 
> - in request, comp_mask give hint to the kernel of additional fields in
> the request.
> 
> - in response, comp_mask give hint to userspace regarding the presence
> of additional fields in the response.

I'm not sure that in request you can regard the comp_mask as a hint. I
agree that you need to enforce it in general and reject unknown bits
there (although I thought QUERY_DEVICE would be an exception).

> But commit 860f10a799c8 ("IB/core: Add flags for on demand paging
> support") on top of commit 5a77abf9a97a ("IB/core: Add support for
> extended query device caps") is not doing it the expected way
> as one bit set in request's comp_mask has direct effect on
> the response's fields.
> 
> To be conformant with the "Extending Verbs API" presentation,
> no check should be done on request's comp_mask, and on-demand paging
> information should be returned to userspace in all case, provided 
> there's enough room in the response buffer.
> 
> Anyway, allowing userspace to set any "hint" in the request's comp_mask
> is opening a pandora box: being allowed to set any value, userspace
> program will use any random value, as it will work with current kernel.
> 
> But the same program used on newer kernel is going to trigger some
> behavior unknown to the application or return an error the application
> is not ready to handle ... breaking any kind of forward compatibility 
> promise.

I don't think that the application should handle unknown response bits
as an error. As you wrote, I see these as hints about more data in the
response that the application is free to ignore.

> It's likely the kernel will have to use the size of the request to 
> guess the hints in comp_mask are corrects to handle such. In such case, 
> relying only on the size of the request might be enough to detect 
> extended request, without the need for comp_mask.
> 
> Regarding the answer, if the response buffer is smaller than the size
> of the extended response, will the kernel keep setting the response's 
> comp_mask with all the bits it supports or will it unset the bits for 
> the fields it wasn't able to fit in the response buffer ?
> 
> It's likely the kernel will have to use the size of the response buffer 
> to set the response's comp_mask. 
> 
> Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging
> support") on top of commit 5a77abf9a97a ("IB/core: Add support for
> extended query device caps") make it possible for the kernel to return
> truncated response with full comp_mask. Such is going to break silently 
> when one will introduce a data structure with an ABI mismatch between
> userspace and kernel (for example x86 vs amd64 ... we have some recent
> exemples).

As I said, I think unknown bits in the comp_mask are hints about unknown
fields in the response that can be ignored by the application. However,
I can agree to having the kernel checking the response buffer size as
you wrote, and only setting the valid comp_mask bits.

> 
>>>
>>>>
>>>> In the specific case of QUERY_DEVICE you might argue that there isn't
>>>> any need for input comp_mask, only for output, and then you may enforce
>>>> the input comp_mask will always be zero.
>>>
>>> The extended QUERY_DEVICE uverbs as currently merged is using comp_mask
>>> from input to choose to report on-demand-paging related value. So it
>>> seems it's needed.
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297
>>>
>>>> However, you will in any case need to be able to extended the size of the response in the future.
>>>>
>>>
>>> That's already the case for on demand paging.
>>>
>>>>>
>>>>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
>>>>> Cc: Sagi Grimberg <sagig@mellanox.com>
>>>>> Cc: Shachar Raindel <raindel@mellanox.com>
>>>>> Cc: Eli Cohen <eli@mellanox.com>
>>>>> Cc: Haggai Eran <haggaie@mellanox.com>
>>>>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
>>>>> ---
>>>>>  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
>>>>>  1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
>>>>> index 8668b328b7e6..80a1c90f1dbf 100644
>>>>> --- a/drivers/infiniband/core/uverbs_cmd.c
>>>>> +++ b/drivers/infiniband/core/uverbs_cmd.c
>>>>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
>>>>>  	if (err)
>>>>>  		return err;
>>>>>  
>>>>> +	if (cmd.comp_mask & (cmd.comp_mask + 1))
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
>>>>> +		return -EINVAL;
>>>>> +
>>>
>>> If we keep the checks on output buffer size from patch 1, returning
>>> -ENOSPC in case of size mismatch, and if we are sure that no bit in
>>> input comp_mask will ever trigger a call to a kernel function that can
>>> make the uverb fail, the latter check on known value could be dropped,
>>> allowing the userspace to set the highest mask (-1): userspace
>>> will use -ENOSPC to probe the expected size of the response buffer
>>> to match the highest supported comp_mask. But it's going to hurt
>>> userspace application not ready to receive -ENOSPC on newer kernel
>>> with extended QUERY_DEVICE ABI ... Oops.
>>>
>>> So in this end, the safest way to ensure userspace is doing the correct
>>> thing so that we have backward and forward compatibility is to check
>>> for known values in comp_mask, check for response buffer size and ensure
>>> that data structure chunk are stacked.
>>>
>>> The tighter are the checks now, the easier the interface could be
>>> extended latter.
>>
>> I understand this position, and I generally agree, but I think that 
>> specifically for a verb like QUERY_DEVICE that only reads information from 
>> the kernel driver to user-space, there is no harm in the kernel just 
>> providing all the information it can fit in the response buffer provided 
>> by user-space.
>>
> 
> Returning as much as possible information to userspace is OK,
> but it's doing it the wrong way.
> 
> I'm not specifically discussing QUERY_DEVICE, as I'm concerned with 
> every extended uverbs to be added to the kernel.
> 
>> Let me explain: newer fields are added to the kernel response struct at the 
>> end, together with a new comp_mask bit.
>>
>> Older user-space with newer kernels will simply ask only for the buffer 
>> size they care about. The fact that the struct is truncated doesn't affect
>> these programs because the truncated struct is a valid struct that was 
>> presented by the kernel in an older version.
>>
> 
> You cannot ensure the userspace being correct when specifying a request.
> It's a wrong assumption (perhaps not so wrong in the case of
> InfiniBand/RDMA, as most userspace program are using libibverbs,
> librdmacm and provider's libraries).
> 
> That's why we must not be liberal with the request and check any bit of
> it for being something valid, so that erroneous values are catch now,
> ensuring userspace is not trying to request things we don't know yet
> and it's not aware of it too.

Does the solution I proposed above satisfy this requirement?
- The kernel validates input size == command struct size and
cmd.comp_mask == 0.
- The kernel fills as much information as it can fit in the buffer
provided by userspace.
- The kernel marks which fields it was able to fill in the response's
comp_mask.

Regards,
Haggai


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Jan. 29, 2015, 6 p.m. UTC | #6
Hi,

Le mercredi 28 janvier 2015 à 17:40 +0200, Haggai Eran a écrit :
> On 28/01/2015 15:19, Yann Droneaud wrote:
> > Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit :
> >> On 26/01/2015 13:17, Yann Droneaud wrote:
> >>> ...
> >>> Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit :
> >>>> On 22/01/2015 15:28, Yann Droneaud wrote:
> >>>>> This patch ensures the extended QUERY_DEVICE uverbs request's
> >>>>> comp_mask has only known values. If userspace returns unknown
> >>>>> features, -EINVAL will be returned, allowing to probe/discover
> >>>>> which features are currently supported by the kernel.
> >>>>
> >>>> This probing process will be much more cumbersome than it needs to be
> >>>> because userspace will have to call QUERY_DEVICE repetitively with
> >>>> different comp_mask fields until it finds out the exact set of supported
> >>>> bits.
> >>>>
> >>>
> >>> O(log2(N))
> >>
> >> I don't think user space developers will be happy having to do trial and 
> >> error to determine what features the kernel driver supports. It might be 
> >> even more then O(log2(N)). If my understanding of comp_mask bits usage is 
> >> correct it would O(N). But it's not the time complexity I'm worried about,
> >> it's the fact that it requires user-space developers to go through hoops in
> >> order to get information that can be much more easily exported.
> >>
> > 
> > But there's currently *NO* such mean that could give a hint to userspace
> > developer whether one bit of request's comp_mask is currently effective
> > in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW
> > and DESTROY_FLOW: unsupported bits trigger -EINVAL.
> > 
> > In QUERY_DEVICE, userspace developer is allowed to set request's 
> > comp_mask to -1: it won't hurt it on current kernel, so why not using 
> > this value or any other random value ? The program will run: it's "Works
> > For Me".
> > 
> > But the same program (either binary or source code) might fail on
> > newer kernel where some bits in comp_mask gain a meaning not supported
> > by the program.
> 
> Why don't we make the command comp_mask in QUERY_DEVICE zero in both
> versions. The behavior of the command with comp_mask = 0 will be to
> return the maximum amount of data that fits in the response buffer. The
> kernel will return -EINVAL if the input command comp_mask is not zero in
> the current version.
> 

Yes, that's exactly what I wanted to do.

> If in the future we want to alter the behavior or add more input fields
> to QUERY_DEVICE, we would use new bits.
> 

Yes.

> >>> Or you had to add a different interface, dedicated to retrieve the exact
> >>> supported feature mask.
> >>>
> >>>>> Moreover, it also ensure the requested features set in comp_mask
> >>>>> are sequentially set, not skipping intermediate features, eg. the
> >>>>> "highest" requested feature also request all the "lower" ones.
> >>>>> This way each new feature will have to be stacked on top of the
> >>>>> existing ones: this is especially important for the request and
> >>>>> response data structures where fields are added after the
> >>>>> current ones when expanded to support a new feature.
> >>>>
> >>>> I think it is perfectly acceptable that not all drivers will implement
> >>>> all available features, and so you shouldn't enforce this requirement.
> >>>
> >>> With regard to QUERY_DEVICE: the data structure layout depends on the
> >>> comp_mask value. So either you propose a way to express multipart data
> >>> structure (see CMSG or NETLINK), or we have to ensure the data structure
> >>> is ever-growing, with each new chunck stacked over the existing ones:
> >>> that's the purpose of :
> >>>
> >>> 	if (cmd.comp_mask & (cmd.comp_mask + 1))
> >>> 		return -EINVAL;
> >>>
> >>>> Also, it makes the comp_mask nothing more than a wasteful version number
> >>>> between 0 and 63.
> >>>
> >>> That's what I've already explained earlier in "Re: [PATCH v3 06/17]
> >>> IB/core: Add support for extended query device caps":
> >>>
> >>> http://mid.gmane.org/1421844612.13543.40.camel@opteya.com
> >>
> >> Yes, you wrote there:
> >>> Regarding comp_mask (not for this particular verb):
> >>>
> >>> It's not clear whether request's comp_mask describe the request or the
> >>> response, as such I'm puzzled.
> >>>
> >>> How would the kernel and the userspace be able to parse the request and
> >>> the response if they ignore unknown bits ?
> >>>
> >>> How would they be able to skip the unrecognized chunk of the request and
> >>> response buffer ?
> >>>
> >>> How one bit in a comp_mask is related to a chunk in the request or
> >>> response ?
> >>>
> >>> It's likely the kernel or userspace would have to skip the remaining
> >>> comp_mask's bits after encountering an unknown bit as the size of the
> >>> corresponding chunk in request or response would be unknown, making
> >>> impossible to locate the corresponding chunk for the next bit set in
> >>> comp_mask. Having said that, comp_mask is just a complicated way of
> >>> expressing a version, which is turn describe a size (ever growing).
> >>
> >> It is my understanding that each comp_mask bit marks a set of fields in 
> >> the command or in the response struct as valid, so the struct format 
> >> remains the same and the kernel and userspace don't need to make 
> >> difficult calculation as to where each field is, but you can still pass 
> >> a high bit set in comp_mask with one of the lower bits cleared.
> >>
> > 
> > How can the struct format remain the same, if some fields are added
> > to implement newer feature ?
> 
> I meant that the binary format for an older version is the prefix of the
> binary format of the newer version.
> 

OK.

> >> I couldn't find this explicit detail in the mailing list, but I did found
> >> a presentation that was presented in OFA International Developer 
> >> Workshop 2013 [1], that gave an example of of an verb where each 
> >> comp_mask bit marked a different field as valid.
> >>
> > 
> > Thanks for the link to the presentation.
> > 
> > As I read it the presentation:
> > 
> > - in request, comp_mask give hint to the kernel of additional fields in
> > the request.
> > 
> > - in response, comp_mask give hint to userspace regarding the presence
> > of additional fields in the response.
> 
> I'm not sure that in request you can regard the comp_mask as a hint.

It's a hint because the kernel also have to check the size match:
if userspace say: "hey I've given you more fields" but the request
size is shorter than the kernel expect for such fields, the kernel
must return an error so that userspace is taught to fix its requests.

> I agree that you need to enforce it in general and reject unknown bits
> there (although I thought QUERY_DEVICE would be an exception).
> 

I think it's better to stick to one common behavior so that every
extended uverbs behave the same way.

> > But commit 860f10a799c8 ("IB/core: Add flags for on demand paging
> > support") on top of commit 5a77abf9a97a ("IB/core: Add support for
> > extended query device caps") is not doing it the expected way
> > as one bit set in request's comp_mask has direct effect on
> > the response's fields.
> > 
> > To be conformant with the "Extending Verbs API" presentation,
> > no check should be done on request's comp_mask, and on-demand paging
> > information should be returned to userspace in all case, provided 
> > there's enough room in the response buffer.
> > 
> > Anyway, allowing userspace to set any "hint" in the request's comp_mask
> > is opening a pandora box: being allowed to set any value, userspace
> > program will use any random value, as it will work with current kernel.
> > 
> > But the same program used on newer kernel is going to trigger some
> > behavior unknown to the application or return an error the application
> > is not ready to handle ... breaking any kind of forward compatibility 
> > promise.
> 
> I don't think that the application should handle unknown response bits
> as an error. As you wrote, I see these as hints about more data in the
> response that the application is free to ignore.
> 

I tried to explain the issue regarding older userspace setting random
bits in request's comp_mask: it older kernel allows such, the same
program will likely have issue with newer kernel where the request's
comp_mask bits have a meaning now: the program will either face
unknown behavior: the kernel does something the userspace was not
expecting, or the kernel refuse to do it because there's not enough
information in the request to fullfil it.

That's why we must catch now unknown value to prevent userspace to
use it now. If unknown value are not allowed, userspace program won't
use it and then it could run asis on newer kernel.


> > It's likely the kernel will have to use the size of the request to 
> > guess the hints in comp_mask are corrects to handle such. In such case, 
> > relying only on the size of the request might be enough to detect 
> > extended request, without the need for comp_mask.
> > 
> > Regarding the answer, if the response buffer is smaller than the size
> > of the extended response, will the kernel keep setting the response's 
> > comp_mask with all the bits it supports or will it unset the bits for 
> > the fields it wasn't able to fit in the response buffer ?
> > 
> > It's likely the kernel will have to use the size of the response buffer 
> > to set the response's comp_mask. 
> > 
> > Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging
> > support") on top of commit 5a77abf9a97a ("IB/core: Add support for
> > extended query device caps") make it possible for the kernel to return
> > truncated response with full comp_mask. Such is going to break silently 
> > when one will introduce a data structure with an ABI mismatch between
> > userspace and kernel (for example x86 vs amd64 ... we have some recent
> > exemples).
> 
> As I said, I think unknown bits in the comp_mask are hints about unknown
> fields in the response that can be ignored by the application. However,
> I can agree to having the kernel checking the response buffer size as
> you wrote, and only setting the valid comp_mask bits.
> 

OK.

> > 
> >>>
> >>>>
> >>>> In the specific case of QUERY_DEVICE you might argue that there isn't
> >>>> any need for input comp_mask, only for output, and then you may enforce
> >>>> the input comp_mask will always be zero.
> >>>
> >>> The extended QUERY_DEVICE uverbs as currently merged is using comp_mask
> >>> from input to choose to report on-demand-paging related value. So it
> >>> seems it's needed.
> >>>
> >>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297
> >>>
> >>>> However, you will in any case need to be able to extended the size of the response in the future.
> >>>>
> >>>
> >>> That's already the case for on demand paging.
> >>>
> >>>>>
> >>>>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
> >>>>> Cc: Sagi Grimberg <sagig@mellanox.com>
> >>>>> Cc: Shachar Raindel <raindel@mellanox.com>
> >>>>> Cc: Eli Cohen <eli@mellanox.com>
> >>>>> Cc: Haggai Eran <haggaie@mellanox.com>
> >>>>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
> >>>>> ---
> >>>>>  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
> >>>>>  1 file changed, 6 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> >>>>> index 8668b328b7e6..80a1c90f1dbf 100644
> >>>>> --- a/drivers/infiniband/core/uverbs_cmd.c
> >>>>> +++ b/drivers/infiniband/core/uverbs_cmd.c
> >>>>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
> >>>>>  	if (err)
> >>>>>  		return err;
> >>>>>  
> >>>>> +	if (cmd.comp_mask & (cmd.comp_mask + 1))
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>> +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>
> >>> If we keep the checks on output buffer size from patch 1, returning
> >>> -ENOSPC in case of size mismatch, and if we are sure that no bit in
> >>> input comp_mask will ever trigger a call to a kernel function that can
> >>> make the uverb fail, the latter check on known value could be dropped,
> >>> allowing the userspace to set the highest mask (-1): userspace
> >>> will use -ENOSPC to probe the expected size of the response buffer
> >>> to match the highest supported comp_mask. But it's going to hurt
> >>> userspace application not ready to receive -ENOSPC on newer kernel
> >>> with extended QUERY_DEVICE ABI ... Oops.
> >>>
> >>> So in this end, the safest way to ensure userspace is doing the correct
> >>> thing so that we have backward and forward compatibility is to check
> >>> for known values in comp_mask, check for response buffer size and ensure
> >>> that data structure chunk are stacked.
> >>>
> >>> The tighter are the checks now, the easier the interface could be
> >>> extended latter.
> >>
> >> I understand this position, and I generally agree, but I think that 
> >> specifically for a verb like QUERY_DEVICE that only reads information from 
> >> the kernel driver to user-space, there is no harm in the kernel just 
> >> providing all the information it can fit in the response buffer provided 
> >> by user-space.
> >>
> > 
> > Returning as much as possible information to userspace is OK,
> > but it's doing it the wrong way.
> > 
> > I'm not specifically discussing QUERY_DEVICE, as I'm concerned with 
> > every extended uverbs to be added to the kernel.
> > 
> >> Let me explain: newer fields are added to the kernel response struct at the 
> >> end, together with a new comp_mask bit.
> >>
> >> Older user-space with newer kernels will simply ask only for the buffer 
> >> size they care about. The fact that the struct is truncated doesn't affect
> >> these programs because the truncated struct is a valid struct that was 
> >> presented by the kernel in an older version.
> >>
> > 
> > You cannot ensure the userspace being correct when specifying a request.
> > It's a wrong assumption (perhaps not so wrong in the case of
> > InfiniBand/RDMA, as most userspace program are using libibverbs,
> > librdmacm and provider's libraries).
> > 
> > That's why we must not be liberal with the request and check any bit of
> > it for being something valid, so that erroneous values are catch now,
> > ensuring userspace is not trying to request things we don't know yet
> > and it's not aware of it too.
> 
> Does the solution I proposed above satisfy this requirement?
> - The kernel validates input size == command struct size and
> cmd.comp_mask == 0.
> - The kernel fills as much information as it can fit in the buffer
> provided by userspace.
> - The kernel marks which fields it was able to fill in the response's
> comp_mask.
> 

Yes. This scheme follow the "Extending Verbs API" presention
while ensuring the command can be extended regarding to good
practice one can find in
http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

I'm going to submit the patches to have this behavior.

Regards.
Jason Gunthorpe Jan. 29, 2015, 6:09 p.m. UTC | #7
On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote:

> But the same program (either binary or source code) might fail on
> newer kernel where some bits in comp_mask gain a meaning not supported
> by the program.

To clarify some of this:

The intention of the comp_mask scheme was to define a growing
structure.

The invariant is: A bigger structure allways has all smaller
structures (past versions) as a prefix.

So the size alone is enough for user space to know what is going
on. userspace only processes structure members up to the size returned
by the kernel, and there is only one structure layout.

The purpose of the output comp_mask is to allow drivers to declare
they do not support the new structure members, and comp_mask bits
should only be used with new structure members do not have a natural
'null' value.

Further, we need to distinguish cases where additional data is
mandatory or optional.

query_device is clearly optional, the only purpose the input comp mask
serves is to reduce expensive work in the driver by indicating that
some result bits are not needed. It is perfectly OK for the kernel to
ignore the input comp mask, and OK for userspace to typically request
all bits. (indeed, I think this is a pretty silly optimization myself,
and the original patch that motivated this was restructured so it is
not needed)

Other verbs must be more strict, if one side does not understand the
comp mask then the verb must fail in some way. Presumably the valid
comp mask values are inferred in some way (eg query_device says the
extended function is supported)

Everything should fit in one of those two general cases..
  
> Thanks for the link to the presentation.

If I recall the presentation is old and had some flaws that were
addressed before things made it into libibverbs..

Thank you for looking at this, it is very important that this scheme
is used properly, and it is very easy to make mistakes. I haven't had
time to review any of these new patches myself.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Jan. 29, 2015, 6:35 p.m. UTC | #8
Hi,

Le jeudi 29 janvier 2015 à 11:09 -0700, Jason Gunthorpe a écrit :
> On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote:
> 
> > But the same program (either binary or source code) might fail on
> > newer kernel where some bits in comp_mask gain a meaning not supported
> > by the program.
> 
> To clarify some of this:
> 
> The intention of the comp_mask scheme was to define a growing
> structure.
> 
> The invariant is: A bigger structure allways has all smaller
> structures (past versions) as a prefix.
> 
> So the size alone is enough for user space to know what is going
> on. userspace only processes structure members up to the size returned
> by the kernel, and there is only one structure layout.
> 

Unfortunately, the userspace don't get the size of the returned data:
it's only a single "write()" syscall after all.

So the kernel is left with the comp_mask in the response to express
the returned size.

> The purpose of the output comp_mask is to allow drivers to declare
> they do not support the new structure members, and comp_mask bits
> should only be used with new structure members do not have a natural
> 'null' value.
> 

It's not (yet) about drivers as the output's comp_mask (in the 
response), is set by uverbs layer.

Do you think we have to enforce a 1 to 1 relation between the request 
comp_mask's bits and the response comp_mask's bits ?

That would not fit with my understanding of "Extending Verbs API" 
presentation [1]: request's comp_mask describe the fields present in 
the request and response's comp_mask describe the fields present in the 
response.

> Further, we need to distinguish cases where additional data is
> mandatory or optional.
> 
> query_device is clearly optional, the only purpose the input comp mask
> serves is to reduce expensive work in the driver by indicating that
> some result bits are not needed.

That's not how I've understand the purpose of the request's comp_mask
after reading the presentation: request's comp_mask describe the content
of the request. Then that additional content can trigger the presence 
of additional fields in the response if needed.

>  It is perfectly OK for the kernel to
> ignore the input comp mask, and OK for userspace to typically request
> all bits. (indeed, I think this is a pretty silly optimization myself,
> and the original patch that motivated this was restructured so it is
> not needed)
> 

It's not at all OK to ignore request's comp_mask: it has to be checked
to find if there's a feature requested the kernel cannot fullfil, and 
any unknown bit is a possible feature request. So the kernel has to 
reject the request as a whole as it don't know how to process it.

As we don't know yet how we would extend the extended QUERY_DEVICE
uverbs, the kernel have to refuse any random value.

http://blog.ffwll.ch/2013/11/botching-up-ioctls.html


> Other verbs must be more strict, if one side does not understand the
> comp mask then the verb must fail in some way. Presumably the valid
> comp mask values are inferred in some way (eg query_device says the
> extended function is supported)
> 

> Everything should fit in one of those two general cases..

I believe every interface should return an error for any unknown value
(every unused bits of a data structure), that is, there's only one case.

>   
> > Thanks for the link to the presentation.
> 
> If I recall the presentation is old and had some flaws that were
> addressed before things made it into libibverbs..
> 

I have to have a look to this part of libibverbs: I'm not sure
the extended QUERY_DEVICE is already implemented.

> Thank you for looking at this, it is very important that this scheme
> is used properly, and it is very easy to make mistakes. I haven't had
> time to review any of these new patches myself.
> 

I hope you would find some time to review my latest patchset
so that we don't miss a corner case. It's starting to become urgent.

Regards.
Jason Gunthorpe Jan. 29, 2015, 7:14 p.m. UTC | #9
On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote:

> Unfortunately, the userspace don't get the size of the returned data:
> it's only a single "write()" syscall after all.

A write syscall that behaves nothing like write() actually should, so
I don't see why we can't have

resp_len = write(fd,inout_buf,sizeof(input_len));

Returning 0 from write makes no sense at all.

In the fullness of your patchset it will maintain the invariant that
resp_len <= sizeof(input_len)

Which seems OK to me considering what we have to work with, and a
significantly better choice than 0.

> So the kernel is left with the comp_mask in the response to express
> the returned size.

It was never the intent that the size should be computed from
comp_mask. If the size is necessary it must be explicit.

In this instance if the size is not returned then libibverbs will have
to zero the entire user buffer before passing it to the kernel,
because it has to ensure any tail for the user app is 0'd.

Remember for santity we want comp mask bits for things that can't be 0
and we want 0 for things that are not set.

struct ib_query_device_ex res;
ibv_query_device_ex(..,res,sizeof(res));

printf("%u",res.foo_cap); // 0 if unsupported is OK
if (res.comp_mask & COMP_BAR)
  printf("%u",res.bar_thingy);  // 0 has meaning, must use COMP_BAR

Ideally we want to minimize the number of COMP bits because it is a
huge burden on the end user to work with them.

> > The purpose of the output comp_mask is to allow drivers to declare
> > they do not support the new structure members, and comp_mask bits
> > should only be used with new structure members do not have a natural
> > 'null' value.

> It's not (yet) about drivers as the output's comp_mask (in the 
> response), is set by uverbs layer.
> 
> Do you think we have to enforce a 1 to 1 relation between the request 
> comp_mask's bits and the response comp_mask's bits ?

In the query context I would be happy enough to just treat the in
bound buffer as uninitialized buffer space, but certainly generally
speaking the comp_mask+size should refer to the structure - so
input/output are not directly related.

> > Further, we need to distinguish cases where additional data is
> > mandatory or optional.
> > 
> > query_device is clearly optional, the only purpose the input comp mask
> > serves is to reduce expensive work in the driver by indicating that
> > some result bits are not needed.
> 
> That's not how I've understand the purpose of the request's comp_mask
> after reading the presentation: request's comp_mask describe the content
> of the request. Then that additional content can trigger the presence 
> of additional fields in the response if needed.

Agreed - what I described was an abuse that some early Mellanox
patches for a query_ex included and it was just a shortcut to avoid
defining a dedicated input structure. It seems that same scheme was
copied into these new patches.

> >  It is perfectly OK for the kernel to
> > ignore the input comp mask, and OK for userspace to typically request
> > all bits. (indeed, I think this is a pretty silly optimization myself,
> > and the original patch that motivated this was restructured so it is
> > not needed)
> > 
> 
> It's not at all OK to ignore request's comp_mask: it has to be checked
> to find if there's a feature requested the kernel cannot fullfil, and 
> any unknown bit is a possible feature request. So the kernel has to 
> reject the request as a whole as it don't know how to process it.

In the context of the above scheme the input structure was just this:

struct query_input
{
  u64 requested_output;
};

ie it wasn't actually a comp_mask, it just overlapped the comp_mask
bytes on output.

Such a use was never explicitly documented, and IIRC was never
actually included in libibverbs.

Unless someone has a strong reason we need to do this I am inclined to
think that your interpretation is the better one, and we could always
add a requested_output to the input someday if it became urgent.

In any event, you are right, we can't ingore the input bytes today and
expect to give them meaning tomorrow.

> As we don't know yet how we would extend the extended QUERY_DEVICE
> uverbs, the kernel have to refuse any random value.
> 
> http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

or the kernel treats QUERY_DEVICE as an output only function and never
inspects the in/out buffer at all.

> > > Thanks for the link to the presentation.
> > 
> > If I recall the presentation is old and had some flaws that were
> > addressed before things made it into libibverbs..
> 
> I have to have a look to this part of libibverbs: I'm not sure
> the extended QUERY_DEVICE is already implemented.

The patches turned out to be unnecessary and were dropped, IIRC.
 
> > Thank you for looking at this, it is very important that this scheme
> > is used properly, and it is very easy to make mistakes. I haven't had
> > time to review any of these new patches myself.

> I hope you would find some time to review my latest patchset so that
> we don't miss a corner case. It's starting to become urgent.

I have and will, thank you

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud Jan. 29, 2015, 9:14 p.m. UTC | #10
Le jeudi 29 janvier 2015 à 12:14 -0700, Jason Gunthorpe a écrit :
> On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote:
> 
> > Unfortunately, the userspace don't get the size of the returned data:
> > it's only a single "write()" syscall after all.
> 
> A write syscall that behaves nothing like write() actually should, so
> I don't see why we can't have
> 
> resp_len = write(fd,inout_buf,sizeof(input_len));
> 
> Returning 0 from write makes no sense at all.
> 

0 is not the result of the write() syscall, as for extended uverbs
I've ensure the return value of the write() syscall was the size
it was given. See
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599

    705                 err = uverbs_ex_cmd_table[command](file,
    706                                                    &ucore,
    707                                                    &uhw);
    708 
    709                 if (err)
    710                         return err;
    711 
    712                 return written_count;

See commit f21519b23c1 ("IB/core: extended command: an improved
infrastructure for uverbs commands")

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f21519b23c1b6fa25366be4114ccf7fcf1c190f9

> In the fullness of your patchset it will maintain the invariant that
> resp_len <= sizeof(input_len)
> 

I don't catch your point: the response can be bigger than the request.

> Which seems OK to me considering what we have to work with, and a
> significantly better choice than 0.
> 
> > So the kernel is left with the comp_mask in the response to express
> > the returned size.
> 
> It was never the intent that the size should be computed from
> comp_mask. If the size is necessary it must be explicit.
> 
> In this instance if the size is not returned then libibverbs will have
> to zero the entire user buffer before passing it to the kernel,
> because it has to ensure any tail for the user app is 0'd.
> 

The proposed patch ensure the integrity of the response regarding
comp_mask: if a bit is set in response's comp_mask that means the
related fields are presents (and valid).

So before parsing the response fields, userspace have to check
response's comp_mask: fields access must be protected by correct
check on comp_mask ... but it might be useful for the userspace
developer to clear the response buffer just in case he/she decided
to be lazy with the check.

> Remember for santity we want comp mask bits for things that can't be 0

For me, it's better if a bit is set in response's comp_mask by the 
kernel when the kernel have written something in the related fields 
even if the those fields are all 0.

> and we want 0 for things that are not set.
> 
> struct ib_query_device_ex res;
> ibv_query_device_ex(..,res,sizeof(res));
> 
> printf("%u",res.foo_cap); // 0 if unsupported is OK
> if (res.comp_mask & COMP_BAR)
>   printf("%u",res.bar_thingy);  // 0 has meaning, must use COMP_BAR
> 
> Ideally we want to minimize the number of COMP bits because it is a
> huge burden on the end user to work with them.
> 

Sure. So I think comp_mask is just an overly complicated way of
expressing the version and the size of the response.

> > > The purpose of the output comp_mask is to allow drivers to declare
> > > they do not support the new structure members, and comp_mask bits
> > > should only be used with new structure members do not have a natural
> > > 'null' value.
> 
> > It's not (yet) about drivers as the output's comp_mask (in the 
> > response), is set by uverbs layer.
> > 
> > Do you think we have to enforce a 1 to 1 relation between the request 
> > comp_mask's bits and the response comp_mask's bits ?
> 
> In the query context I would be happy enough to just treat the in
> bound buffer as uninitialized buffer space, but certainly generally
> speaking the comp_mask+size should refer to the structure - so
> input/output are not directly related.
> 

OK.

> > > Further, we need to distinguish cases where additional data is
> > > mandatory or optional.
> > > 
> > > query_device is clearly optional, the only purpose the input comp mask
> > > serves is to reduce expensive work in the driver by indicating that
> > > some result bits are not needed.
> > 
> > That's not how I've understand the purpose of the request's comp_mask
> > after reading the presentation: request's comp_mask describe the content
> > of the request. Then that additional content can trigger the presence 
> > of additional fields in the response if needed.
> 
> Agreed - what I described was an abuse that some early Mellanox
> patches for a query_ex included and it was just a shortcut to avoid
> defining a dedicated input structure. It seems that same scheme was
> copied into these new patches.
> 

OK

> > >  It is perfectly OK for the kernel to
> > > ignore the input comp mask, and OK for userspace to typically request
> > > all bits. (indeed, I think this is a pretty silly optimization myself,
> > > and the original patch that motivated this was restructured so it is
> > > not needed)
> > > 
> > 
> > It's not at all OK to ignore request's comp_mask: it has to be checked
> > to find if there's a feature requested the kernel cannot fullfil, and 
> > any unknown bit is a possible feature request. So the kernel has to 
> > reject the request as a whole as it don't know how to process it.
> 
> In the context of the above scheme the input structure was just this:
> 
> struct query_input
> {
>   u64 requested_output;
> };
> 
> ie it wasn't actually a comp_mask, it just overlapped the comp_mask
> bytes on output.
> 
> Such a use was never explicitly documented, and IIRC was never
> actually included in libibverbs.
> 

OK

> Unless someone has a strong reason we need to do this I am inclined to
> think that your interpretation is the better one, and we could always
> add a requested_output to the input someday if it became urgent.
> 

Sure one field can be added later: in such case, one bit of request's
comp_mask will gain the meaning: "request_output field is present in the
request".

> In any event, you are right, we can't ingore the input bytes today and
> expect to give them meaning tomorrow.
> 

Sure.

> > As we don't know yet how we would extend the extended QUERY_DEVICE
> > uverbs, the kernel have to refuse any random value.
> > 
> > http://blog.ffwll.ch/2013/11/botching-up-ioctls.html
> 
> or the kernel treats QUERY_DEVICE as an output only function and never
> inspects the in/out buffer at all.
> 

It's something we could think of but the extended uverbs is already
merge with the comp_mask field in the request and it cost only 4 bytes,
while allowing us to make mistake on the first iteration of the extended
QUERY_DEVICE uverb (provided we manage to add the check for the field
being 0).

> > > > Thanks for the link to the presentation.
> > > 
> > > If I recall the presentation is old and had some flaws that were
> > > addressed before things made it into libibverbs..
> > 
> > I have to have a look to this part of libibverbs: I'm not sure
> > the extended QUERY_DEVICE is already implemented.
> 
> The patches turned out to be unnecessary and were dropped, IIRC.
>  

OK.

> > > Thank you for looking at this, it is very important that this scheme
> > > is used properly, and it is very easy to make mistakes. I haven't had
> > > time to review any of these new patches myself.
> 
> > I hope you would find some time to review my latest patchset so that
> > we don't miss a corner case. It's starting to become urgent.
> 
> I have and will, thank you
> 

:)

Thanks.

Regards.
Jason Gunthorpe Jan. 29, 2015, 9:34 p.m. UTC | #11
On Thu, Jan 29, 2015 at 10:14:29PM +0100, Yann Droneaud wrote:
> > > Unfortunately, the userspace don't get the size of the returned data:
> > > it's only a single "write()" syscall after all.
> > 
> > A write syscall that behaves nothing like write() actually should, so
> > I don't see why we can't have
> > 
> > resp_len = write(fd,inout_buf,sizeof(input_len));
> > 
> > Returning 0 from write makes no sense at all.

Okay, yes, I see, the flow is different for the ex versions from the
non-ex versions.

> > In the fullness of your patchset it will maintain the invariant that
> > resp_len <= sizeof(input_len)

> I don't catch your point: the response can be bigger than the request.

Indeed, I forgot the scheme used that indirection buffer with
pointers. My comments on write() result are all wrong. Sorry

> > It was never the intent that the size should be computed from
> > comp_mask. If the size is necessary it must be explicit.
> > 
> > In this instance if the size is not returned then libibverbs will have
> > to zero the entire user buffer before passing it to the kernel,
> > because it has to ensure any tail for the user app is 0'd.
>
> The proposed patch ensure the integrity of the response regarding
> comp_mask: if a bit is set in response's comp_mask that means the
> related fields are presents (and valid).

The patch it is good, but sitting this into the larger framework where
we have libibverbs consumers compiled against arbitrary versions of
the structure creates a broader complexity that someone has to deal
with.

libibverbs needs to ensure the trailer portion of the structure from
the end user is 0'd.

> > Remember for santity we want comp mask bits for things that can't be 0
> 
> For me, it's better if a bit is set in response's comp_mask by the 
> kernel when the kernel have written something in the related fields
> even if the those fields are all 0.

Yes, but my point is that comp mask bit should only be used to protect
fields where 0 is not an acceptable 'do not support' answer.

Primarily we should rely on memset(0) to provide compatibility, and
only when really necessary introduce a comp mask bit.

This is why the size is important, we can't deduce the size from
comp_mask, and without the size we can't know where the kernel stopped
writing and the whole thing becomes a little more fragile.

Zeroing the response buffer before calling into the kernel is not a
great general pattern.

Perhaps having the kernel zero the trailing portion it doesn't write
too is okay, but it still feels like not telling user space the size
is asking for future trouble.

> > Ideally we want to minimize the number of COMP bits because it is a
> > huge burden on the end user to work with them.
> 
> Sure. So I think comp_mask is just an overly complicated way of
> expressing the version and the size of the response.

Yes. But understand it is selected to try and provide a good end user
experience, exposing structure versions or size directly to the user
is very difficult to work with.

Ideally the end user will not see many comp mask bits (ie I would drop
the ODP one) and things will 'just work' as compiled.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 8668b328b7e6..80a1c90f1dbf 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3313,6 +3313,12 @@  int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	if (err)
 		return err;
 
+	if (cmd.comp_mask & (cmd.comp_mask + 1))
+		return -EINVAL;
+
+	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
+		return -EINVAL;
+
 	if (cmd.reserved)
 		return -EINVAL;