Message ID | 063956366559d6919693aabb324bab83d676bc28.1421931555.git.ydroneaud@opteya.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On 22/01/2015 15:28, Yann Droneaud wrote: > This patch ensures the extended QUERY_DEVICE uverbs request's > comp_mask has only known values. If userspace returns unknown > features, -EINVAL will be returned, allowing to probe/discover > which features are currently supported by the kernel. This probing process will be much more cumbersome than it needs to be because userspace will have to call QUERY_DEVICE repetitively with different comp_mask fields until it finds out the exact set of supported bits. > Moreover, it also ensure the requested features set in comp_mask > are sequentially set, not skipping intermediate features, eg. the > "highest" requested feature also request all the "lower" ones. > This way each new feature will have to be stacked on top of the > existing ones: this is especially important for the request and > response data structures where fields are added after the > current ones when expanded to support a new feature. I think it is perfectly acceptable that not all drivers will implement all available features, and so you shouldn't enforce this requirement. Also, it makes the comp_mask nothing more than a wasteful version number between 0 and 63. In the specific case of QUERY_DEVICE you might argue that there isn't any need for input comp_mask, only for output, and then you may enforce the input comp_mask will always be zero. However, you will in any case need to be able to extended the size of the response in the future. > > Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com > Cc: Sagi Grimberg <sagig@mellanox.com> > Cc: Shachar Raindel <raindel@mellanox.com> > Cc: Eli Cohen <eli@mellanox.com> > Cc: Haggai Eran <haggaie@mellanox.com> > Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> > --- > drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > index 8668b328b7e6..80a1c90f1dbf 100644 > --- a/drivers/infiniband/core/uverbs_cmd.c > +++ b/drivers/infiniband/core/uverbs_cmd.c > @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, > if (err) > return err; > > + if (cmd.comp_mask & (cmd.comp_mask + 1)) > + return -EINVAL; > + > + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) > + return -EINVAL; > + > if (cmd.reserved) > return -EINVAL; > > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, (adding linux-api@ for comments: We're introducing a new "uverb" in the InfiniBand subsystem: extended QUERY_DEVICE in v3.19: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6#n209 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6#n224 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5a77abf9a97a http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=860f10a799c8 As you may not already know, InfiniBand subsystem use a write() syscall[1] to issue ioctl()-like operations. Many operations (aka verbs) are available [2], for each there's a query data structures and for some there's a response data structure [3]. As a result to a write() operation, kernel could allocate resources on the task behalf and/or write data back to userspace provided buffers whose pointers were part of buffer passed to write(). I've expressed concern on the way the new extended QUERY_DEVICE was trying to be itself expandable: by silently ignoring shorter buffer, returning truncated data, the interface seems awkward. "Re: [PATCH v2 06/17] IB/core: Add support for extended query device caps" http://mid.gmane.org/1418216676.11111.45.camel@opteya.com "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps" http://mid.gmane.org/1418733236.2779.26.camel@opteya.com http://mid.gmane.org/1418824811.3334.3.camel@dworkin http://mid.gmane.org/1421844612.13543.40.camel@opteya.com I recognize the author's intention to provide a way for userspace to retrieve easily the highest supported information as something desirable. But I believe we must be more strict on the request content and fail for any unrecognized, unsupported, incorrect bits to make safer to extend the interface latter. I've submitted a patchset[4] to address these issues. But, while I'm convinced it the way to go, I'm not able to find how it could be made to satisfy the original author expectations. I hope linux-api@ readers could provide some insight regarding the way we should implement such interface [1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599 [2] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n81 [3] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6 [4] http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com ) Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : > On 22/01/2015 15:28, Yann Droneaud wrote: > > This patch ensures the extended QUERY_DEVICE uverbs request's > > comp_mask has only known values. If userspace returns unknown > > features, -EINVAL will be returned, allowing to probe/discover > > which features are currently supported by the kernel. > > This probing process will be much more cumbersome than it needs to be > because userspace will have to call QUERY_DEVICE repetitively with > different comp_mask fields until it finds out the exact set of supported > bits. > O(log2(N)) Or you had to add a different interface, dedicated to retrieve the exact supported feature mask. > > Moreover, it also ensure the requested features set in comp_mask > > are sequentially set, not skipping intermediate features, eg. the > > "highest" requested feature also request all the "lower" ones. > > This way each new feature will have to be stacked on top of the > > existing ones: this is especially important for the request and > > response data structures where fields are added after the > > current ones when expanded to support a new feature. > > I think it is perfectly acceptable that not all drivers will implement > all available features, and so you shouldn't enforce this requirement. With regard to QUERY_DEVICE: the data structure layout depends on the comp_mask value. So either you propose a way to express multipart data structure (see CMSG or NETLINK), or we have to ensure the data structure is ever-growing, with each new chunck stacked over the existing ones: that's the purpose of : if (cmd.comp_mask & (cmd.comp_mask + 1)) return -EINVAL; > Also, it makes the comp_mask nothing more than a wasteful version number > between 0 and 63. That's what I've already explained earlier in "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps": http://mid.gmane.org/1421844612.13543.40.camel@opteya.com > > In the specific case of QUERY_DEVICE you might argue that there isn't > any need for input comp_mask, only for output, and then you may enforce > the input comp_mask will always be zero. The extended QUERY_DEVICE uverbs as currently merged is using comp_mask from input to choose to report on-demand-paging related value. So it seems it's needed. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 > However, you will in any case need to be able to extended the size of the response in the future. > That's already the case for on demand paging. > > > > Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com > > Cc: Sagi Grimberg <sagig@mellanox.com> > > Cc: Shachar Raindel <raindel@mellanox.com> > > Cc: Eli Cohen <eli@mellanox.com> > > Cc: Haggai Eran <haggaie@mellanox.com> > > Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> > > --- > > drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > > index 8668b328b7e6..80a1c90f1dbf 100644 > > --- a/drivers/infiniband/core/uverbs_cmd.c > > +++ b/drivers/infiniband/core/uverbs_cmd.c > > @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, > > if (err) > > return err; > > > > + if (cmd.comp_mask & (cmd.comp_mask + 1)) > > + return -EINVAL; > > + > > + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) > > + return -EINVAL; > > + If we keep the checks on output buffer size from patch 1, returning -ENOSPC in case of size mismatch, and if we are sure that no bit in input comp_mask will ever trigger a call to a kernel function that can make the uverb fail, the latter check on known value could be dropped, allowing the userspace to set the highest mask (-1): userspace will use -ENOSPC to probe the expected size of the response buffer to match the highest supported comp_mask. But it's going to hurt userspace application not ready to receive -ENOSPC on newer kernel with extended QUERY_DEVICE ABI ... Oops. So in this end, the safest way to ensure userspace is doing the correct thing so that we have backward and forward compatibility is to check for known values in comp_mask, check for response buffer size and ensure that data structure chunk are stacked. The tighter are the checks now, the easier the interface could be extended latter. > > if (cmd.reserved) > > return -EINVAL; > > > > > Regards.
On 26/01/2015 13:17, Yann Droneaud wrote: > ... > Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : >> On 22/01/2015 15:28, Yann Droneaud wrote: >>> This patch ensures the extended QUERY_DEVICE uverbs request's >>> comp_mask has only known values. If userspace returns unknown >>> features, -EINVAL will be returned, allowing to probe/discover >>> which features are currently supported by the kernel. >> >> This probing process will be much more cumbersome than it needs to be >> because userspace will have to call QUERY_DEVICE repetitively with >> different comp_mask fields until it finds out the exact set of supported >> bits. >> > > O(log2(N)) I don't think user space developers will be happy having to do trial and error to determine what features the kernel driver supports. It might be even more then O(log2(N)). If my understanding of comp_mask bits usage is correct it would O(N). But it's not the time complexity I'm worried about, it's the fact that it requires user-space developers to go through hoops in order to get information that can be much more easily exported. > Or you had to add a different interface, dedicated to retrieve the exact > supported feature mask. > >>> Moreover, it also ensure the requested features set in comp_mask >>> are sequentially set, not skipping intermediate features, eg. the >>> "highest" requested feature also request all the "lower" ones. >>> This way each new feature will have to be stacked on top of the >>> existing ones: this is especially important for the request and >>> response data structures where fields are added after the >>> current ones when expanded to support a new feature. >> >> I think it is perfectly acceptable that not all drivers will implement >> all available features, and so you shouldn't enforce this requirement. > > With regard to QUERY_DEVICE: the data structure layout depends on the > comp_mask value. So either you propose a way to express multipart data > structure (see CMSG or NETLINK), or we have to ensure the data structure > is ever-growing, with each new chunck stacked over the existing ones: > that's the purpose of : > > if (cmd.comp_mask & (cmd.comp_mask + 1)) > return -EINVAL; > >> Also, it makes the comp_mask nothing more than a wasteful version number >> between 0 and 63. > > That's what I've already explained earlier in "Re: [PATCH v3 06/17] > IB/core: Add support for extended query device caps": > > http://mid.gmane.org/1421844612.13543.40.camel@opteya.com Yes, you wrote there: > Regarding comp_mask (not for this particular verb): > > It's not clear whether request's comp_mask describe the request or the > response, as such I'm puzzled. > > How would the kernel and the userspace be able to parse the request and > the response if they ignore unknown bits ? > > How would they be able to skip the unrecognized chunk of the request and > response buffer ? > > How one bit in a comp_mask is related to a chunk in the request or > response ? > > It's likely the kernel or userspace would have to skip the remaining > comp_mask's bits after encountering an unknown bit as the size of the > corresponding chunk in request or response would be unknown, making > impossible to locate the corresponding chunk for the next bit set in > comp_mask. Having said that, comp_mask is just a complicated way of > expressing a version, which is turn describe a size (ever growing). It is my understanding that each comp_mask bit marks a set of fields in the command or in the response struct as valid, so the struct format remains the same and the kernel and userspace don't need to make difficult calculation as to where each field is, but you can still pass a high bit set in comp_mask with one of the lower bits cleared. I couldn't find this explicit detail in the mailing list, but I did found a presentation that was presented in OFA International Developer Workshop 2013 [1], that gave an example of of an verb where each comp_mask bit marked a different field as valid. > >> >> In the specific case of QUERY_DEVICE you might argue that there isn't >> any need for input comp_mask, only for output, and then you may enforce >> the input comp_mask will always be zero. > > The extended QUERY_DEVICE uverbs as currently merged is using comp_mask > from input to choose to report on-demand-paging related value. So it > seems it's needed. > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 > >> However, you will in any case need to be able to extended the size of the response in the future. >> > > That's already the case for on demand paging. > >>> >>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com >>> Cc: Sagi Grimberg <sagig@mellanox.com> >>> Cc: Shachar Raindel <raindel@mellanox.com> >>> Cc: Eli Cohen <eli@mellanox.com> >>> Cc: Haggai Eran <haggaie@mellanox.com> >>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> >>> --- >>> drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c >>> index 8668b328b7e6..80a1c90f1dbf 100644 >>> --- a/drivers/infiniband/core/uverbs_cmd.c >>> +++ b/drivers/infiniband/core/uverbs_cmd.c >>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, >>> if (err) >>> return err; >>> >>> + if (cmd.comp_mask & (cmd.comp_mask + 1)) >>> + return -EINVAL; >>> + >>> + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) >>> + return -EINVAL; >>> + > > If we keep the checks on output buffer size from patch 1, returning > -ENOSPC in case of size mismatch, and if we are sure that no bit in > input comp_mask will ever trigger a call to a kernel function that can > make the uverb fail, the latter check on known value could be dropped, > allowing the userspace to set the highest mask (-1): userspace > will use -ENOSPC to probe the expected size of the response buffer > to match the highest supported comp_mask. But it's going to hurt > userspace application not ready to receive -ENOSPC on newer kernel > with extended QUERY_DEVICE ABI ... Oops. > > So in this end, the safest way to ensure userspace is doing the correct > thing so that we have backward and forward compatibility is to check > for known values in comp_mask, check for response buffer size and ensure > that data structure chunk are stacked. > > The tighter are the checks now, the easier the interface could be > extended latter. I understand this position, and I generally agree, but I think that specifically for a verb like QUERY_DEVICE that only reads information from the kernel driver to user-space, there is no harm in the kernel just providing all the information it can fit in the response buffer provided by user-space. Let me explain: newer fields are added to the kernel response struct at the end, together with a new comp_mask bit. Older user-space with newer kernels will simply ask only for the buffer size they care about. The fact that the struct is truncated doesn't affect these programs because the truncated struct is a valid struct that was presented by the kernel in an older version. They may or may not receive a comp_mask bit they don't recognize, but that only tells them that the kernel supports new features they don't know about. Newer user-space with older kernel will give a larger buffer then the kernel can fill. The kernel only fills in the beginning of the user-space buffer, and provides user-space with the comp_mask bits that mark which fields are valid. So user-space can tell that the end of the buffer isn't valid. In my implementation I also left the ending uninitialized, but the kernel can zero it if you think it is important. So I hope you agree this scheme is extendible and allows keeping backward and forward compatibility. If you can think of another scheme that will be more strict with the buffer sizes, but doesn't require user-space to do extra work, I'll be happy to hear about it. Regards, Haggai [1] Extending Verbs API, Tzahi Oved https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit : > On 26/01/2015 13:17, Yann Droneaud wrote: > > ... > > Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : > >> On 22/01/2015 15:28, Yann Droneaud wrote: > >>> This patch ensures the extended QUERY_DEVICE uverbs request's > >>> comp_mask has only known values. If userspace returns unknown > >>> features, -EINVAL will be returned, allowing to probe/discover > >>> which features are currently supported by the kernel. > >> > >> This probing process will be much more cumbersome than it needs to be > >> because userspace will have to call QUERY_DEVICE repetitively with > >> different comp_mask fields until it finds out the exact set of supported > >> bits. > >> > > > > O(log2(N)) > > I don't think user space developers will be happy having to do trial and > error to determine what features the kernel driver supports. It might be > even more then O(log2(N)). If my understanding of comp_mask bits usage is > correct it would O(N). But it's not the time complexity I'm worried about, > it's the fact that it requires user-space developers to go through hoops in > order to get information that can be much more easily exported. > But there's currently *NO* such mean that could give a hint to userspace developer whether one bit of request's comp_mask is currently effective in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW and DESTROY_FLOW: unsupported bits trigger -EINVAL. In QUERY_DEVICE, userspace developer is allowed to set request's comp_mask to -1: it won't hurt it on current kernel, so why not using this value or any other random value ? The program will run: it's "Works For Me". But the same program (either binary or source code) might fail on newer kernel where some bits in comp_mask gain a meaning not supported by the program. > > Or you had to add a different interface, dedicated to retrieve the exact > > supported feature mask. > > > >>> Moreover, it also ensure the requested features set in comp_mask > >>> are sequentially set, not skipping intermediate features, eg. the > >>> "highest" requested feature also request all the "lower" ones. > >>> This way each new feature will have to be stacked on top of the > >>> existing ones: this is especially important for the request and > >>> response data structures where fields are added after the > >>> current ones when expanded to support a new feature. > >> > >> I think it is perfectly acceptable that not all drivers will implement > >> all available features, and so you shouldn't enforce this requirement. > > > > With regard to QUERY_DEVICE: the data structure layout depends on the > > comp_mask value. So either you propose a way to express multipart data > > structure (see CMSG or NETLINK), or we have to ensure the data structure > > is ever-growing, with each new chunck stacked over the existing ones: > > that's the purpose of : > > > > if (cmd.comp_mask & (cmd.comp_mask + 1)) > > return -EINVAL; > > > >> Also, it makes the comp_mask nothing more than a wasteful version number > >> between 0 and 63. > > > > That's what I've already explained earlier in "Re: [PATCH v3 06/17] > > IB/core: Add support for extended query device caps": > > > > http://mid.gmane.org/1421844612.13543.40.camel@opteya.com > > Yes, you wrote there: > > Regarding comp_mask (not for this particular verb): > > > > It's not clear whether request's comp_mask describe the request or the > > response, as such I'm puzzled. > > > > How would the kernel and the userspace be able to parse the request and > > the response if they ignore unknown bits ? > > > > How would they be able to skip the unrecognized chunk of the request and > > response buffer ? > > > > How one bit in a comp_mask is related to a chunk in the request or > > response ? > > > > It's likely the kernel or userspace would have to skip the remaining > > comp_mask's bits after encountering an unknown bit as the size of the > > corresponding chunk in request or response would be unknown, making > > impossible to locate the corresponding chunk for the next bit set in > > comp_mask. Having said that, comp_mask is just a complicated way of > > expressing a version, which is turn describe a size (ever growing). > > It is my understanding that each comp_mask bit marks a set of fields in > the command or in the response struct as valid, so the struct format > remains the same and the kernel and userspace don't need to make > difficult calculation as to where each field is, but you can still pass > a high bit set in comp_mask with one of the lower bits cleared. > How can the struct format remain the same, if some fields are added to implement newer feature ? > I couldn't find this explicit detail in the mailing list, but I did found > a presentation that was presented in OFA International Developer > Workshop 2013 [1], that gave an example of of an verb where each > comp_mask bit marked a different field as valid. > Thanks for the link to the presentation. As I read it the presentation: - in request, comp_mask give hint to the kernel of additional fields in the request. - in response, comp_mask give hint to userspace regarding the presence of additional fields in the response. But commit 860f10a799c8 ("IB/core: Add flags for on demand paging support") on top of commit 5a77abf9a97a ("IB/core: Add support for extended query device caps") is not doing it the expected way as one bit set in request's comp_mask has direct effect on the response's fields. To be conformant with the "Extending Verbs API" presentation, no check should be done on request's comp_mask, and on-demand paging information should be returned to userspace in all case, provided there's enough room in the response buffer. Anyway, allowing userspace to set any "hint" in the request's comp_mask is opening a pandora box: being allowed to set any value, userspace program will use any random value, as it will work with current kernel. But the same program used on newer kernel is going to trigger some behavior unknown to the application or return an error the application is not ready to handle ... breaking any kind of forward compatibility promise. It's likely the kernel will have to use the size of the request to guess the hints in comp_mask are corrects to handle such. In such case, relying only on the size of the request might be enough to detect extended request, without the need for comp_mask. Regarding the answer, if the response buffer is smaller than the size of the extended response, will the kernel keep setting the response's comp_mask with all the bits it supports or will it unset the bits for the fields it wasn't able to fit in the response buffer ? It's likely the kernel will have to use the size of the response buffer to set the response's comp_mask. Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging support") on top of commit 5a77abf9a97a ("IB/core: Add support for extended query device caps") make it possible for the kernel to return truncated response with full comp_mask. Such is going to break silently when one will introduce a data structure with an ABI mismatch between userspace and kernel (for example x86 vs amd64 ... we have some recent exemples). > > > >> > >> In the specific case of QUERY_DEVICE you might argue that there isn't > >> any need for input comp_mask, only for output, and then you may enforce > >> the input comp_mask will always be zero. > > > > The extended QUERY_DEVICE uverbs as currently merged is using comp_mask > > from input to choose to report on-demand-paging related value. So it > > seems it's needed. > > > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 > > > >> However, you will in any case need to be able to extended the size of the response in the future. > >> > > > > That's already the case for on demand paging. > > > >>> > >>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com > >>> Cc: Sagi Grimberg <sagig@mellanox.com> > >>> Cc: Shachar Raindel <raindel@mellanox.com> > >>> Cc: Eli Cohen <eli@mellanox.com> > >>> Cc: Haggai Eran <haggaie@mellanox.com> > >>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> > >>> --- > >>> drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ > >>> 1 file changed, 6 insertions(+) > >>> > >>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > >>> index 8668b328b7e6..80a1c90f1dbf 100644 > >>> --- a/drivers/infiniband/core/uverbs_cmd.c > >>> +++ b/drivers/infiniband/core/uverbs_cmd.c > >>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, > >>> if (err) > >>> return err; > >>> > >>> + if (cmd.comp_mask & (cmd.comp_mask + 1)) > >>> + return -EINVAL; > >>> + > >>> + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) > >>> + return -EINVAL; > >>> + > > > > If we keep the checks on output buffer size from patch 1, returning > > -ENOSPC in case of size mismatch, and if we are sure that no bit in > > input comp_mask will ever trigger a call to a kernel function that can > > make the uverb fail, the latter check on known value could be dropped, > > allowing the userspace to set the highest mask (-1): userspace > > will use -ENOSPC to probe the expected size of the response buffer > > to match the highest supported comp_mask. But it's going to hurt > > userspace application not ready to receive -ENOSPC on newer kernel > > with extended QUERY_DEVICE ABI ... Oops. > > > > So in this end, the safest way to ensure userspace is doing the correct > > thing so that we have backward and forward compatibility is to check > > for known values in comp_mask, check for response buffer size and ensure > > that data structure chunk are stacked. > > > > The tighter are the checks now, the easier the interface could be > > extended latter. > > I understand this position, and I generally agree, but I think that > specifically for a verb like QUERY_DEVICE that only reads information from > the kernel driver to user-space, there is no harm in the kernel just > providing all the information it can fit in the response buffer provided > by user-space. > Returning as much as possible information to userspace is OK, but it's doing it the wrong way. I'm not specifically discussing QUERY_DEVICE, as I'm concerned with every extended uverbs to be added to the kernel. > Let me explain: newer fields are added to the kernel response struct at the > end, together with a new comp_mask bit. > > Older user-space with newer kernels will simply ask only for the buffer > size they care about. The fact that the struct is truncated doesn't affect > these programs because the truncated struct is a valid struct that was > presented by the kernel in an older version. > You cannot ensure the userspace being correct when specifying a request. It's a wrong assumption (perhaps not so wrong in the case of InfiniBand/RDMA, as most userspace program are using libibverbs, librdmacm and provider's libraries). That's why we must not be liberal with the request and check any bit of it for being something valid, so that erroneous values are catch now, ensuring userspace is not trying to request things we don't know yet and it's not aware of it too. > They may or may not receive a comp_mask bit they don't recognize, but that > only tells them that the kernel supports new features they don't know about. > > Newer user-space with older kernel will give a larger buffer then the > kernel can fill. The kernel only fills in the beginning of the user-space > buffer, and provides user-space with the comp_mask bits that mark which > fields are valid. So user-space can tell that the end of the buffer isn't > valid. > But older userspace programs with newer kernel might break, as request's comp_mask was allowed to contains invalid values in older kernel. > In my implementation I also left the ending uninitialized, but the > kernel can zero it if you think it is important. > Why not, but it's a minor point. > So I hope you agree this scheme is extendible and allows keeping backward > and forward compatibility. If you can think of another scheme that will be > more strict with the buffer sizes, but doesn't require user-space to do > extra work, I'll be happy to hear about it. > I'm sorry about that but I cannot agree with the current scheme. > [1] Extending Verbs API, Tzahi Oved > https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf Regards.
On 28/01/2015 15:19, Yann Droneaud wrote: > Hi, > > Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit : >> On 26/01/2015 13:17, Yann Droneaud wrote: >>> ... >>> Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : >>>> On 22/01/2015 15:28, Yann Droneaud wrote: >>>>> This patch ensures the extended QUERY_DEVICE uverbs request's >>>>> comp_mask has only known values. If userspace returns unknown >>>>> features, -EINVAL will be returned, allowing to probe/discover >>>>> which features are currently supported by the kernel. >>>> >>>> This probing process will be much more cumbersome than it needs to be >>>> because userspace will have to call QUERY_DEVICE repetitively with >>>> different comp_mask fields until it finds out the exact set of supported >>>> bits. >>>> >>> >>> O(log2(N)) >> >> I don't think user space developers will be happy having to do trial and >> error to determine what features the kernel driver supports. It might be >> even more then O(log2(N)). If my understanding of comp_mask bits usage is >> correct it would O(N). But it's not the time complexity I'm worried about, >> it's the fact that it requires user-space developers to go through hoops in >> order to get information that can be much more easily exported. >> > > But there's currently *NO* such mean that could give a hint to userspace > developer whether one bit of request's comp_mask is currently effective > in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW > and DESTROY_FLOW: unsupported bits trigger -EINVAL. > > In QUERY_DEVICE, userspace developer is allowed to set request's > comp_mask to -1: it won't hurt it on current kernel, so why not using > this value or any other random value ? The program will run: it's "Works > For Me". > > But the same program (either binary or source code) might fail on > newer kernel where some bits in comp_mask gain a meaning not supported > by the program. Why don't we make the command comp_mask in QUERY_DEVICE zero in both versions. The behavior of the command with comp_mask = 0 will be to return the maximum amount of data that fits in the response buffer. The kernel will return -EINVAL if the input command comp_mask is not zero in the current version. If in the future we want to alter the behavior or add more input fields to QUERY_DEVICE, we would use new bits. >>> Or you had to add a different interface, dedicated to retrieve the exact >>> supported feature mask. >>> >>>>> Moreover, it also ensure the requested features set in comp_mask >>>>> are sequentially set, not skipping intermediate features, eg. the >>>>> "highest" requested feature also request all the "lower" ones. >>>>> This way each new feature will have to be stacked on top of the >>>>> existing ones: this is especially important for the request and >>>>> response data structures where fields are added after the >>>>> current ones when expanded to support a new feature. >>>> >>>> I think it is perfectly acceptable that not all drivers will implement >>>> all available features, and so you shouldn't enforce this requirement. >>> >>> With regard to QUERY_DEVICE: the data structure layout depends on the >>> comp_mask value. So either you propose a way to express multipart data >>> structure (see CMSG or NETLINK), or we have to ensure the data structure >>> is ever-growing, with each new chunck stacked over the existing ones: >>> that's the purpose of : >>> >>> if (cmd.comp_mask & (cmd.comp_mask + 1)) >>> return -EINVAL; >>> >>>> Also, it makes the comp_mask nothing more than a wasteful version number >>>> between 0 and 63. >>> >>> That's what I've already explained earlier in "Re: [PATCH v3 06/17] >>> IB/core: Add support for extended query device caps": >>> >>> http://mid.gmane.org/1421844612.13543.40.camel@opteya.com >> >> Yes, you wrote there: >>> Regarding comp_mask (not for this particular verb): >>> >>> It's not clear whether request's comp_mask describe the request or the >>> response, as such I'm puzzled. >>> >>> How would the kernel and the userspace be able to parse the request and >>> the response if they ignore unknown bits ? >>> >>> How would they be able to skip the unrecognized chunk of the request and >>> response buffer ? >>> >>> How one bit in a comp_mask is related to a chunk in the request or >>> response ? >>> >>> It's likely the kernel or userspace would have to skip the remaining >>> comp_mask's bits after encountering an unknown bit as the size of the >>> corresponding chunk in request or response would be unknown, making >>> impossible to locate the corresponding chunk for the next bit set in >>> comp_mask. Having said that, comp_mask is just a complicated way of >>> expressing a version, which is turn describe a size (ever growing). >> >> It is my understanding that each comp_mask bit marks a set of fields in >> the command or in the response struct as valid, so the struct format >> remains the same and the kernel and userspace don't need to make >> difficult calculation as to where each field is, but you can still pass >> a high bit set in comp_mask with one of the lower bits cleared. >> > > How can the struct format remain the same, if some fields are added > to implement newer feature ? I meant that the binary format for an older version is the prefix of the binary format of the newer version. >> I couldn't find this explicit detail in the mailing list, but I did found >> a presentation that was presented in OFA International Developer >> Workshop 2013 [1], that gave an example of of an verb where each >> comp_mask bit marked a different field as valid. >> > > Thanks for the link to the presentation. > > As I read it the presentation: > > - in request, comp_mask give hint to the kernel of additional fields in > the request. > > - in response, comp_mask give hint to userspace regarding the presence > of additional fields in the response. I'm not sure that in request you can regard the comp_mask as a hint. I agree that you need to enforce it in general and reject unknown bits there (although I thought QUERY_DEVICE would be an exception). > But commit 860f10a799c8 ("IB/core: Add flags for on demand paging > support") on top of commit 5a77abf9a97a ("IB/core: Add support for > extended query device caps") is not doing it the expected way > as one bit set in request's comp_mask has direct effect on > the response's fields. > > To be conformant with the "Extending Verbs API" presentation, > no check should be done on request's comp_mask, and on-demand paging > information should be returned to userspace in all case, provided > there's enough room in the response buffer. > > Anyway, allowing userspace to set any "hint" in the request's comp_mask > is opening a pandora box: being allowed to set any value, userspace > program will use any random value, as it will work with current kernel. > > But the same program used on newer kernel is going to trigger some > behavior unknown to the application or return an error the application > is not ready to handle ... breaking any kind of forward compatibility > promise. I don't think that the application should handle unknown response bits as an error. As you wrote, I see these as hints about more data in the response that the application is free to ignore. > It's likely the kernel will have to use the size of the request to > guess the hints in comp_mask are corrects to handle such. In such case, > relying only on the size of the request might be enough to detect > extended request, without the need for comp_mask. > > Regarding the answer, if the response buffer is smaller than the size > of the extended response, will the kernel keep setting the response's > comp_mask with all the bits it supports or will it unset the bits for > the fields it wasn't able to fit in the response buffer ? > > It's likely the kernel will have to use the size of the response buffer > to set the response's comp_mask. > > Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging > support") on top of commit 5a77abf9a97a ("IB/core: Add support for > extended query device caps") make it possible for the kernel to return > truncated response with full comp_mask. Such is going to break silently > when one will introduce a data structure with an ABI mismatch between > userspace and kernel (for example x86 vs amd64 ... we have some recent > exemples). As I said, I think unknown bits in the comp_mask are hints about unknown fields in the response that can be ignored by the application. However, I can agree to having the kernel checking the response buffer size as you wrote, and only setting the valid comp_mask bits. > >>> >>>> >>>> In the specific case of QUERY_DEVICE you might argue that there isn't >>>> any need for input comp_mask, only for output, and then you may enforce >>>> the input comp_mask will always be zero. >>> >>> The extended QUERY_DEVICE uverbs as currently merged is using comp_mask >>> from input to choose to report on-demand-paging related value. So it >>> seems it's needed. >>> >>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 >>> >>>> However, you will in any case need to be able to extended the size of the response in the future. >>>> >>> >>> That's already the case for on demand paging. >>> >>>>> >>>>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com >>>>> Cc: Sagi Grimberg <sagig@mellanox.com> >>>>> Cc: Shachar Raindel <raindel@mellanox.com> >>>>> Cc: Eli Cohen <eli@mellanox.com> >>>>> Cc: Haggai Eran <haggaie@mellanox.com> >>>>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> >>>>> --- >>>>> drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ >>>>> 1 file changed, 6 insertions(+) >>>>> >>>>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c >>>>> index 8668b328b7e6..80a1c90f1dbf 100644 >>>>> --- a/drivers/infiniband/core/uverbs_cmd.c >>>>> +++ b/drivers/infiniband/core/uverbs_cmd.c >>>>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, >>>>> if (err) >>>>> return err; >>>>> >>>>> + if (cmd.comp_mask & (cmd.comp_mask + 1)) >>>>> + return -EINVAL; >>>>> + >>>>> + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) >>>>> + return -EINVAL; >>>>> + >>> >>> If we keep the checks on output buffer size from patch 1, returning >>> -ENOSPC in case of size mismatch, and if we are sure that no bit in >>> input comp_mask will ever trigger a call to a kernel function that can >>> make the uverb fail, the latter check on known value could be dropped, >>> allowing the userspace to set the highest mask (-1): userspace >>> will use -ENOSPC to probe the expected size of the response buffer >>> to match the highest supported comp_mask. But it's going to hurt >>> userspace application not ready to receive -ENOSPC on newer kernel >>> with extended QUERY_DEVICE ABI ... Oops. >>> >>> So in this end, the safest way to ensure userspace is doing the correct >>> thing so that we have backward and forward compatibility is to check >>> for known values in comp_mask, check for response buffer size and ensure >>> that data structure chunk are stacked. >>> >>> The tighter are the checks now, the easier the interface could be >>> extended latter. >> >> I understand this position, and I generally agree, but I think that >> specifically for a verb like QUERY_DEVICE that only reads information from >> the kernel driver to user-space, there is no harm in the kernel just >> providing all the information it can fit in the response buffer provided >> by user-space. >> > > Returning as much as possible information to userspace is OK, > but it's doing it the wrong way. > > I'm not specifically discussing QUERY_DEVICE, as I'm concerned with > every extended uverbs to be added to the kernel. > >> Let me explain: newer fields are added to the kernel response struct at the >> end, together with a new comp_mask bit. >> >> Older user-space with newer kernels will simply ask only for the buffer >> size they care about. The fact that the struct is truncated doesn't affect >> these programs because the truncated struct is a valid struct that was >> presented by the kernel in an older version. >> > > You cannot ensure the userspace being correct when specifying a request. > It's a wrong assumption (perhaps not so wrong in the case of > InfiniBand/RDMA, as most userspace program are using libibverbs, > librdmacm and provider's libraries). > > That's why we must not be liberal with the request and check any bit of > it for being something valid, so that erroneous values are catch now, > ensuring userspace is not trying to request things we don't know yet > and it's not aware of it too. Does the solution I proposed above satisfy this requirement? - The kernel validates input size == command struct size and cmd.comp_mask == 0. - The kernel fills as much information as it can fit in the buffer provided by userspace. - The kernel marks which fields it was able to fill in the response's comp_mask. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, Le mercredi 28 janvier 2015 à 17:40 +0200, Haggai Eran a écrit : > On 28/01/2015 15:19, Yann Droneaud wrote: > > Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit : > >> On 26/01/2015 13:17, Yann Droneaud wrote: > >>> ... > >>> Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : > >>>> On 22/01/2015 15:28, Yann Droneaud wrote: > >>>>> This patch ensures the extended QUERY_DEVICE uverbs request's > >>>>> comp_mask has only known values. If userspace returns unknown > >>>>> features, -EINVAL will be returned, allowing to probe/discover > >>>>> which features are currently supported by the kernel. > >>>> > >>>> This probing process will be much more cumbersome than it needs to be > >>>> because userspace will have to call QUERY_DEVICE repetitively with > >>>> different comp_mask fields until it finds out the exact set of supported > >>>> bits. > >>>> > >>> > >>> O(log2(N)) > >> > >> I don't think user space developers will be happy having to do trial and > >> error to determine what features the kernel driver supports. It might be > >> even more then O(log2(N)). If my understanding of comp_mask bits usage is > >> correct it would O(N). But it's not the time complexity I'm worried about, > >> it's the fact that it requires user-space developers to go through hoops in > >> order to get information that can be much more easily exported. > >> > > > > But there's currently *NO* such mean that could give a hint to userspace > > developer whether one bit of request's comp_mask is currently effective > > in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW > > and DESTROY_FLOW: unsupported bits trigger -EINVAL. > > > > In QUERY_DEVICE, userspace developer is allowed to set request's > > comp_mask to -1: it won't hurt it on current kernel, so why not using > > this value or any other random value ? The program will run: it's "Works > > For Me". > > > > But the same program (either binary or source code) might fail on > > newer kernel where some bits in comp_mask gain a meaning not supported > > by the program. > > Why don't we make the command comp_mask in QUERY_DEVICE zero in both > versions. The behavior of the command with comp_mask = 0 will be to > return the maximum amount of data that fits in the response buffer. The > kernel will return -EINVAL if the input command comp_mask is not zero in > the current version. > Yes, that's exactly what I wanted to do. > If in the future we want to alter the behavior or add more input fields > to QUERY_DEVICE, we would use new bits. > Yes. > >>> Or you had to add a different interface, dedicated to retrieve the exact > >>> supported feature mask. > >>> > >>>>> Moreover, it also ensure the requested features set in comp_mask > >>>>> are sequentially set, not skipping intermediate features, eg. the > >>>>> "highest" requested feature also request all the "lower" ones. > >>>>> This way each new feature will have to be stacked on top of the > >>>>> existing ones: this is especially important for the request and > >>>>> response data structures where fields are added after the > >>>>> current ones when expanded to support a new feature. > >>>> > >>>> I think it is perfectly acceptable that not all drivers will implement > >>>> all available features, and so you shouldn't enforce this requirement. > >>> > >>> With regard to QUERY_DEVICE: the data structure layout depends on the > >>> comp_mask value. So either you propose a way to express multipart data > >>> structure (see CMSG or NETLINK), or we have to ensure the data structure > >>> is ever-growing, with each new chunck stacked over the existing ones: > >>> that's the purpose of : > >>> > >>> if (cmd.comp_mask & (cmd.comp_mask + 1)) > >>> return -EINVAL; > >>> > >>>> Also, it makes the comp_mask nothing more than a wasteful version number > >>>> between 0 and 63. > >>> > >>> That's what I've already explained earlier in "Re: [PATCH v3 06/17] > >>> IB/core: Add support for extended query device caps": > >>> > >>> http://mid.gmane.org/1421844612.13543.40.camel@opteya.com > >> > >> Yes, you wrote there: > >>> Regarding comp_mask (not for this particular verb): > >>> > >>> It's not clear whether request's comp_mask describe the request or the > >>> response, as such I'm puzzled. > >>> > >>> How would the kernel and the userspace be able to parse the request and > >>> the response if they ignore unknown bits ? > >>> > >>> How would they be able to skip the unrecognized chunk of the request and > >>> response buffer ? > >>> > >>> How one bit in a comp_mask is related to a chunk in the request or > >>> response ? > >>> > >>> It's likely the kernel or userspace would have to skip the remaining > >>> comp_mask's bits after encountering an unknown bit as the size of the > >>> corresponding chunk in request or response would be unknown, making > >>> impossible to locate the corresponding chunk for the next bit set in > >>> comp_mask. Having said that, comp_mask is just a complicated way of > >>> expressing a version, which is turn describe a size (ever growing). > >> > >> It is my understanding that each comp_mask bit marks a set of fields in > >> the command or in the response struct as valid, so the struct format > >> remains the same and the kernel and userspace don't need to make > >> difficult calculation as to where each field is, but you can still pass > >> a high bit set in comp_mask with one of the lower bits cleared. > >> > > > > How can the struct format remain the same, if some fields are added > > to implement newer feature ? > > I meant that the binary format for an older version is the prefix of the > binary format of the newer version. > OK. > >> I couldn't find this explicit detail in the mailing list, but I did found > >> a presentation that was presented in OFA International Developer > >> Workshop 2013 [1], that gave an example of of an verb where each > >> comp_mask bit marked a different field as valid. > >> > > > > Thanks for the link to the presentation. > > > > As I read it the presentation: > > > > - in request, comp_mask give hint to the kernel of additional fields in > > the request. > > > > - in response, comp_mask give hint to userspace regarding the presence > > of additional fields in the response. > > I'm not sure that in request you can regard the comp_mask as a hint. It's a hint because the kernel also have to check the size match: if userspace say: "hey I've given you more fields" but the request size is shorter than the kernel expect for such fields, the kernel must return an error so that userspace is taught to fix its requests. > I agree that you need to enforce it in general and reject unknown bits > there (although I thought QUERY_DEVICE would be an exception). > I think it's better to stick to one common behavior so that every extended uverbs behave the same way. > > But commit 860f10a799c8 ("IB/core: Add flags for on demand paging > > support") on top of commit 5a77abf9a97a ("IB/core: Add support for > > extended query device caps") is not doing it the expected way > > as one bit set in request's comp_mask has direct effect on > > the response's fields. > > > > To be conformant with the "Extending Verbs API" presentation, > > no check should be done on request's comp_mask, and on-demand paging > > information should be returned to userspace in all case, provided > > there's enough room in the response buffer. > > > > Anyway, allowing userspace to set any "hint" in the request's comp_mask > > is opening a pandora box: being allowed to set any value, userspace > > program will use any random value, as it will work with current kernel. > > > > But the same program used on newer kernel is going to trigger some > > behavior unknown to the application or return an error the application > > is not ready to handle ... breaking any kind of forward compatibility > > promise. > > I don't think that the application should handle unknown response bits > as an error. As you wrote, I see these as hints about more data in the > response that the application is free to ignore. > I tried to explain the issue regarding older userspace setting random bits in request's comp_mask: it older kernel allows such, the same program will likely have issue with newer kernel where the request's comp_mask bits have a meaning now: the program will either face unknown behavior: the kernel does something the userspace was not expecting, or the kernel refuse to do it because there's not enough information in the request to fullfil it. That's why we must catch now unknown value to prevent userspace to use it now. If unknown value are not allowed, userspace program won't use it and then it could run asis on newer kernel. > > It's likely the kernel will have to use the size of the request to > > guess the hints in comp_mask are corrects to handle such. In such case, > > relying only on the size of the request might be enough to detect > > extended request, without the need for comp_mask. > > > > Regarding the answer, if the response buffer is smaller than the size > > of the extended response, will the kernel keep setting the response's > > comp_mask with all the bits it supports or will it unset the bits for > > the fields it wasn't able to fit in the response buffer ? > > > > It's likely the kernel will have to use the size of the response buffer > > to set the response's comp_mask. > > > > Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging > > support") on top of commit 5a77abf9a97a ("IB/core: Add support for > > extended query device caps") make it possible for the kernel to return > > truncated response with full comp_mask. Such is going to break silently > > when one will introduce a data structure with an ABI mismatch between > > userspace and kernel (for example x86 vs amd64 ... we have some recent > > exemples). > > As I said, I think unknown bits in the comp_mask are hints about unknown > fields in the response that can be ignored by the application. However, > I can agree to having the kernel checking the response buffer size as > you wrote, and only setting the valid comp_mask bits. > OK. > > > >>> > >>>> > >>>> In the specific case of QUERY_DEVICE you might argue that there isn't > >>>> any need for input comp_mask, only for output, and then you may enforce > >>>> the input comp_mask will always be zero. > >>> > >>> The extended QUERY_DEVICE uverbs as currently merged is using comp_mask > >>> from input to choose to report on-demand-paging related value. So it > >>> seems it's needed. > >>> > >>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 > >>> > >>>> However, you will in any case need to be able to extended the size of the response in the future. > >>>> > >>> > >>> That's already the case for on demand paging. > >>> > >>>>> > >>>>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com > >>>>> Cc: Sagi Grimberg <sagig@mellanox.com> > >>>>> Cc: Shachar Raindel <raindel@mellanox.com> > >>>>> Cc: Eli Cohen <eli@mellanox.com> > >>>>> Cc: Haggai Eran <haggaie@mellanox.com> > >>>>> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> > >>>>> --- > >>>>> drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ > >>>>> 1 file changed, 6 insertions(+) > >>>>> > >>>>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > >>>>> index 8668b328b7e6..80a1c90f1dbf 100644 > >>>>> --- a/drivers/infiniband/core/uverbs_cmd.c > >>>>> +++ b/drivers/infiniband/core/uverbs_cmd.c > >>>>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, > >>>>> if (err) > >>>>> return err; > >>>>> > >>>>> + if (cmd.comp_mask & (cmd.comp_mask + 1)) > >>>>> + return -EINVAL; > >>>>> + > >>>>> + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) > >>>>> + return -EINVAL; > >>>>> + > >>> > >>> If we keep the checks on output buffer size from patch 1, returning > >>> -ENOSPC in case of size mismatch, and if we are sure that no bit in > >>> input comp_mask will ever trigger a call to a kernel function that can > >>> make the uverb fail, the latter check on known value could be dropped, > >>> allowing the userspace to set the highest mask (-1): userspace > >>> will use -ENOSPC to probe the expected size of the response buffer > >>> to match the highest supported comp_mask. But it's going to hurt > >>> userspace application not ready to receive -ENOSPC on newer kernel > >>> with extended QUERY_DEVICE ABI ... Oops. > >>> > >>> So in this end, the safest way to ensure userspace is doing the correct > >>> thing so that we have backward and forward compatibility is to check > >>> for known values in comp_mask, check for response buffer size and ensure > >>> that data structure chunk are stacked. > >>> > >>> The tighter are the checks now, the easier the interface could be > >>> extended latter. > >> > >> I understand this position, and I generally agree, but I think that > >> specifically for a verb like QUERY_DEVICE that only reads information from > >> the kernel driver to user-space, there is no harm in the kernel just > >> providing all the information it can fit in the response buffer provided > >> by user-space. > >> > > > > Returning as much as possible information to userspace is OK, > > but it's doing it the wrong way. > > > > I'm not specifically discussing QUERY_DEVICE, as I'm concerned with > > every extended uverbs to be added to the kernel. > > > >> Let me explain: newer fields are added to the kernel response struct at the > >> end, together with a new comp_mask bit. > >> > >> Older user-space with newer kernels will simply ask only for the buffer > >> size they care about. The fact that the struct is truncated doesn't affect > >> these programs because the truncated struct is a valid struct that was > >> presented by the kernel in an older version. > >> > > > > You cannot ensure the userspace being correct when specifying a request. > > It's a wrong assumption (perhaps not so wrong in the case of > > InfiniBand/RDMA, as most userspace program are using libibverbs, > > librdmacm and provider's libraries). > > > > That's why we must not be liberal with the request and check any bit of > > it for being something valid, so that erroneous values are catch now, > > ensuring userspace is not trying to request things we don't know yet > > and it's not aware of it too. > > Does the solution I proposed above satisfy this requirement? > - The kernel validates input size == command struct size and > cmd.comp_mask == 0. > - The kernel fills as much information as it can fit in the buffer > provided by userspace. > - The kernel marks which fields it was able to fill in the response's > comp_mask. > Yes. This scheme follow the "Extending Verbs API" presention while ensuring the command can be extended regarding to good practice one can find in http://blog.ffwll.ch/2013/11/botching-up-ioctls.html I'm going to submit the patches to have this behavior. Regards.
On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote: > But the same program (either binary or source code) might fail on > newer kernel where some bits in comp_mask gain a meaning not supported > by the program. To clarify some of this: The intention of the comp_mask scheme was to define a growing structure. The invariant is: A bigger structure allways has all smaller structures (past versions) as a prefix. So the size alone is enough for user space to know what is going on. userspace only processes structure members up to the size returned by the kernel, and there is only one structure layout. The purpose of the output comp_mask is to allow drivers to declare they do not support the new structure members, and comp_mask bits should only be used with new structure members do not have a natural 'null' value. Further, we need to distinguish cases where additional data is mandatory or optional. query_device is clearly optional, the only purpose the input comp mask serves is to reduce expensive work in the driver by indicating that some result bits are not needed. It is perfectly OK for the kernel to ignore the input comp mask, and OK for userspace to typically request all bits. (indeed, I think this is a pretty silly optimization myself, and the original patch that motivated this was restructured so it is not needed) Other verbs must be more strict, if one side does not understand the comp mask then the verb must fail in some way. Presumably the valid comp mask values are inferred in some way (eg query_device says the extended function is supported) Everything should fit in one of those two general cases.. > Thanks for the link to the presentation. If I recall the presentation is old and had some flaws that were addressed before things made it into libibverbs.. Thank you for looking at this, it is very important that this scheme is used properly, and it is very easy to make mistakes. I haven't had time to review any of these new patches myself. Regards, Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, Le jeudi 29 janvier 2015 à 11:09 -0700, Jason Gunthorpe a écrit : > On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote: > > > But the same program (either binary or source code) might fail on > > newer kernel where some bits in comp_mask gain a meaning not supported > > by the program. > > To clarify some of this: > > The intention of the comp_mask scheme was to define a growing > structure. > > The invariant is: A bigger structure allways has all smaller > structures (past versions) as a prefix. > > So the size alone is enough for user space to know what is going > on. userspace only processes structure members up to the size returned > by the kernel, and there is only one structure layout. > Unfortunately, the userspace don't get the size of the returned data: it's only a single "write()" syscall after all. So the kernel is left with the comp_mask in the response to express the returned size. > The purpose of the output comp_mask is to allow drivers to declare > they do not support the new structure members, and comp_mask bits > should only be used with new structure members do not have a natural > 'null' value. > It's not (yet) about drivers as the output's comp_mask (in the response), is set by uverbs layer. Do you think we have to enforce a 1 to 1 relation between the request comp_mask's bits and the response comp_mask's bits ? That would not fit with my understanding of "Extending Verbs API" presentation [1]: request's comp_mask describe the fields present in the request and response's comp_mask describe the fields present in the response. > Further, we need to distinguish cases where additional data is > mandatory or optional. > > query_device is clearly optional, the only purpose the input comp mask > serves is to reduce expensive work in the driver by indicating that > some result bits are not needed. That's not how I've understand the purpose of the request's comp_mask after reading the presentation: request's comp_mask describe the content of the request. Then that additional content can trigger the presence of additional fields in the response if needed. > It is perfectly OK for the kernel to > ignore the input comp mask, and OK for userspace to typically request > all bits. (indeed, I think this is a pretty silly optimization myself, > and the original patch that motivated this was restructured so it is > not needed) > It's not at all OK to ignore request's comp_mask: it has to be checked to find if there's a feature requested the kernel cannot fullfil, and any unknown bit is a possible feature request. So the kernel has to reject the request as a whole as it don't know how to process it. As we don't know yet how we would extend the extended QUERY_DEVICE uverbs, the kernel have to refuse any random value. http://blog.ffwll.ch/2013/11/botching-up-ioctls.html > Other verbs must be more strict, if one side does not understand the > comp mask then the verb must fail in some way. Presumably the valid > comp mask values are inferred in some way (eg query_device says the > extended function is supported) > > Everything should fit in one of those two general cases.. I believe every interface should return an error for any unknown value (every unused bits of a data structure), that is, there's only one case. > > > Thanks for the link to the presentation. > > If I recall the presentation is old and had some flaws that were > addressed before things made it into libibverbs.. > I have to have a look to this part of libibverbs: I'm not sure the extended QUERY_DEVICE is already implemented. > Thank you for looking at this, it is very important that this scheme > is used properly, and it is very easy to make mistakes. I haven't had > time to review any of these new patches myself. > I hope you would find some time to review my latest patchset so that we don't miss a corner case. It's starting to become urgent. Regards.
On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote: > Unfortunately, the userspace don't get the size of the returned data: > it's only a single "write()" syscall after all. A write syscall that behaves nothing like write() actually should, so I don't see why we can't have resp_len = write(fd,inout_buf,sizeof(input_len)); Returning 0 from write makes no sense at all. In the fullness of your patchset it will maintain the invariant that resp_len <= sizeof(input_len) Which seems OK to me considering what we have to work with, and a significantly better choice than 0. > So the kernel is left with the comp_mask in the response to express > the returned size. It was never the intent that the size should be computed from comp_mask. If the size is necessary it must be explicit. In this instance if the size is not returned then libibverbs will have to zero the entire user buffer before passing it to the kernel, because it has to ensure any tail for the user app is 0'd. Remember for santity we want comp mask bits for things that can't be 0 and we want 0 for things that are not set. struct ib_query_device_ex res; ibv_query_device_ex(..,res,sizeof(res)); printf("%u",res.foo_cap); // 0 if unsupported is OK if (res.comp_mask & COMP_BAR) printf("%u",res.bar_thingy); // 0 has meaning, must use COMP_BAR Ideally we want to minimize the number of COMP bits because it is a huge burden on the end user to work with them. > > The purpose of the output comp_mask is to allow drivers to declare > > they do not support the new structure members, and comp_mask bits > > should only be used with new structure members do not have a natural > > 'null' value. > It's not (yet) about drivers as the output's comp_mask (in the > response), is set by uverbs layer. > > Do you think we have to enforce a 1 to 1 relation between the request > comp_mask's bits and the response comp_mask's bits ? In the query context I would be happy enough to just treat the in bound buffer as uninitialized buffer space, but certainly generally speaking the comp_mask+size should refer to the structure - so input/output are not directly related. > > Further, we need to distinguish cases where additional data is > > mandatory or optional. > > > > query_device is clearly optional, the only purpose the input comp mask > > serves is to reduce expensive work in the driver by indicating that > > some result bits are not needed. > > That's not how I've understand the purpose of the request's comp_mask > after reading the presentation: request's comp_mask describe the content > of the request. Then that additional content can trigger the presence > of additional fields in the response if needed. Agreed - what I described was an abuse that some early Mellanox patches for a query_ex included and it was just a shortcut to avoid defining a dedicated input structure. It seems that same scheme was copied into these new patches. > > It is perfectly OK for the kernel to > > ignore the input comp mask, and OK for userspace to typically request > > all bits. (indeed, I think this is a pretty silly optimization myself, > > and the original patch that motivated this was restructured so it is > > not needed) > > > > It's not at all OK to ignore request's comp_mask: it has to be checked > to find if there's a feature requested the kernel cannot fullfil, and > any unknown bit is a possible feature request. So the kernel has to > reject the request as a whole as it don't know how to process it. In the context of the above scheme the input structure was just this: struct query_input { u64 requested_output; }; ie it wasn't actually a comp_mask, it just overlapped the comp_mask bytes on output. Such a use was never explicitly documented, and IIRC was never actually included in libibverbs. Unless someone has a strong reason we need to do this I am inclined to think that your interpretation is the better one, and we could always add a requested_output to the input someday if it became urgent. In any event, you are right, we can't ingore the input bytes today and expect to give them meaning tomorrow. > As we don't know yet how we would extend the extended QUERY_DEVICE > uverbs, the kernel have to refuse any random value. > > http://blog.ffwll.ch/2013/11/botching-up-ioctls.html or the kernel treats QUERY_DEVICE as an output only function and never inspects the in/out buffer at all. > > > Thanks for the link to the presentation. > > > > If I recall the presentation is old and had some flaws that were > > addressed before things made it into libibverbs.. > > I have to have a look to this part of libibverbs: I'm not sure > the extended QUERY_DEVICE is already implemented. The patches turned out to be unnecessary and were dropped, IIRC. > > Thank you for looking at this, it is very important that this scheme > > is used properly, and it is very easy to make mistakes. I haven't had > > time to review any of these new patches myself. > I hope you would find some time to review my latest patchset so that > we don't miss a corner case. It's starting to become urgent. I have and will, thank you Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le jeudi 29 janvier 2015 à 12:14 -0700, Jason Gunthorpe a écrit : > On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote: > > > Unfortunately, the userspace don't get the size of the returned data: > > it's only a single "write()" syscall after all. > > A write syscall that behaves nothing like write() actually should, so > I don't see why we can't have > > resp_len = write(fd,inout_buf,sizeof(input_len)); > > Returning 0 from write makes no sense at all. > 0 is not the result of the write() syscall, as for extended uverbs I've ensure the return value of the write() syscall was the size it was given. See http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599 705 err = uverbs_ex_cmd_table[command](file, 706 &ucore, 707 &uhw); 708 709 if (err) 710 return err; 711 712 return written_count; See commit f21519b23c1 ("IB/core: extended command: an improved infrastructure for uverbs commands") http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f21519b23c1b6fa25366be4114ccf7fcf1c190f9 > In the fullness of your patchset it will maintain the invariant that > resp_len <= sizeof(input_len) > I don't catch your point: the response can be bigger than the request. > Which seems OK to me considering what we have to work with, and a > significantly better choice than 0. > > > So the kernel is left with the comp_mask in the response to express > > the returned size. > > It was never the intent that the size should be computed from > comp_mask. If the size is necessary it must be explicit. > > In this instance if the size is not returned then libibverbs will have > to zero the entire user buffer before passing it to the kernel, > because it has to ensure any tail for the user app is 0'd. > The proposed patch ensure the integrity of the response regarding comp_mask: if a bit is set in response's comp_mask that means the related fields are presents (and valid). So before parsing the response fields, userspace have to check response's comp_mask: fields access must be protected by correct check on comp_mask ... but it might be useful for the userspace developer to clear the response buffer just in case he/she decided to be lazy with the check. > Remember for santity we want comp mask bits for things that can't be 0 For me, it's better if a bit is set in response's comp_mask by the kernel when the kernel have written something in the related fields even if the those fields are all 0. > and we want 0 for things that are not set. > > struct ib_query_device_ex res; > ibv_query_device_ex(..,res,sizeof(res)); > > printf("%u",res.foo_cap); // 0 if unsupported is OK > if (res.comp_mask & COMP_BAR) > printf("%u",res.bar_thingy); // 0 has meaning, must use COMP_BAR > > Ideally we want to minimize the number of COMP bits because it is a > huge burden on the end user to work with them. > Sure. So I think comp_mask is just an overly complicated way of expressing the version and the size of the response. > > > The purpose of the output comp_mask is to allow drivers to declare > > > they do not support the new structure members, and comp_mask bits > > > should only be used with new structure members do not have a natural > > > 'null' value. > > > It's not (yet) about drivers as the output's comp_mask (in the > > response), is set by uverbs layer. > > > > Do you think we have to enforce a 1 to 1 relation between the request > > comp_mask's bits and the response comp_mask's bits ? > > In the query context I would be happy enough to just treat the in > bound buffer as uninitialized buffer space, but certainly generally > speaking the comp_mask+size should refer to the structure - so > input/output are not directly related. > OK. > > > Further, we need to distinguish cases where additional data is > > > mandatory or optional. > > > > > > query_device is clearly optional, the only purpose the input comp mask > > > serves is to reduce expensive work in the driver by indicating that > > > some result bits are not needed. > > > > That's not how I've understand the purpose of the request's comp_mask > > after reading the presentation: request's comp_mask describe the content > > of the request. Then that additional content can trigger the presence > > of additional fields in the response if needed. > > Agreed - what I described was an abuse that some early Mellanox > patches for a query_ex included and it was just a shortcut to avoid > defining a dedicated input structure. It seems that same scheme was > copied into these new patches. > OK > > > It is perfectly OK for the kernel to > > > ignore the input comp mask, and OK for userspace to typically request > > > all bits. (indeed, I think this is a pretty silly optimization myself, > > > and the original patch that motivated this was restructured so it is > > > not needed) > > > > > > > It's not at all OK to ignore request's comp_mask: it has to be checked > > to find if there's a feature requested the kernel cannot fullfil, and > > any unknown bit is a possible feature request. So the kernel has to > > reject the request as a whole as it don't know how to process it. > > In the context of the above scheme the input structure was just this: > > struct query_input > { > u64 requested_output; > }; > > ie it wasn't actually a comp_mask, it just overlapped the comp_mask > bytes on output. > > Such a use was never explicitly documented, and IIRC was never > actually included in libibverbs. > OK > Unless someone has a strong reason we need to do this I am inclined to > think that your interpretation is the better one, and we could always > add a requested_output to the input someday if it became urgent. > Sure one field can be added later: in such case, one bit of request's comp_mask will gain the meaning: "request_output field is present in the request". > In any event, you are right, we can't ingore the input bytes today and > expect to give them meaning tomorrow. > Sure. > > As we don't know yet how we would extend the extended QUERY_DEVICE > > uverbs, the kernel have to refuse any random value. > > > > http://blog.ffwll.ch/2013/11/botching-up-ioctls.html > > or the kernel treats QUERY_DEVICE as an output only function and never > inspects the in/out buffer at all. > It's something we could think of but the extended uverbs is already merge with the comp_mask field in the request and it cost only 4 bytes, while allowing us to make mistake on the first iteration of the extended QUERY_DEVICE uverb (provided we manage to add the check for the field being 0). > > > > Thanks for the link to the presentation. > > > > > > If I recall the presentation is old and had some flaws that were > > > addressed before things made it into libibverbs.. > > > > I have to have a look to this part of libibverbs: I'm not sure > > the extended QUERY_DEVICE is already implemented. > > The patches turned out to be unnecessary and were dropped, IIRC. > OK. > > > Thank you for looking at this, it is very important that this scheme > > > is used properly, and it is very easy to make mistakes. I haven't had > > > time to review any of these new patches myself. > > > I hope you would find some time to review my latest patchset so that > > we don't miss a corner case. It's starting to become urgent. > > I have and will, thank you > :) Thanks. Regards.
On Thu, Jan 29, 2015 at 10:14:29PM +0100, Yann Droneaud wrote: > > > Unfortunately, the userspace don't get the size of the returned data: > > > it's only a single "write()" syscall after all. > > > > A write syscall that behaves nothing like write() actually should, so > > I don't see why we can't have > > > > resp_len = write(fd,inout_buf,sizeof(input_len)); > > > > Returning 0 from write makes no sense at all. Okay, yes, I see, the flow is different for the ex versions from the non-ex versions. > > In the fullness of your patchset it will maintain the invariant that > > resp_len <= sizeof(input_len) > I don't catch your point: the response can be bigger than the request. Indeed, I forgot the scheme used that indirection buffer with pointers. My comments on write() result are all wrong. Sorry > > It was never the intent that the size should be computed from > > comp_mask. If the size is necessary it must be explicit. > > > > In this instance if the size is not returned then libibverbs will have > > to zero the entire user buffer before passing it to the kernel, > > because it has to ensure any tail for the user app is 0'd. > > The proposed patch ensure the integrity of the response regarding > comp_mask: if a bit is set in response's comp_mask that means the > related fields are presents (and valid). The patch it is good, but sitting this into the larger framework where we have libibverbs consumers compiled against arbitrary versions of the structure creates a broader complexity that someone has to deal with. libibverbs needs to ensure the trailer portion of the structure from the end user is 0'd. > > Remember for santity we want comp mask bits for things that can't be 0 > > For me, it's better if a bit is set in response's comp_mask by the > kernel when the kernel have written something in the related fields > even if the those fields are all 0. Yes, but my point is that comp mask bit should only be used to protect fields where 0 is not an acceptable 'do not support' answer. Primarily we should rely on memset(0) to provide compatibility, and only when really necessary introduce a comp mask bit. This is why the size is important, we can't deduce the size from comp_mask, and without the size we can't know where the kernel stopped writing and the whole thing becomes a little more fragile. Zeroing the response buffer before calling into the kernel is not a great general pattern. Perhaps having the kernel zero the trailing portion it doesn't write too is okay, but it still feels like not telling user space the size is asking for future trouble. > > Ideally we want to minimize the number of COMP bits because it is a > > huge burden on the end user to work with them. > > Sure. So I think comp_mask is just an overly complicated way of > expressing the version and the size of the response. Yes. But understand it is selected to try and provide a good end user experience, exposing structure versions or size directly to the user is very difficult to work with. Ideally the end user will not see many comp mask bits (ie I would drop the ODP one) and things will 'just work' as compiled. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 8668b328b7e6..80a1c90f1dbf 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (err) return err; + if (cmd.comp_mask & (cmd.comp_mask + 1)) + return -EINVAL; + + if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) + return -EINVAL; + if (cmd.reserved) return -EINVAL;
This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known values. If userspace returns unknown features, -EINVAL will be returned, allowing to probe/discover which features are currently supported by the kernel. Moreover, it also ensure the requested features set in comp_mask are sequentially set, not skipping intermediate features, eg. the "highest" requested feature also request all the "lower" ones. This way each new feature will have to be stacked on top of the existing ones: this is especially important for the request and response data structures where fields are added after the current ones when expanded to support a new feature. Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Eli Cohen <eli@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> --- drivers/infiniband/core/uverbs_cmd.c | 6 ++++++ 1 file changed, 6 insertions(+)