Message ID | 1433420809-13529-5-git-send-email-kaike.wan@intel.com (mailing list archive)
---|---
State | Superseded
On Thu, Jun 04, 2015 at 08:26:49AM -0400, kaike.wan@intel.com wrote:
> From: Kaike Wan <kaike.wan@intel.com>
>
> This patch routes a SA pathrecord query to netlink first and processes the
> response appropriately. If a failure is returned, the request will be sent
> through IB. The decision whether to route the request to netlink first is
> determined by the presence of a listener for the local service netlink
> multicast group. If the user-space local service netlink multicast group
> listener is not present, the request will be sent through IB, just like
> what is currently being done.

Why doesn't this follow the usual netlink usage idioms? I don't see a
single nla_put or nla_nest_start, which, based on the RFC, is what I
expect to see. Don't home-brew netlink message construction without a
really great reason.

I am very surprised to see this layered under send_mad; all the context
is lost at that point, and it makes it impossible to provide all the
non-IB context (IP addrs, netdevs, QOS, etc).

Confused why all sorts of LS_NLA_PATH_F_X are defined but only
LS_NLA_PATH_F_USER is referenced. What does F_USER even mean? I'm pretty
sure that should be the same as cma:
(IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL)

Why define new path flags when this is already done as a UAPI?

Not sure the reply parse is done right either, where is the for loop?

The entire point of the original comment was to move away from using IB
SA Path query MADs in netlink, but this really only changes the reply
format. Better, but wondering if we should go further.

I was hoping to see the three callers of ib_sa_path_rec_get contribute
their information to the netlink message.

Suggest putting all the netlink stuff in its own file..

Bunch of smaller stuff, can look at v4..

Overall, much improved, I think.

Jason
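For reference, the construction idiom being asked for looks roughly like the sketch below. The attribute names LS_NLA_TYPE_SGID, LS_NLA_TYPE_DGID, and LS_NLA_TYPE_SERVICE_ID are illustrative placeholders, not part of the patch or of any existing UAPI; the ibnl_put_msg() call simply mirrors the one the patch itself uses.

/*
 * Illustrative only: conventional netlink construction with the standard
 * helpers.  LS_NLA_TYPE_* values below are hypothetical; RDMA_NL_SA and
 * RDMA_NL_LS_OP_RESOLVE are taken from the patch.
 */
static int build_resolve_request(struct sk_buff *skb, u32 seq,
				 const union ib_gid *sgid,
				 const union ib_gid *dgid,
				 __be64 service_id)
{
	struct nlmsghdr *nlh;
	void *hdr;

	hdr = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_SA,
			   RDMA_NL_LS_OP_RESOLVE, GFP_KERNEL);
	if (!hdr)
		return -EMSGSIZE;

	/* One self-describing attribute per field, instead of a raw MAD. */
	if (nla_put(skb, LS_NLA_TYPE_SGID, sizeof(*sgid), sgid) ||
	    nla_put(skb, LS_NLA_TYPE_DGID, sizeof(*dgid), dgid) ||
	    nla_put(skb, LS_NLA_TYPE_SERVICE_ID, sizeof(service_id),
		    &service_id))
		return -EMSGSIZE;

	nlmsg_end(skb, nlh);
	return 0;
}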
> I am very surprised to see this layered under send_mad, all the
> context is lost at that point, and it makes it impossible to provide
> all the non-IB context (IP addrs, netdevs, QOS, etc)

It's layered under query path record. This should provide the most
benefit to the most applications, including the existing kernel clients.
You could extend this by hooking into the rdma cm directly, but IMO,
that can be viewed as another optimization. Basically, if one were to
hook in only one place, this seems the most reasonable.

- Sean
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
> Sent: Thursday, June 04, 2015 4:20 PM
>
> On Thu, Jun 04, 2015 at 08:26:49AM -0400, kaike.wan@intel.com wrote:
> > From: Kaike Wan <kaike.wan@intel.com>
> >
> > This patch routes a SA pathrecord query to netlink first and processes
> > the response appropriately. If a failure is returned, the request will
> > be sent through IB. The decision whether to route the request to
> > netlink first is determined by the presence of a listener for the
> > local service netlink multicast group. If the user-space local service
> > netlink multicast group listener is not present, the request will be
> > sent through IB, just like what is currently being done.
>
> Why doesn't this follow the usual netlink usage idioms? I don't see a single
> nla_put or nla_nest_start, which, based on the RFC, is what I expect to see.
> Don't home-brew netlink message construction without a really great reason.

The only reason why we did not use nla_put is to avoid data copying. For
example, to translate the input struct ib_sa_path_rec into struct
ib_user_path_rec, we use the skb buffer reserved for the pathrecord
attribute directly as the destination of ib_copy_path_rec_to_user (in
ib_nl_set_path_rec_attr()), which is essentially the guts of nla_put.
To use nla_put directly, we would have to use a temp buffer to hold
ib_user_path_rec and then copy it to the skb.

> I am very surprised to see this layered under send_mad, all the context is
> lost at that point, and it makes it impossible to provide all the non-IB
> context (IP addrs, netdevs, QOS, etc)

As Sean pointed out in a separate e-mail, this provides benefits to more
applications. Currently, rdma_cm, ipoib, and SRP call
ib_sa_path_rec_get() to query pathrecords. In addition, this does not
preclude optimization in other locations in the kernel in the future.

> Confused why all sorts of LS_NLA_PATH_F_X are defined but only
> LS_NLA_PATH_F_USER is referenced. What does F_USER even mean?

It indicates that the format of the attribute is struct ib_user_path_rec
instead of packed MAD wire format (or struct ibv_path_rec in user
space). This has been documented in the original RFC.

> I'm pretty sure that should be the same as cma:
> (IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL)
>
> Why define new path flags when this is already done as a UAPI?

Yes, the flag is really copied from the pathrecord flags for struct
ib_path_rec_data in include/uapi/rdma/ib_user_sa.h because our
pathrecord attribute (struct rdma_nla_ls_path_rec) is similar to struct
ib_path_rec_data, but not exactly the same. That's why we should define
its own flags as a separate entity.

> Not sure the reply parse is done right either, where is the for loop?

Well, for ib_sa_path_rec_get(), we are expecting to see only one
pathrecord in the response. For other kernel locations where multiple
pathrecords are expected, one may have to iterate through all returned
attributes.

> The entire point of the original comment was to move away from using IB SA
> Path query MADs in netlink, but this really only changes the reply format.
> Better, but wondering if we should go further.
>
> I was hoping to see the three callers of ib_sa_path_rec_get contribute their
> information to the netlink message.
>
> Suggest putting all the netlink stuff in its own file..

All our netlink stuff is based on the rdma_netlink.h definitions
(client, multicast group, opcode, attributes), and I think that the
include/uapi/rdma/rdma_netlink.h file is its natural habitat.

> Bunch of smaller stuff, can look at v4..
>
> Overall, much improved, I think.
>
> Jason
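For context, the in-place construction described above is essentially what the stock nla_reserve() helper already provides. A minimal sketch, assuming the patch's struct rdma_nla_ls_path_rec layout (flags plus a trailing path_rec buffer):

/*
 * Sketch only: reserve the attribute in the skb and marshal the record
 * straight into it, avoiding a temporary ib_user_path_rec.  Same
 * zero-copy effect as the open-coded skb_put() in the patch.
 */
static int ib_nl_put_path_rec(struct sk_buff *skb, struct ib_sa_path_rec *rec)
{
	struct nlattr *nla;
	struct rdma_nla_ls_path_rec *attr;

	nla = nla_reserve(skb, LS_NLA_TYPE_PATH_RECORD,
			  sizeof(*attr) + sizeof(struct ib_user_path_rec));
	if (!nla)
		return -EMSGSIZE;

	attr = nla_data(nla);
	attr->flags = LS_NLA_PATH_F_USER;
	ib_copy_path_rec_to_user((struct ib_user_path_rec *)attr->path_rec,
				 rec);
	return 0;
}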
On Fri, Jun 05, 2015 at 10:57:03AM +0000, Wan, Kaike wrote:
> > Why doesn't this follow the usual netlink usage idioms? I don't
> > see a single nla_put or nla_nest_start, which, based on the RFC,
> > is what I expect to see. Don't home-brew netlink message
> > construction without a really great reason.
>
> The only reason why we did not use nla_put is to avoid data
> copying. For example, to translate the input struct ib_sa_path_rec
> into struct ib_user_path_rec, we use the skb buffer reserved for the
> pathrecord attribute directly as the destination of
> ib_copy_path_rec_to_user (in ib_nl_set_path_rec_attr()), which is
> essentially the guts of nla_put. To use nla_put directly, we would
> have to use a temp buffer to hold ib_user_path_rec and then copy it
> to the skb.

Okay, but isn't that what nla_reserve is for?

> > I am very surprised to see this layered under send_mad, all the
> > context is lost at that point, and it makes it impossible to
> > provide all the non-IB context (IP addrs, netdevs, QOS, etc)
>
> As Sean pointed out in a separate e-mail, this provides benefits to
> more applications. Currently, rdma_cm, ipoib, and SRP call
> ib_sa_path_rec_get() to query pathrecords.

Yes, I saw Sean's comment, and I am glad we all understand there are
only 3 callers. And looking at the callers there are only a small
number of elements being used:

rdma cm:
	IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
	IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH |
	IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_SERVICE_ID;
	IB_SA_PATH_REC_QOS_CLASS;
	IB_SA_PATH_REC_TRAFFIC_CLASS

ipoib:
	IB_SA_PATH_REC_DGID |
	IB_SA_PATH_REC_SGID |
	IB_SA_PATH_REC_NUMB_PATH |
	IB_SA_PATH_REC_TRAFFIC_CLASS |
	IB_SA_PATH_REC_PKEY,

srp:
	IB_SA_PATH_REC_SERVICE_ID |
	IB_SA_PATH_REC_DGID |
	IB_SA_PATH_REC_SGID |
	IB_SA_PATH_REC_NUMB_PATH |
	IB_SA_PATH_REC_PKEY,

So, rather than a SA mad in netlink, I'd suggest actually using
extensible netlink here and define the query cleanly using nested
attributes:

RDMA_OP_RESOLVE (request)
	GIDS
	SERVICE_ID
	QOS/TCLASS
	PKEY
	QP TYPE (aka reversible)

RDMA_OP_RESOLVE (reply)
	IB_PATH
	IB_PATH
	IB_PATH

This could probably still be done under send_mad, and if you don't want
to finish the job then that is probably fine.

The reason to prefer native netlink format as a UAPI is because it is
more specific. It is very hard to process a general SA Path Record
query mad; there are many combinations and many wonky attributes. That
makes implementing a correct/future-proof user space client
unnecessarily hard. With native netlink we can define our own
conventions for UAPI extension.

.. and with only 3 callers we don't have a huge burden to shift if a
kAPI needs to change to make this natural ..

> In addition, this does not preclude optimization in other locations
> in the kernel in the future.

Confused. Adding new netlink attributes is not an optimization, it is a
UAPI enhancement.

> > Confused why all sorts of LS_NLA_PATH_F_X are defined but only
> > LS_NLA_PATH_F_USER is referenced. What does F_USER even mean?
>
> It indicates that the format of the attribute is struct
> ib_user_path_rec instead of packed MAD wire format (or struct
> ibv_path_rec in user space). This has been documented in the
> original RFC.

Why do we need two formats in the uapi? Drop the !user one.

Even so, it isn't the netlink way to bury two different structs in the
same attribute; you need to use netlink headers to tell them apart, not
buried flags.

> > I'm pretty sure that should be the same as cma:
> > (IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL)

You missed this one.

You can't just ignore the flag bits; they have to be checked. It is
very important to have an extensible uapi.

> > Why define new path flags when this is already done as a UAPI?
>
> Yes, the flag is really copied from the pathrecord flags for struct
> ib_path_rec_data in include/uapi/rdma/ib_user_sa.h because our
> pathrecord attribute (struct rdma_nla_ls_path_rec) is similar to
> struct ib_path_rec_data, but not exactly the same. That's why we
> should define its own flags as a separate entity.

I would keep the original flags for consistency, especially once the
!user case is dropped. This is because it should be exactly the same
protocol and system as the other users.

> > Not sure the reply parse is done right either, where is the for loop?
>
> Well, for ib_sa_path_rec_get(), we are expecting to see only one
> pathrecord in the response. For other kernel locations where
> multiple pathrecords are expected, one may have to iterate through
> all returned attributes.

No, you must accept multiple attributes and loop to find a supported
one (today's kernel only supports IB_PATH_GMP | IB_PATH_PRIMARY |
IB_PATH_BIDIRECTIONAL). Review the other implementations of IB_PATH_*
flags in the kernel to see how this works.

Otherwise future extensibility is compromised.

> > Suggest putting all the netlink stuff in its own file..
>
> All our netlink stuff is based on the rdma_netlink.h definitions
> (client, multicast group, opcode, attributes) and I think that the
> include/uapi/rdma/rdma_netlink.h file is its natural habitat.

I mean the .c file stuff.

Jason
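A rough sketch of the reply-side loop being requested, assuming the payload of each returned path attribute follows ucma's struct ib_path_rec_data convention; LS_NLA_TYPE_PATH_RECORD is the patch's attribute type, and the IB_PATH_* flags and ib_sa_unpack_path() are existing kernel definitions:

/*
 * Sketch: walk every returned path attribute and pick one whose flags
 * today's kernel can actually consume.
 */
static int ib_nl_find_usable_path(const struct nlmsghdr *nlh,
				  struct ib_sa_path_rec *out)
{
	const struct nlattr *attr;
	int rem;

	nla_for_each_attr(attr, nlmsg_data(nlh), nlmsg_len(nlh), rem) {
		const struct ib_path_rec_data *path;

		if (nla_type(attr) != LS_NLA_TYPE_PATH_RECORD ||
		    nla_len(attr) < sizeof(*path))
			continue;

		path = nla_data(attr);
		/* Only a reversible GMP primary path is usable today. */
		if ((path->flags & (IB_PATH_GMP | IB_PATH_PRIMARY |
				    IB_PATH_BIDIRECTIONAL)) !=
		    (IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL))
			continue;

		ib_sa_unpack_path((void *)path->path_rec, out);
		return 0;
	}
	return -ENODATA;
}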
> Yes, I saw Sean's comment, and I am glad we all understand there are
> only 3 callers. And looking at the callers there are only a small
> number of elements being used:
>
> rdma cm:
> 	IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
> 	IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH |
> 	IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_SERVICE_ID;
> 	IB_SA_PATH_REC_QOS_CLASS;
> 	IB_SA_PATH_REC_TRAFFIC_CLASS
>
> ipoib:
> 	IB_SA_PATH_REC_DGID |
> 	IB_SA_PATH_REC_SGID |
> 	IB_SA_PATH_REC_NUMB_PATH |
> 	IB_SA_PATH_REC_TRAFFIC_CLASS |
> 	IB_SA_PATH_REC_PKEY,
>
> srp:
> 	IB_SA_PATH_REC_SERVICE_ID |
> 	IB_SA_PATH_REC_DGID |
> 	IB_SA_PATH_REC_SGID |
> 	IB_SA_PATH_REC_NUMB_PATH |
> 	IB_SA_PATH_REC_PKEY,
>
> So, rather than a SA mad in netlink, I'd suggest actually using
> extensible netlink here and define the query cleanly using nested
> attributes:
>
> RDMA_OP_RESOLVE (request)
> 	GIDS
> 	SERVICE_ID
> 	QOS/TCLASS
> 	PKEY
> 	QP TYPE (aka reversible)
>
> RDMA_OP_RESOLVE (reply)
> 	IB_PATH
> 	IB_PATH
> 	IB_PATH

I thought you were asking to move the hook site. This approach seems
reasonable. We should just need to add a check against the compmask.
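A minimal sketch of such a compmask gate; the macro and helper names here are made up for illustration, while the IB_SA_PATH_REC_* bits are the existing kernel definitions listed above:

/*
 * Hypothetical gate: only hand a query to the userspace resolver when
 * the caller's comp_mask is limited to fields the resolver understands.
 */
#define IB_NL_SUPPORTED_COMP_MASK \
	(IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | \
	 IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_REVERSIBLE | \
	 IB_SA_PATH_REC_SERVICE_ID | IB_SA_PATH_REC_QOS_CLASS | \
	 IB_SA_PATH_REC_TRAFFIC_CLASS)

static bool ib_nl_can_resolve(ib_sa_comp_mask comp_mask)
{
	return !(comp_mask & ~IB_NL_SUPPORTED_COMP_MASK);
}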
On Fri, Jun 05, 2015 at 08:53:01PM +0000, Hefty, Sean wrote:
> > RDMA_OP_RESOLVE (reply)
> > 	IB_PATH
> > 	IB_PATH
> > 	IB_PATH
>
> I thought you were asking to move the hook site. This approach
> seems reasonable. We should just need to add a check against the
> compmask.

I've been asking if we can include more meta-information, like the
netdev for rdma-cm and ipoib. Having the hook site in send_mad makes
that seem hard to reach. (maybe not?)

On one hand, it would be nice to launch a new UAPI with broad feature
completeness; on the other hand, if it is a lot of work then go
incremental. That is why I only said I was surprised :)

Jason
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
> Sent: Friday, June 05, 2015 1:05 PM
>
> > > Why doesn't this follow the usual netlink usage idioms? I don't see
> > > a single nla_put or nla_nest_start, which, based on the RFC, is what
> > > I expect to see. Don't home-brew netlink message construction
> > > without a really great reason.
> >
> > The only reason why we did not use nla_put is to avoid data copying.
> > For example, to translate the input struct ib_sa_path_rec into struct
> > ib_user_path_rec, we use the skb buffer reserved for the pathrecord
> > attribute directly as the destination of ib_copy_path_rec_to_user
> > (in ib_nl_set_path_rec_attr()), which is essentially the guts of
> > nla_put. To use nla_put directly, we would have to use a temp buffer
> > to hold ib_user_path_rec and then copy it to the skb.
>
> Okay, but isn't that what nla_reserve is for?

Yes. I will replace that part of the code with nla_reserve. Thank you
for pointing it out.

> > > I am very surprised to see this layered under send_mad, all the
> > > context is lost at that point, and it makes it impossible to provide
> > > all the non-IB context (IP addrs, netdevs, QOS, etc)
> >
> > As Sean pointed out in a separate e-mail, this provides benefits to
> > more applications. Currently, rdma_cm, ipoib, and SRP call
> > ib_sa_path_rec_get() to query pathrecords.
>
> Yes, I saw Sean's comment, and I am glad we all understand there are
> only 3 callers. And looking at the callers there are only a small
> number of elements being used:
>
> rdma cm:
> 	IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
> 	IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH |
> 	IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_SERVICE_ID;
> 	IB_SA_PATH_REC_QOS_CLASS;
> 	IB_SA_PATH_REC_TRAFFIC_CLASS
>
> ipoib:
> 	IB_SA_PATH_REC_DGID |
> 	IB_SA_PATH_REC_SGID |
> 	IB_SA_PATH_REC_NUMB_PATH |
> 	IB_SA_PATH_REC_TRAFFIC_CLASS |
> 	IB_SA_PATH_REC_PKEY,
>
> srp:
> 	IB_SA_PATH_REC_SERVICE_ID |
> 	IB_SA_PATH_REC_DGID |
> 	IB_SA_PATH_REC_SGID |
> 	IB_SA_PATH_REC_NUMB_PATH |
> 	IB_SA_PATH_REC_PKEY,
>
> So, rather than a SA mad in netlink, I'd suggest actually using
> extensible netlink here and define the query cleanly using nested
> attributes:
>
> RDMA_OP_RESOLVE (request)
> 	GIDS
> 	SERVICE_ID
> 	QOS/TCLASS
> 	PKEY
> 	QP TYPE (aka reversible)
>
> RDMA_OP_RESOLVE (reply)
> 	IB_PATH
> 	IB_PATH
> 	IB_PATH

Are you saying that instead of defining the pathrecord attribute only,
we can define the following simpler attributes: GID, SERVICE_ID,
QOS/TCLASS, PKEY, QP TYPE, IB_PATH, and then use them in the request or
response? In this case, we need more attribute types (eg, SGID and DGID
as two different attribute types instead of a single GID attribute type
with a flags field in the attribute data) to avoid using flags as
attribute modifiers.

For the response, we would still like to have the struct ibv_path_rec
format (MAD wire format) for this patch series. In the future, it might
be possible to use individual attributes just like the request.

> This could probably still be done under send_mad, and if you don't
> want to finish the job then that is probably fine.

We would like to put the netlink call under send_mad() for two reasons:
(1) it can potentially be used by other SA requests (eg getting a
ServiceRecord); (2) send_mad has other operations (storing the query
pointer in the idr map, setting the timeout) in addition to calling the
MAD stack. Theoretically, we could do the netlink call inside
ib_sa_path_rec_get() before calling send_mad. However, since we need to
handle the timeout and the netlink response in a separate thread, we
need to set everything up properly.

> The reason to prefer native netlink format as a UAPI is because it is
> more specific. It is very hard to process a general SA Path Record
> query mad; there are many combinations and many wonky attributes. That
> makes implementing a correct/future-proof user space client
> unnecessarily hard. With native netlink we can define our own
> conventions for UAPI extension.

Agree. That's why we are currently using struct ib_user_path_rec as the
request input. As a matter of fact, we are using only some of the
parameters in ibacm (sgid, dgid, slid, dlid, etc).

> .. and with only 3 callers we don't have a huge burden to shift if a
> kAPI needs to change to make this natural ..
>
> > In addition, this does not preclude optimization in other locations in
> > the kernel in the future.
>
> Confused. Adding new netlink attributes is not an optimization, it is a
> UAPI enhancement.
>
> > > Confused why all sorts of LS_NLA_PATH_F_X are defined but only
> > > LS_NLA_PATH_F_USER is referenced. What does F_USER even mean?
> >
> > It indicates that the format of the attribute is struct
> > ib_user_path_rec instead of packed MAD wire format (or struct
> > ibv_path_rec in user space). This has been documented in the original
> > RFC.
>
> Why do we need two formats in the uapi? Drop the !user one.

If we don't use the struct ib_user_path_rec format, we don't need it
any more.

> Even so, it isn't the netlink way to bury two different structs in the
> same attribute; you need to use netlink headers to tell them apart, not
> buried flags.

I was back and forth during the implementation on whether to use more
attribute types or use a single attribute type and flags to define
different structures. I can switch to using different attributes for
different structures.

> > > I'm pretty sure that should be the same as cma:
> > > (IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL)
>
> You missed this one.
>
> You can't just ignore the flag bits; they have to be checked. It is
> very important to have an extensible uapi.
>
> > > Why define new path flags when this is already done as a UAPI?
> >
> > Yes, the flag is really copied from the pathrecord flags for struct
> > ib_path_rec_data in include/uapi/rdma/ib_user_sa.h because our
> > pathrecord attribute (struct rdma_nla_ls_path_rec) is similar to
> > struct ib_path_rec_data, but not exactly the same. That's why we
> > should define its own flags as a separate entity.
>
> I would keep the original flags for consistency, especially once the
> !user case is dropped. This is because it should be exactly the same
> protocol and system as the other users.

I will do that.

> > > Not sure the reply parse is done right either, where is the for loop?
> >
> > Well, for ib_sa_path_rec_get(), we are expecting to see only one
> > pathrecord in the response. For other kernel locations where multiple
> > pathrecords are expected, one may have to iterate through all returned
> > attributes.
>
> No, you must accept multiple attributes and loop to find a supported
> one (today's kernel only supports IB_PATH_GMP | IB_PATH_PRIMARY |
> IB_PATH_BIDIRECTIONAL). Review the other implementations of IB_PATH_*
> flags in the kernel to see how this works.
>
> Otherwise future extensibility is compromised.

OK. I will do the looping to find the correct one.

Thanks,

Kaike
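For illustration only, a per-field attribute layout along the lines discussed above might look like the following; every name and value in this sketch is hypothetical and not part of the posted patch or any existing UAPI header:

/*
 * Hypothetical request attributes, one per field, replacing the
 * flags-inside-the-payload approach.  Values are illustrative only.
 */
enum {
	LS_NLA_TYPE_UNSPEC,
	LS_NLA_TYPE_DGID,	/* binary, sizeof(union ib_gid) */
	LS_NLA_TYPE_SGID,	/* binary, sizeof(union ib_gid) */
	LS_NLA_TYPE_SERVICE_ID,	/* __be64 */
	LS_NLA_TYPE_PKEY,	/* u16 */
	LS_NLA_TYPE_TCLASS,	/* u8 */
	LS_NLA_TYPE_QP_TYPE,	/* u8, replaces "reversible" */
	LS_NLA_TYPE_MAX
};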
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org]
> On Behalf Of Jason Gunthorpe
> Sent: Friday, June 05, 2015 5:00 PM
>
> On Fri, Jun 05, 2015 at 08:53:01PM +0000, Hefty, Sean wrote:
> > > RDMA_OP_RESOLVE (reply)
> > > 	IB_PATH
> > > 	IB_PATH
> > > 	IB_PATH
> >
> > I thought you were asking to move the hook site. This approach seems
> > reasonable. We should just need to add a check against the compmask.
>
> I've been asking if we can include more meta-information, like the
> netdev for rdma-cm and ipoib. Having the hook site in send_mad makes
> that seem hard to reach. (maybe not?)
>
> On one hand, it would be nice to launch a new UAPI with broad feature
> completeness; on the other hand, if it is a lot of work then go
> incremental. That is why I only said I was surprised :)

I would like the incremental approach.

Kaike
On Fri, Jun 05, 2015 at 09:01:11PM +0000, Wan, Kaike wrote:
> > So, rather than a SA mad in netlink, I'd suggest actually using
> > extensible netlink here and define the query cleanly using nested
> > attributes:
> >
> > RDMA_OP_RESOLVE (request)
> > 	GIDS
> > 	SERVICE_ID
> > 	QOS/TCLASS
> > 	PKEY
> > 	QP TYPE (aka reversible)
> >
> > RDMA_OP_RESOLVE (reply)
> > 	IB_PATH
> > 	IB_PATH
> > 	IB_PATH
>
> Are you saying that instead of defining the pathrecord attribute
> only, we can define the following simpler attributes: GID,
> SERVICE_ID, QOS/TCLASS, PKEY, QP TYPE, IB_PATH, and then use them in
> the request or response?

Request only. Response will be an array of ib_path_rec_data (and sorry
if I was confusing on !user/user; the goal is to match ucma's existing
API for this, see ucma_set_ib_path).

Please check if we can extend the answer array too, or if it needs to
be more like:

RDMA_OP_RESOLVE (reply)
	RESOLVED_PATH (#1, REVERSIBLE, GMP)
		FLAGS
		IB_PATH (ib_path_rec_data)
		FUTURE_EXTENSION
	RESOLVED_PATH (#2 FORWARD ONLY)
		..
	RESOLVED_PATH (#3 REVERSE ONLY)
		..
	RESOLVED_PATH (#4 Alternate FORWARD ONLY)
		..
	RESOLVED_PATH (#5 Alternate REVERSE ONLY)
		..
	RESOLVED_PATH (#6 Alternate REVERSIBLE, GMP)
		..

To elaborate again: this is the best API I've seen to add APM -
userspace can have the 'policy' side that APM needs and plug in to the
kernel here with this netlink interface. Obviously the kernel side is
missing, but let's plan ahead :)

> In this case, we need more attribute types
> (eg, SGID and DGID as two different attribute types instead of a
> single GID attribute type with a flags field in the attribute data)

Well, think about it and slice it up as you feel best, using netlink
conventions. I.e. every address should be carried similarly to
IFLA_ADDRESS, with the type (AF_*) and size appropriate.

SGID and DGID are mandatory, but the overhead of splitting them is
probably small?

I'd *really* like to use QP TYPE instead of reversible; a UD QP has a
different path set need than a RC QP, for instance. This is forward
looking toward APM.

> > This could probably still be done under send_mad, and if you don't
> > want to finish the job then that is probably fine.
>
> We would like to put the netlink call under send_mad() for two reasons:
>
> (1) It can potentially be used by other SA requests (eg getting a
> ServiceRecord);
> (2) send_mad has other operations (storing the query pointer in the idr
> map, setting the timeout) in addition to calling the MAD stack.
> Theoretically, we could do the netlink call inside ib_sa_path_rec_get()
> before calling send_mad. However, since we need to handle the timeout
> and the netlink response in a separate thread, we need to set
> everything up properly.

Yes, I appreciate this.

> > Even so, it isn't the netlink way to bury two different structs in the
> > same attribute; you need to use netlink headers to tell them apart, not
> > buried flags.
>
> I was back and forth during the implementation on whether to use
> more attribute types or use a single attribute type and flags to
> define different structures. I can switch to using different
> attributes for different structures.

Generally speaking, with netlink we want to see the attribute
self-describe with the common length field. Different things get
different attributes. This lets you use the existing mechanics for
handling attributes, and the code reads cleanly.

Jason
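A sketch of how the nested RESOLVED_PATH layout outlined above could be emitted with the standard nesting helpers; all LS_NLA_TYPE_* names here are placeholders, not an existing UAPI:

/*
 * Sketch: emit one nested RESOLVED_PATH { FLAGS, IB_PATH } entry.
 * Attribute type names are hypothetical.
 */
static int put_resolved_path(struct sk_buff *skb, u32 path_flags,
			     const struct ib_path_rec_data *path)
{
	struct nlattr *nest;

	nest = nla_nest_start(skb, LS_NLA_TYPE_RESOLVED_PATH);
	if (!nest)
		return -EMSGSIZE;

	if (nla_put_u32(skb, LS_NLA_TYPE_FLAGS, path_flags) ||
	    nla_put(skb, LS_NLA_TYPE_IB_PATH, sizeof(*path), path)) {
		nla_nest_cancel(skb, nest);
		return -EMSGSIZE;
	}

	nla_nest_end(skb, nest);
	return 0;
}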
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 17e1cf7..205a419 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -45,12 +45,21 @@
 #include <uapi/linux/if_ether.h>
 #include <rdma/ib_pack.h>
 #include <rdma/ib_cache.h>
+#include <rdma/rdma_netlink.h>
+#include <net/netlink.h>
+#include <uapi/rdma/ib_user_sa.h>
+#include <rdma/ib_marshall.h>
 #include "sa.h"

 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("InfiniBand subnet administration query support");
 MODULE_LICENSE("Dual BSD/GPL");

+#define IB_SA_LOCAL_SVC_TIMEOUT_MIN		100
+#define IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT		2000
+#define IB_SA_LOCAL_SVC_TIMEOUT_MAX		200000
+static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
+
 struct ib_sa_sm_ah {
	struct ib_ah *ah;
	struct kref ref;
@@ -80,8 +89,25 @@ struct ib_sa_query {
	struct ib_mad_send_buf *mad_buf;
	struct ib_sa_sm_ah *sm_ah;
	int id;
+	u32 flags;
+	void *input;
 };

+#define IB_SA_ENABLE_LOCAL_SERVICE	0x00000001
+#define IB_SA_CANCEL			0x00000002
+
+#define IB_SA_LOCAL_SVC_ENABLED(query) \
+	((query)->flags & IB_SA_ENABLE_LOCAL_SERVICE)
+#define IB_SA_ENABLE_LOCAL_SVC(query) \
+	((query)->flags |= IB_SA_ENABLE_LOCAL_SERVICE)
+#define IB_SA_DISABLE_LOCAL_SVC(query) \
+	((query)->flags &= ~IB_SA_ENABLE_LOCAL_SERVICE)
+
+#define IB_SA_QUERY_CANCELLED(query) \
+	((query)->flags & IB_SA_CANCEL)
+#define IB_SA_CANCEL_QUERY(query) \
+	((query)->flags |= IB_SA_CANCEL)
+
 struct ib_sa_service_query {
	void (*callback)(int, struct ib_sa_service_rec *, void *);
	void *context;
@@ -106,6 +132,26 @@ struct ib_sa_mcmember_query {
	struct ib_sa_query sa_query;
 };

+struct ib_nl_request_info {
+	struct list_head list;
+	u32 seq;
+	unsigned long timeout;
+	struct ib_sa_query *query;
+};
+
+struct ib_nl_attr_info {
+	u16 type;
+	u16 len;	/* Attribute payload length */
+	void *input;
+	void (*set_attr)(struct sk_buff *skb, struct ib_nl_attr_info *info);
+};
+
+static LIST_HEAD(ib_nl_request_list);
+static DEFINE_SPINLOCK(ib_nl_request_lock);
+static atomic_t ib_nl_sa_request_seq;
+static struct workqueue_struct *ib_nl_wq;
+static struct delayed_work ib_nl_timed_work;
+
 static void ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device);

@@ -381,6 +427,405 @@ static const struct ib_field guidinfo_rec_table[] = {
	  .size_bits = 512 },
 };

+static int ib_nl_send_msg(int opcode, struct ib_nl_attr_info *attrs,
+			  int num_attrs, u32 seq)
+{
+	struct sk_buff *skb = NULL;
+	struct nlmsghdr *nlh;
+	void *data;
+	int ret = 0;
+	int i, len = 0;
+
+	for (i = 0; i < num_attrs; i++)
+		len += nla_total_size(attrs[i].len);
+
+	if (len <= 0)
+		return -EMSGSIZE;
+
+	skb = nlmsg_new(len, GFP_KERNEL);
+	if (!skb) {
+		pr_err("alloc failed ret=%d\n", ret);
+		return -ENOMEM;
+	}
+
+	/* Put nlmsg header only for now */
+	data = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_SA,
+			    opcode, GFP_KERNEL);
+	if (!data) {
+		kfree_skb(skb);
+		return -EMSGSIZE;
+	}
+
+	/* Add attributes */
+	for (i = 0; i < num_attrs; i++)
+		attrs[i].set_attr(skb, &attrs[i]);
+
+	/* Repair the nlmsg header length */
+	nlmsg_end(skb, nlh);
+
+	ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+	if (!ret) {
+		ret = len;
+	} else {
+		if (ret != -ESRCH)
+			pr_err("ibnl_multicast failed l=%d, r=%d\n", len, ret);
+		ret = 0;
+	}
+	return ret;
+}
+
+static struct ib_nl_request_info *
+ib_nl_alloc_request(struct ib_sa_query *query)
+{
+	struct ib_nl_request_info *rinfo;
+
+	rinfo = kzalloc(sizeof(*rinfo), GFP_ATOMIC);
+	if (rinfo == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&rinfo->list);
+	rinfo->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
+	rinfo->query = query;
+
+	return rinfo;
+}
+
+static void ib_nl_set_path_rec_attr(struct sk_buff *skb,
+				    struct ib_nl_attr_info *info)
+{
+	struct nlattr *nla;
+	struct rdma_nla_ls_path_rec *nla_rec;
+
+	nla = (struct nlattr *) skb_put(skb, nla_total_size(info->len));
+	nla->nla_type = info->type;
+	nla->nla_len = nla_attr_size(info->len);
+	memset((unsigned char *) nla + nla->nla_len, 0, nla_padlen(info->len));
+
+	nla_rec = (struct rdma_nla_ls_path_rec *) nla_data(nla);
+	nla_rec->flags = LS_NLA_PATH_F_USER;
+
+	/*
+	 * We know that the input is of type struct ib_sa_path_rec while
+	 * the output is of type struct ib_user_path_rec.
+	 */
+	ib_copy_path_rec_to_user((struct ib_user_path_rec *) nla_rec->path_rec,
+				 (struct ib_sa_path_rec *) info->input);
+}
+
+static int ib_nl_send_request(struct ib_nl_request_info *rinfo)
+{
+	struct ib_nl_attr_info info;
+	int opcode;
+	struct ib_sa_mad *mad;
+	unsigned long flags;
+	unsigned long delay;
+	int ret;
+
+	mad = rinfo->query->mad_buf->mad;
+	switch (mad->mad_hdr.attr_id) {
+	case cpu_to_be16(IB_SA_ATTR_PATH_REC):
+		opcode = RDMA_NL_LS_OP_RESOLVE;
+		info.type = LS_NLA_TYPE_PATH_RECORD;
+		info.len = sizeof(struct rdma_nla_ls_path_rec) +
+			sizeof(struct ib_user_path_rec);
+		info.input = rinfo->query->input;
+		info.set_attr = ib_nl_set_path_rec_attr;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	ret = ib_nl_send_msg(opcode, &info, 1, rinfo->seq);
+	if (ret <= 0) {
+		ret = -EIO;
+		goto request_out;
+	} else {
+		ret = 0;
+	}
+
+	delay = msecs_to_jiffies(sa_local_svc_timeout_ms);
+	rinfo->timeout = delay + jiffies;
+	list_add_tail(&rinfo->list, &ib_nl_request_list);
+	/* Start the timeout if this is the only request */
+	if (ib_nl_request_list.next == &rinfo->list)
+		queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+
+request_out:
+	spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+
+	return ret;
+}
+
+static int ib_nl_make_request(struct ib_sa_query *query)
+{
+	struct ib_nl_request_info *rinfo;
+	int ret;
+
+	rinfo = ib_nl_alloc_request(query);
+	if (IS_ERR(rinfo))
+		return -ENOMEM;
+
+	ret = ib_nl_send_request(rinfo);
+	if (ret)
+		kfree(rinfo);
+
+	return ret;
+}
+
+static int ib_nl_cancel_request(struct ib_sa_query *query)
+{
+	unsigned long flags;
+	struct ib_nl_request_info *rinfo;
+	int found = 0;
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	list_for_each_entry(rinfo, &ib_nl_request_list, list) {
+		/* Let the timeout to take care of the callback */
+		if (query == rinfo->query) {
+			IB_SA_CANCEL_QUERY(query);
+			rinfo->timeout = jiffies;
+			list_move(&rinfo->list, &ib_nl_request_list);
+			found = 1;
+			mod_delayed_work(ib_nl_wq, &ib_nl_timed_work, 1);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+
+	return found;
+}
+
+static void send_handler(struct ib_mad_agent *agent,
+			 struct ib_mad_send_wc *mad_send_wc);
+
+static void ib_nl_process_good_resolve_rsp(struct ib_sa_query *query,
+					   const struct nlmsghdr *nlh)
+{
+	struct ib_mad_send_wc mad_send_wc;
+	struct ib_sa_mad *mad = NULL;
+	const struct nlattr *attr;
+	struct rdma_nla_ls_path_rec *rec;
+	struct ib_user_path_rec *user_rec;
+	struct ib_sa_path_rec sa_rec;
+	struct ib_sa_path_query *path_query =
+		container_of(query, struct ib_sa_path_query, sa_query);
+
+	if (query->callback) {
+		attr = (const struct nlattr *) nlmsg_data(nlh);
+		rec = (struct rdma_nla_ls_path_rec *)
+			nla_data(attr);
+		if (rec->flags & LS_NLA_PATH_F_USER) {
+			user_rec = (struct ib_user_path_rec *)
+				rec->path_rec;
+			memset(&sa_rec, 0, sizeof(sa_rec));
+			ib_copy_path_rec_from_user(&sa_rec, user_rec);
+			path_query->callback(0, &sa_rec, path_query->context);
+		} else {
+			mad = query->mad_buf->mad;
+			mad->mad_hdr.method |= IB_MGMT_METHOD_RESP;
+			memcpy(mad->data, rec->path_rec,
+			       nla_len(attr) - sizeof(*rec));
+			query->callback(query, 0, mad);
+		}
+	}
+
+	mad_send_wc.send_buf = query->mad_buf;
+	mad_send_wc.status = IB_WC_SUCCESS;
+	send_handler(query->mad_buf->mad_agent, &mad_send_wc);
+}
+
+static void ib_nl_request_timeout(struct work_struct *work)
+{
+	unsigned long flags;
+	struct ib_nl_request_info *rinfo;
+	struct ib_sa_query *query;
+	unsigned long delay;
+	struct ib_mad_send_wc mad_send_wc;
+	int ret;
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	while (!list_empty(&ib_nl_request_list)) {
+		rinfo = list_entry(ib_nl_request_list.next,
+				   struct ib_nl_request_info, list);
+
+		if (time_after(rinfo->timeout, jiffies)) {
+			delay = rinfo->timeout - jiffies;
+			if ((long)delay <= 0)
+				delay = 1;
+			queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+			break;
+		}
+
+		list_del(&rinfo->list);
+		query = rinfo->query;
+		IB_SA_DISABLE_LOCAL_SVC(query);
+		/* Hold the lock to protect against query cancellation */
+		if (IB_SA_QUERY_CANCELLED(query))
+			ret = -1;
+		else
+			ret = ib_post_send_mad(query->mad_buf, NULL);
+		if (ret) {
+			mad_send_wc.send_buf = query->mad_buf;
+			mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
+			spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+			send_handler(query->port->agent, &mad_send_wc);
+			spin_lock_irqsave(&ib_nl_request_lock, flags);
+		}
+		kfree(rinfo);
+	}
+	spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+}
+
+static inline int ib_nl_is_good_resolve_resp(const struct nlmsghdr *nlh)
+{
+	const struct nlattr *attr;
+	struct rdma_nla_ls_path_rec *rec;
+	struct ib_path_rec_data rec_data;
+
+	if (nlh->nlmsg_flags & RDMA_NL_LS_F_ERR)
+		return 0;
+
+	if (!(nlh->nlmsg_flags & RDMA_NL_LS_F_OK))
+		return 0;
+
+	if (nlmsg_len(nlh) < nla_attr_size(sizeof(*rec)))
+		return 0;
+
+	attr = (const struct nlattr *) nlmsg_data(nlh);
+	if (attr->nla_type != LS_NLA_TYPE_PATH_RECORD)
+		return 0;
+
+	rec = (struct rdma_nla_ls_path_rec *) nla_data(attr);
+	if (((rec->flags & LS_NLA_PATH_F_USER) &&
+	     nla_len(attr) < sizeof(struct ib_user_path_rec)) ||
+	    (!(rec->flags & LS_NLA_PATH_F_USER) &&
+	     nla_len(attr) < sizeof(rec_data.path_rec)))
+		return 0;
+
+	return 1;
+}
+
+static int ib_nl_handle_set_timeout(struct sk_buff *skb,
+				    struct netlink_callback *cb)
+{
+	const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
+	int timeout, delta, abs_delta;
+	const struct nlattr *attr;
+	struct rdma_nla_ls_timeout *to_attr;
+	unsigned long flags;
+	struct ib_nl_request_info *rinfo;
+	long delay = 0;
+
+	if (nlmsg_len(nlh) < nla_attr_size(sizeof(*to_attr)))
+		goto settimeout_out;
+
+	attr = (const struct nlattr *) nlmsg_data(nlh);
+	if (attr->nla_type != LS_NLA_TYPE_TIMEOUT ||
+	    nla_len(attr) != sizeof(*to_attr))
+		goto settimeout_out;
+
+	to_attr = (struct rdma_nla_ls_timeout *) nla_data(attr);
+	timeout = (int) to_attr->timeout;
+	if (timeout < IB_SA_LOCAL_SVC_TIMEOUT_MIN)
+		timeout = IB_SA_LOCAL_SVC_TIMEOUT_MIN;
+	if (timeout > IB_SA_LOCAL_SVC_TIMEOUT_MAX)
+		timeout = IB_SA_LOCAL_SVC_TIMEOUT_MAX;
+
+	delta = timeout - sa_local_svc_timeout_ms;
+	if (delta < 0)
+		abs_delta = -delta;
+	else
+		abs_delta = delta;
+
+	if (delta != 0) {
+		spin_lock_irqsave(&ib_nl_request_lock, flags);
+		sa_local_svc_timeout_ms = timeout;
+		list_for_each_entry(rinfo, &ib_nl_request_list, list) {
+			if (delta < 0 && abs_delta > rinfo->timeout)
+				rinfo->timeout = 0;
+			else
+				rinfo->timeout += delta;
+
+			/* Get the new delay from the first entry */
+			if (!delay) {
+				delay = rinfo->timeout - jiffies;
+				if (delay <= 0)
+					delay = 1;
+			}
+		}
+		if (delay)
+			mod_delayed_work(ib_nl_wq, &ib_nl_timed_work,
+					 (unsigned long)delay);
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+	}
+
+settimeout_out:
+	return skb->len;
+}
+
+static int ib_nl_handle_resolve_resp(struct sk_buff *skb,
+				     struct netlink_callback *cb)
+{
+	const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
+	unsigned long flags;
+	struct ib_nl_request_info *rinfo;
+	struct ib_sa_query *query;
+	struct ib_mad_send_buf *send_buf;
+	struct ib_mad_send_wc mad_send_wc;
+	int found = 0;
+	int ret;
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	list_for_each_entry(rinfo, &ib_nl_request_list, list) {
+		/*
+		 * If the query is cancelled, let the timeout routine
+		 * take care of it.
+		 */
+		if (nlh->nlmsg_seq == rinfo->seq) {
+			found = !IB_SA_QUERY_CANCELLED(rinfo->query);
+			if (found)
+				list_del(&rinfo->list);
+			break;
+		}
+	}
+
+	if (!found) {
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+		goto resp_out;
+	}
+
+	query = rinfo->query;
+	send_buf = query->mad_buf;
+
+	if (!ib_nl_is_good_resolve_resp(nlh)) {
+		/* if the result is a failure, send out the packet via IB */
+		IB_SA_DISABLE_LOCAL_SVC(query);
+		ret = ib_post_send_mad(query->mad_buf, NULL);
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+		if (ret) {
+			mad_send_wc.send_buf = send_buf;
+			mad_send_wc.status = IB_WC_GENERAL_ERR;
+			send_handler(query->port->agent, &mad_send_wc);
+		}
+	} else {
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+		ib_nl_process_good_resolve_rsp(query, nlh);
+	}
+
+	kfree(rinfo);
+resp_out:
+	return skb->len;
+}
+
+static struct ibnl_client_cbs ib_sa_cb_table[] = {
+	[RDMA_NL_LS_OP_RESOLVE] = {
+		.dump = ib_nl_handle_resolve_resp,
+		.module = THIS_MODULE },
+	[RDMA_NL_LS_OP_SET_TIMEOUT] = {
+		.dump = ib_nl_handle_set_timeout,
+		.module = THIS_MODULE },
+};
+
 static void free_sm_ah(struct kref *kref)
 {
	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -502,7 +947,13 @@ void ib_sa_cancel_query(int id, struct ib_sa_query *query)
	mad_buf = query->mad_buf;
	spin_unlock_irqrestore(&idr_lock, flags);

-	ib_cancel_mad(agent, mad_buf);
+	/*
+	 * If the query is still on the netlink request list, schedule
+	 * it to be cancelled by the timeout routine. Otherwise, it has been
+	 * sent to the MAD layer and has to be cancelled from there.
+	 */
+	if (!ib_nl_cancel_request(query))
+		ib_cancel_mad(agent, mad_buf);
 }
 EXPORT_SYMBOL(ib_sa_cancel_query);

@@ -638,6 +1089,14 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
	query->mad_buf->context[0] = query;
	query->id = id;

+	if (IB_SA_LOCAL_SVC_ENABLED(query)) {
+		if (!ibnl_chk_listeners(RDMA_NL_GROUP_LS)) {
+			if (!ib_nl_make_request(query))
+				return id;
+		}
+		IB_SA_DISABLE_LOCAL_SVC(query);
+	}
+
	ret = ib_post_send_mad(query->mad_buf, NULL);
	if (ret) {
		spin_lock_irqsave(&idr_lock, flags);
@@ -766,6 +1225,9 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,

	*sa_query = &query->sa_query;

+	IB_SA_ENABLE_LOCAL_SVC(&query->sa_query);
+	query->sa_query.input = rec;
+
	ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
	if (ret < 0)
		goto err2;
@@ -1250,6 +1712,8 @@ static int __init ib_sa_init(void)

	get_random_bytes(&tid, sizeof tid);

+	atomic_set(&ib_nl_sa_request_seq, 0);
+
	ret = ib_register_client(&sa_client);
	if (ret) {
		printk(KERN_ERR "Couldn't register ib_sa client\n");
@@ -1262,7 +1726,25 @@ static int __init ib_sa_init(void)
		goto err2;
	}

+	ib_nl_wq = create_singlethread_workqueue("ib_nl_sa_wq");
+	if (!ib_nl_wq) {
+		ret = -ENOMEM;
+		goto err3;
+	}
+
+	if (ibnl_add_client(RDMA_NL_SA, RDMA_NL_LS_NUM_OPS,
+			    ib_sa_cb_table)) {
+		pr_err("Failed to add netlink callback\n");
+		ret = -EINVAL;
+		goto err4;
+	}
+	INIT_DELAYED_WORK(&ib_nl_timed_work, ib_nl_request_timeout);
+
	return 0;
+err4:
+	destroy_workqueue(ib_nl_wq);
+err3:
+	mcast_cleanup();
 err2:
	ib_unregister_client(&sa_client);
 err1:
@@ -1271,6 +1753,10 @@ err1:

 static void __exit ib_sa_cleanup(void)
 {
+	ibnl_remove_client(RDMA_NL_SA);
+	cancel_delayed_work(&ib_nl_timed_work);
+	flush_workqueue(ib_nl_wq);
+	destroy_workqueue(ib_nl_wq);
	mcast_cleanup();
	ib_unregister_client(&sa_client);
	idr_destroy(&query_idr);