Message ID | 1364994796-10642-2-git-send-email-jsquyres@cisco.com (mailing list archive) |
---|---|
State | Rejected, archived |
Headers | show |
On Wed, Apr 3, 2013 at 6:13 AM, Jeff Squyres <jsquyres@cisco.com> wrote: > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 8a66758..4670f6f 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -174,8 +174,10 @@ enum ib_mtu { > IB_MTU_256 = 1, > IB_MTU_512 = 2, > IB_MTU_1024 = 3, > - IB_MTU_2048 = 4, > - IB_MTU_4096 = 5 > + IB_MTU_1500 = 4, > + IB_MTU_2048 = 5, > + IB_MTU_4096 = 6, > + IB_MTU_9000 = 7 > }; I don't think we can blithely do this... I think the IB enum values are defined to match the values used in the IB spec (PathRecord etc). Even if we change it so 1500 and 9000 are outside of the range used by the IB spec, I don't understand the motivation for this change. What does this buy us? How is iWARP working today without this change? - R. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 4/3/2013 11:52 AM, Roland Dreier wrote: > On Wed, Apr 3, 2013 at 6:13 AM, Jeff Squyres <jsquyres@cisco.com> wrote: >> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h >> index 8a66758..4670f6f 100644 >> --- a/include/rdma/ib_verbs.h >> +++ b/include/rdma/ib_verbs.h >> @@ -174,8 +174,10 @@ enum ib_mtu { >> IB_MTU_256 = 1, >> IB_MTU_512 = 2, >> IB_MTU_1024 = 3, >> - IB_MTU_2048 = 4, >> - IB_MTU_4096 = 5 >> + IB_MTU_1500 = 4, >> + IB_MTU_2048 = 5, >> + IB_MTU_4096 = 6, >> + IB_MTU_9000 = 7 >> }; > I don't think we can blithely do this... I think the IB enum values > are defined to match the values used in the IB spec (PathRecord etc). > > Even if we change it so 1500 and 9000 are outside of the range used by > the IB spec, I don't understand the motivation for this change. What > does this buy us? How is iWARP working today without this change? > > The IB_MTU stuff really doesn't apply to iwarp devices. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Apr 3, 2013, at 12:52 PM, Roland Dreier <roland@purestorage.com> wrote: > I don't think we can blithely do this... I think the IB enum values > are defined to match the values used in the IB spec (PathRecord etc). Gotcha. I inserted the enums in their proper numerical order to make the range comparisons simpler in ib_addr.h. But the 1500/9000 values could be tacked at the end of the current values (e.g., 6 and 7, respectively) -- it would just necessitate some different changes in ib_addr.h. > Even if we change it so 1500 and 9000 are outside of the range used by > the IB spec, I don't understand the motivation for this change. What > does this buy us? Our impression was that a userspace application cannot know the max message size it can send across a UD QP without having an accurate MTU enum. Specifically: the ibv_port_attr.max_msg_size value seems to be a higher-level value. E.g., on Mellanox devices, .max_msg_size is the max size of RC QP messages. Is there another way to determine max UD QP message size that we missed? > How is iWARP working today without this change? They lie about the actual/underlying MTU. But they don't have UD QPs, so .max_msg_size is sufficient for their RC QPs.
On 4/4/2013 8:22 AM, Jeff Squyres (jsquyres) wrote: > On Apr 3, 2013, at 12:52 PM, Roland Dreier <roland@purestorage.com> wrote: > >> I don't think we can blithely do this... I think the IB enum values >> are defined to match the values used in the IB spec (PathRecord etc). > > Gotcha. I inserted the enums in their proper numerical order to make the range comparisons simpler in ib_addr.h. But the 1500/9000 values could be tacked at the end of the current values (e.g., 6 and 7, respectively) -- it would just necessitate some different changes in ib_addr.h. What happens if in the future the IBTA adds new MTUs and allocates those currently reserved MTU values ? Wouldn't those values need to be standardized at the IBTA so that conflict won't occur ? -- Hal -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> What happens if in the future the IBTA adds new MTUs and allocates those > currently reserved MTU values ? Wouldn't those values need to be > standardized at the IBTA so that conflict won't occur ? The IBTA needs to standardize the values as they appear in MADs. The software values can differ. They would just need to be mapped. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma- > > > What happens if in the future the IBTA adds new MTUs and allocates > > those currently reserved MTU values ? Wouldn't those values need to be > > standardized at the IBTA so that conflict won't occur ? > > The IBTA needs to standardize the values as they appear in MADs. The > software values can differ. They would just need to be mapped. Even with a map I think having IB_MTU_1500 will cause some confusion as this is not an "IB" MTU. It seems an alternate enum name like RDMA_MTU_1500 is better. Just a stupid users opinion, Ira -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Even with a map I think having IB_MTU_1500 will cause some confusion as this is > not an "IB" MTU. It seems an alternate enum name like RDMA_MTU_1500 is better. Couldn't these be usable MTU's for RoCE? In hindsight, the user space API never should have exposed the mtu as an enum... Since an enum is an int, and we're never going to have anything with an mtu <= 5 bytes, couldn't we just store all new mtu values directly as their byte value? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Hefty, Sean > > > Even with a map I think having IB_MTU_1500 will cause some confusion > > as this is not an "IB" MTU. It seems an alternate enum name like > RDMA_MTU_1500 is better. > > Couldn't these be usable MTU's for RoCE? I guess so, I don't have much experience with RoCE. If that is the case the RoCE annex might look at reserving these values in the spec? > > In hindsight, the user space API never should have exposed the mtu as an > enum... > > Since an enum is an int, and we're never going to have anything with an mtu > <= 5 bytes, couldn't we just store all new mtu values directly as their byte > value? That seems like a pretty good idea. Ira -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Apr 4, 2013, at 1:57 PM, "Weiny, Ira" <ira.weiny@intel.com> wrote: >> In hindsight, the user space API never should have exposed the mtu as an >> enum... >> >> Since an enum is an int, and we're never going to have anything with an mtu >> <= 5 bytes, couldn't we just store all new mtu values directly as their byte >> value? > > That seems like a pretty good idea. Agreed, but changing to an int would seem to have some fairly serious backwards compatibility issues. What is the right way to move forward here? Just to re-state: our issue is that there does not seem to be any other way to get the max UD message size without knowing the actual MTU (are we incorrect about that?). Hence, using the IB-defined values is not really sufficient.
> Agreed, but changing to an int would seem to have some fairly serious backwards > compatibility issues. Why can't IB_MTU_1500 = 1500? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma- > Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums. > > On Apr 4, 2013, at 1:57 PM, "Weiny, Ira" <ira.weiny@intel.com> wrote: > > >> In hindsight, the user space API never should have exposed the mtu as > >> an enum... > >> > >> Since an enum is an int, and we're never going to have anything with > >> an mtu <= 5 bytes, couldn't we just store all new mtu values directly > >> as their byte value? > > > > That seems like a pretty good idea. > > > Agreed, but changing to an int would seem to have some fairly serious > backwards compatibility issues. > > What is the right way to move forward here? > > Just to re-state: our issue is that there does not seem to be any other way to > get the max UD message size without knowing the actual MTU (are we > incorrect about that?). Hence, using the IB-defined values is not really > sufficient. I guess I am confused. Is this patch trying to support RoCE or a VNIC? Ira -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Apr 8, 2013, at 6:16 PM, "Hefty, Sean" <sean.hefty@intel.com> wrote:
> Why can't IB_MTU_1500 = 1500?
It certainly could. Additionally, since Roland was a little concerned about the "IB" prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could have a different prefix -- perhaps RDMA_MTU_1500.
Although I admit that it would be weird to have an enum that contains values with different prefixes:
enum ib_mtu {
IB_MTU_256 = 1,
IB_MTU_512 = 2,
IB_MTU_1024 = 3,
IB_MTU_2048 = 4,
IB_MTU_4096 = 5,
RDMA_MTU_1500 = 1500,
RDMA_MTU_9000 = 9000
};
On Apr 9, 2013, at 4:10 PM, "Weiny, Ira" <ira.weiny@intel.com> wrote: >> Just to re-state: our issue is that there does not seem to be any other way to >> get the max UD message size without knowing the actual MTU (are we >> incorrect about that?). Hence, using the IB-defined values is not really >> sufficient. > > I guess I am confused. Is this patch trying to support RoCE or a VNIC? Both, actually. The RoCE driver lies about its MTU (IIRC, it claims IB_MTU_1024, even if the MTU is actually 1500). So AFAIK, there's no way to know what the UD max message size is on RoCE, because the max message size attribute on port refers to RC, not UD.
> -----Original Message----- > From: Jeff Squyres (jsquyres) [mailto:jsquyres@cisco.com] > Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums. > > On Apr 8, 2013, at 6:16 PM, "Hefty, Sean" <sean.hefty@intel.com> wrote: > > > Why can't IB_MTU_1500 = 1500? > Sean, If the IBTA were to release new MTU enumerations which values would you recommend then? Ira > > It certainly could. Additionally, since Roland was a little concerned about the > "IB" prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could > have a different prefix -- perhaps RDMA_MTU_1500. > > Although I admit that it would be weird to have an enum that contains values > with different prefixes: > > enum ib_mtu { > IB_MTU_256 = 1, > IB_MTU_512 = 2, > IB_MTU_1024 = 3, > IB_MTU_2048 = 4, > IB_MTU_4096 = 5, > RDMA_MTU_1500 = 1500, > RDMA_MTU_9000 = 9000 > }; > > -- > Jeff Squyres > jsquyres@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> If the IBTA were to release new MTU enumerations which values would you > recommend then? I don't think there's a great solution here. We're mixing IBTA encoded values with non-IBTA values. We could reserve the 6-bit encoded values for IB, and use direct values for others (or at least jump beyond the 6-bit range). Or we can stop matching new IBTA MTU encodings (e.g. IB_MTU_1500 = 6). Or we go back in time and make mtu an int. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Hefty, Sean > Sent: Tuesday, April 09, 2013 6:30 PM > To: Weiny, Ira; Jeff Squyres (jsquyres) > Cc: Hal Rosenstock; Roland Dreier; linux-rdma@vger.kernel.org; Upinder > Malhi (umalhi) > Subject: RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums. > > > If the IBTA were to release new MTU enumerations which values would > > you recommend then? > > I don't think there's a great solution here. We're mixing IBTA encoded values > with non-IBTA values. We could reserve the 6-bit encoded values for IB, and > use direct values for others (or at least jump beyond the 6-bit range). Or we > can stop matching new IBTA MTU encodings (e.g. IB_MTU_1500 = 6). Or we > go back in time and make mtu an int. > I thought reserving the 6 bit's for IB and allowing the enum values to match the MTU was a pretty good compromise. Especially since PathRecord is defined in sa.h which is provided by libibverbs. That allows for that IB MTU enum to be used there. OTOH, now that we have moved toward decent defines in the libibumad library we could define the MTU enum there. But then we again go down the path of defining things multiple places and confusing the users... :-( As an aside I like the use of RDMA_MTU_* for these values. Again to distinguish them from the IBTA values. But I know that is poor form. Ira -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Apr 9, 2013, at 10:44 PM, "Weiny, Ira" <ira.weiny@intel.com> wrote:
> As an aside I like the use of RDMA_MTU_* for these values. Again to distinguish them from the IBTA values. But I know that is poor form.
So what's the right way to move forward on this? Is it this:
enum ib_mtu {
IB_MTU_256 = 1,
IB_MTU_512 = 2,
IB_MTU_1024 = 3,
IB_MTU_2048 = 4,
IB_MTU_4096 = 5,
RDMA_MTU_1500 = 1500,
RDMA_MTU_9000 = 9000
};
On Apr 12, 2013, at 11:40 AM, Jeff Squyres (jsquyres) <jsquyres@cisco.com> wrote: >> As an aside I like the use of RDMA_MTU_* for these values. Again to distinguish them from the IBTA values. But I know that is poor form. > > So what's the right way to move forward on this? Is it this: > > enum ib_mtu { > IB_MTU_256 = 1, > IB_MTU_512 = 2, > IB_MTU_1024 = 3, > IB_MTU_2048 = 4, > IB_MTU_4096 = 5, > RDMA_MTU_1500 = 1500, > RDMA_MTU_9000 = 9000 > }; Bump.
On 04/19/2013 11:24 AM, Jeff Squyres (jsquyres) wrote: > On Apr 12, 2013, at 11:40 AM, Jeff Squyres (jsquyres) <jsquyres@cisco.com> wrote: > >>> As an aside I like the use of RDMA_MTU_* for these values. Again to distinguish them from the IBTA values. But I know that is poor form. >> >> So what's the right way to move forward on this? Is it this: >> >> enum ib_mtu { >> IB_MTU_256 = 1, >> IB_MTU_512 = 2, >> IB_MTU_1024 = 3, >> IB_MTU_2048 = 4, >> IB_MTU_4096 = 5, >> RDMA_MTU_1500 = 1500, >> RDMA_MTU_9000 = 9000 >> }; > > > Bump. > This seems reasonable, but still concerns me a bit. The original version was flat out wrong because you can't re-arrange any exposed enum like this without requiring that all user space apps be recompiled. This is especially true because ibv_mtu_enum_to_int is an inline, so any compiled programs won't have been updated by a shared library update, yet the new enum values can end up showing up, so the apps break when they start seeing -1 as the return value, or when they interpret 1500 as 2048, etc. I wonder if the correct way forward isn't to leave these two functions alone (as inlines you can't change them without a recompile anyway). However, I would officially change the documentation to specify that both enum ib_mtu in the queue_pair structs and the result of ib_mtu_enum_to_int is not exact on non-IB link layers and is deprecated in favor of a new function: ib_qp_mtu() (or similar) that takes a queue pair and returns the real mtu for that queue pair based on params and device the queue pair is on (I could also see it being ibv_dev_mtu(struct ibv_device *) instead if you want to query the mtu before you make a queue pair instead). Then for both IBoE and USNIC devices, I would just store the actual MTU in the internal ibv device struct and return that when requested. This maintains 100% back compatibility with existing apps but allows a path for people to get better performance/utilization by using the new function instead of the old one. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> This seems reasonable, but still concerns me a bit. The original > version was flat out wrong because you can't re-arrange any exposed enum > like this without requiring that all user space apps be recompiled. > This is especially true because ibv_mtu_enum_to_int is an inline ib_mtu_enum_to_int() is a kernel function, not user space, so I think we're fine here, unless you're concerned about drivers built out of tree. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/20/2013 07:29 PM, Hefty, Sean wrote: >> This seems reasonable, but still concerns me a bit. The original >> version was flat out wrong because you can't re-arrange any exposed >> enum like this without requiring that all user space apps be >> recompiled. This is especially true because ibv_mtu_enum_to_int is >> an inline > > ib_mtu_enum_to_int() is a kernel function, not user space, so I think > we're fine here, unless you're concerned about drivers built out of > tree. Well, although *I* might have to worry about out of kernel drivers, I wouldn't suggest such for upstream. However, for some reason I had it in my mind when I was reading the patch that it was against libibverbs. That's what I get for staying up late and reviewing when I'm tired :-/ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Apr 22, 2013, at 1:30 PM, Doug Ledford <dledford@redhat.com> wrote: > However, for some reason I had it > in my mind when I was reading the patch that it was against libibverbs. > That's what I get for staying up late and reviewing when I'm tired :-/ There were other patches against libibverbs that were submitted at the same time. That being said, I see two obvious ways to go forward, both of which have pros/cons: 1. Extend the enum ib_mtu to include new enum values for 1500 and 9000 -- probably with a different prefix to indicate that they're not IBTA-sanctioned values (note that this will also require corresponding changes in libibverbs, since MTU values get passed up from kernel to userspace). PRO: fixes the immediate problem PRO: probably the lowest impact solution; just adding some more enum values CON: weird naming (IB_ and RDMA_ prefixes in the same ib_mtu enum; probably something similar in userspace) CON: doesn't do anything to address other MTU values (e.g., what if someone has an MTU of 1498?) 2. Change all instances of ib_mtu/ibv_mtu to an int. Code such as "switch(mtu) case IBV_MTU_1024: ..." will need to be updated to "switch(mtu) case 1024: ...". PRO: solves the problem for all MTU values PRO: eliminates the enum-to-int translation functions CON: much driver code will need to be updated per above, and also update logic checking for out-of-bounds MTU calues CON: similarly, userspace apps will need to be updated; it might be worthwhile to bump libibverbs to 2.x, and then intentionally change the MTU field names in ibv_port_attr and ibv_qp_attr so that apps using those fields will fail to compile with libibverbs 2.x (and therefore forcibly realize they need to adapt to the new int MTU values)
On 04/22/2013 02:00 PM, Jeff Squyres (jsquyres) wrote: > 2. Change all instances of ib_mtu/ibv_mtu to an int. Code such as > "switch(mtu) case IBV_MTU_1024: ..." will need to be updated to > "switch(mtu) case 1024: ...". > > PRO: solves the problem for all MTU values > PRO: eliminates the num-to-int translation functions > CON: much driver code will need to be updated per above, and also > update logic checking for out-of-bounds MTU calues > CON: similarly, userspace apps will need to be updated; it might be > worthwhile to bump libibverbs to 2.x, and then intentionally change > the MTU field names in ibv_port_attr and ibv_qp_attr so that apps > using those fields will fail to compile with libibverbs 2.x > (and therefore forcibly realize they need to adapt to the new > int MTU values) I was actually thinking that an ibverbs API version 2.0 might be an interesting way to go. The proliferation of non-IB link layers providing the verbs API make some of the original assumptions of IB link layer in the original API obsolete. But, if we were to do that, I'd take some time to really think the issue over and try to catch all of the needed updates in one go. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Apr 22, 2013, at 4:00 PM, Doug Ledford <dledford@redhat.com> wrote: >> 2. Change all instances of ib_mtu/ibv_mtu to an int. Code such as >> "switch(mtu) case IBV_MTU_1024: ..." will need to be updated to >> "switch(mtu) case 1024: ...". > > I was actually thinking that an ibverbs API version 2.0 might be an > interesting way to go. The proliferation of non-IB link layers > providing the verbs API make some of the original assumptions of IB link > layer in the original API obsolete. But, if we were to do that, I'd > take some time to really think the issue over and try to catch all of > the needed updates in one go. In addition to the MTU, another obvious issue is the active_speed attribute on the ibv_port_attr. On the kernel side, it's an enum (IB_SPEED_SDR through IB_SPEED_EDR), but there's no corresponding enum names in libibverbs. It would be good to make this value a non-enum-int, too.
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index 9996539..1f6fbbc 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -200,10 +200,14 @@ static inline enum ib_mtu iboe_get_mtu(int mtu) */ mtu = mtu - IB_GRH_BYTES - IB_BTH_BYTES - 28; - if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096)) + if (mtu >= ib_mtu_enum_to_int(IB_MTU_9000)) + return IB_MTU_9000; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_4096)) return IB_MTU_4096; else if (mtu >= ib_mtu_enum_to_int(IB_MTU_2048)) return IB_MTU_2048; + else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1500)) + return IB_MTU_1500; else if (mtu >= ib_mtu_enum_to_int(IB_MTU_1024)) return IB_MTU_1024; else if (mtu >= ib_mtu_enum_to_int(IB_MTU_512)) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 8a66758..4670f6f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -174,8 +174,10 @@ enum ib_mtu { IB_MTU_256 = 1, IB_MTU_512 = 2, IB_MTU_1024 = 3, - IB_MTU_2048 = 4, - IB_MTU_4096 = 5 + IB_MTU_1500 = 4, + IB_MTU_2048 = 5, + IB_MTU_4096 = 6, + IB_MTU_9000 = 7 }; static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) @@ -184,8 +186,10 @@ static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) case IB_MTU_256: return 256; case IB_MTU_512: return 512; case IB_MTU_1024: return 1024; + case IB_MTU_1500: return 1500; case IB_MTU_2048: return 2048; case IB_MTU_4096: return 4096; + case IB_MTU_9000: return 9000; default: return -1; } }