Message ID | 1431869786-6308-10-git-send-email-ogerlitz@mellanox.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Sun, May 17, 2015 at 04:36:25PM +0300, Or Gerlitz wrote: > From: Matan Barak <matanb@mellanox.com> > > mlx4 needs to report the number of supported timestamp > bits (mask) and the hca_core_clock frequency. This is critical information to parse the timestamp, why is it hidden in vendor specific land? We can't really look at the uapi changes here without also seeing the verbs side changes. I'd like to see Yann review all the uapi stuff as well. I'm not sure how much I like this idea of appending vendor stuff to these replies... Hiding uapi in a driver just seems like asking for trouble. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2015 at 12:58:01PM -0600, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 04:36:25PM +0300, Or Gerlitz wrote: > > From: Matan Barak <matanb@mellanox.com> > > > > mlx4 needs to report the number of supported timestamp > > bits (mask) and the hca_core_clock frequency. > > This is critical information to parse the timestamp, why is it hidden > in vendor specific land? Sorry, direct this toward the page offset in the next patch. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2015 at 10:00 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, May 19, 2015 at 12:58:01PM -0600, Jason Gunthorpe wrote: >> On Sun, May 17, 2015 at 04:36:25PM +0300, Or Gerlitz wrote: >> > From: Matan Barak <matanb@mellanox.com> >> > >> > mlx4 needs to report the number of supported timestamp >> > bits (mask) and the hca_core_clock frequency. >> >> This is critical information to parse the timestamp, why is it hidden >> in vendor specific land? > > Sorry, direct this toward the page offset in the next patch. We see this as a piece of detail related to the specific vendor driver. I don't see the point of overloading the user space verbs library or app with an API to get the offset of the clock within the mmap page and such, do you? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2015 at 10:11:43PM +0300, Or Gerlitz wrote: > > Sorry, direct this toward the page offset in the next patch. > > We see this as a piece of detail related to the specific vendor > driver. I don't see the point of overloading the user space verbs > library or app with an API to get the offset of the clock within the > mmap page and such, do you? Can't you find a better place for this than overloading the query_device call? I'd rather see you add some kind of driver-specific command than this. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2015 at 10:15 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, May 19, 2015 at 10:11:43PM +0300, Or Gerlitz wrote: >> > Sorry, direct this toward the page offset in the next patch. >> >> We see this as a piece of detail related to the specific vendor >> driver. I don't see the point of overloading the user space verbs >> library or app with an API to get the offset of the clock within the >> mmap page and such, do you? > > Can't you find a better place for this than overloading the > query_device call? I'd rather see you add some kind of driver-specific > command than this. Let's be precise -- Are you objecting adding the clock frequency and mask to the qeury device verb? why? If not, are objecting using the vendor specific track of the verb to pass from the vendor driver to the vendor library this or that detail which is needed for proper operation? why? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote: > Are you objecting adding the clock frequency and mask to the qeury device verb? > why? Lets see the verbs side and I'll let you know. > If not, are objecting using the vendor specific track of the verb to > pass from the vendor driver to the vendor library this or that detail > which is needed for proper operation? why? I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor extensions. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/20/2015 3:29 AM, Jason Gunthorpe wrote: > On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote: >> >Are you objecting adding the clock frequency and mask to the qeury device verb? >> >why? > Lets see the verbs side and I'll let you know. You mean the user series of libibverbs/libmlx4? I don't see why this should be a must for the review of the kernel bits. The user-space code is coming up soon, sure, but we should be able to review kernel patches without requiring to actually see the user-space code. As the change-logs here explained, the clock frequency is needed for applications to convert the HCA clock delta (current time - timestamp on WC) into nano-secs and such.The mask is needed to realize how many bits from the 64b time-stamp are supported by the HW. > >> >If not, are objecting using the vendor specific track of the verb to >> >pass from the vendor driver to the vendor library this or that detail >> >which is needed for proper operation? why? > I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor > extensions. But this is whole purpose of the udata framework in uverbs, right? for each uverb command the vendor user-space library has a well defined channel to communicate directly with the low level vendor driver throughout the uverbs channels. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/20/2015 3:29 AM, Jason Gunthorpe wrote: > On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote: >> >Are you objecting adding the clock frequency and mask to the qeury device verb? >> >why? > Lets see the verbs side and I'll let you know. You mean the user series of libibverbs/libmlx4? I don't see why this should be a must for the review of the kernel bits. The user-space code is coming up soon, sure, but we should be able to review kernel patches without requiring to actually see the user-space code. As the change-logs here explained, the clock frequency is needed for applications to convert the HCA clock delta (current time - timestamp on WC) into nano-secs and such.The mask is needed to realize how many bits from the 64b time-stamp are supported by the HW. > >> >If not, are objecting using the vendor specific track of the verb to >> >pass from the vendor driver to the vendor library this or that detail >> >which is needed for proper operation? why? > I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor > extensions. But this is whole purpose of the udata framework in uverbs, right? for each uverb command the vendor user-space library has a well defined channel to communicate directly with the low level vendor driver throughout the uverbs channels. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, Le mercredi 20 mai 2015 à 17:41 +0300, Or Gerlitz a écrit : > On 5/20/2015 3:29 AM, Jason Gunthorpe wrote: > > On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote: > > > > Are you objecting adding the clock frequency and mask to the > > > > qeury device verb? > > > > why? > > Lets see the verbs side and I'll let you know. > > You mean the user series of libibverbs/libmlx4? I don't see why this > should be a must for the review of the kernel bits. The user-space > code > is coming up soon, sure, but we should be able to review kernel > patches > without requiring to actually see the user-space code. > In some other subsystems: no userspace code, no merge. http://blog.ffwll.ch/2015/05/gfx-kernel-upstreaming-requirements.html > As the change-logs here explained, the clock frequency is needed for > applications to convert the HCA clock delta (current time - timestamp > on > WC) into nano-secs and such.The mask is needed to realize how many > bits > from the 64b time-stamp are supported by the HW. > > > > > > > If not, are objecting using the vendor specific track of the > > > > verb to > > > > pass from the vendor driver to the vendor library this or that > > > > detail > > > > which is needed for proper operation? why? > > I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor > > extensions. > > But this is whole purpose of the udata framework in uverbs, right? > for > each uverb command the vendor user-space library has a well defined > channel to communicate directly with the low level vendor driver > throughout the uverbs channels. > Uverbs convey information between kernel and userspace drivers to implement verbs for userspace application. I don't think it's designed to allow vendor to add random extensions in the best way with regard to backward/forward compability. Anyway, please, we have to make drivers which are going to behave as good citizens to the kernel *and* userspace. Adding a dedicated extensions which is going to be replaced later by a generic, vendor neutral, extension will be painful to maintain to ensure backward compatibility. So let's think how this timestamp extension can be made generic enough to be future-proof (and at least present proof to address current use cases). Regards.
On Wed, May 20, 2015 at 05:11:17PM +0200, Yann Droneaud wrote: > Hi, > > Le mercredi 20 mai 2015 à 17:41 +0300, Or Gerlitz a écrit : > > On 5/20/2015 3:29 AM, Jason Gunthorpe wrote: > > > On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote: > > > > > Are you objecting adding the clock frequency and mask to the > > > > > qeury device verb? > > > > > why? > > > Lets see the verbs side and I'll let you know. > > > > You mean the user series of libibverbs/libmlx4? I don't see why this > > should be a must for the review of the kernel bits. The user-space > > code > > is coming up soon, sure, but we should be able to review kernel > > patches > > without requiring to actually see the user-space code. > > > > In some other subsystems: no userspace code, no merge. > > http://blog.ffwll.ch/2015/05/gfx-kernel-upstreaming-requirements.html I think we need to have the same policy. Like for the time base related values, if the uapi is some kind of 'ibv_get_wc_timestamp_ns' call (which would make sense) then those are actually vendor values and we don't need them in the public structure. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 20, 2015 at 6:11 PM, Yann Droneaud <ydroneaud@opteya.com> wrote: >> But this is whole purpose of the udata framework in uverbs, right? for >> each uverb command the vendor user-space library has a well defined >> channel to communicate directly with the low level vendor driver >> throughout the uverbs channels. > Uverbs convey information between kernel and userspace drivers to > implement verbs for userspace application. I don't think it's designed > to allow vendor to add random extensions in the best way with regard to > backward/forward compability. Disagree that this is random extension. The people that designed this stack 10y ago (Roland and Co.) looked very nicely forward and realized that not all the HW are the same nor can be put 101% under the same API with no way out, and hence they came up with udata. Please state how you see the role of the uverbs udata mechanism. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 20, 2015 at 8:53 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: > On Wed, May 20, 2015 at 6:11 PM, Yann Droneaud <ydroneaud@opteya.com> wrote: > >>> But this is whole purpose of the udata framework in uverbs, right? for >>> each uverb command the vendor user-space library has a well defined >>> channel to communicate directly with the low level vendor driver >>> throughout the uverbs channels. > >> Uverbs convey information between kernel and userspace drivers to >> implement verbs for userspace application. I don't think it's designed >> to allow vendor to add random extensions in the best way with regard to >> backward/forward compability. > > Disagree that this is random extension. The people that designed this > stack 10y ago (Roland and Co.) looked very nicely forward and realized > that not all the HW are the same nor can be put 101% under the same > API with no way out, and hence they came up with udata. > > Please state how you see the role of the uverbs udata mechanism. Guys, still waiting to hear why you think it's wrong here to use the mechanism which was built from day-1 for the purpose of allowing the user-space driver library to communicate with the kernel driver and pass values in both directions. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, May 23, 2015 at 7:26 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote: > On Wed, May 20, 2015 at 8:53 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote: >> On Wed, May 20, 2015 at 6:11 PM, Yann Droneaud <ydroneaud@opteya.com> wrote: >> >>>> But this is whole purpose of the udata framework in uverbs, right? for >>>> each uverb command the vendor user-space library has a well defined >>>> channel to communicate directly with the low level vendor driver >>>> throughout the uverbs channels. >> >>> Uverbs convey information between kernel and userspace drivers to >>> implement verbs for userspace application. I don't think it's designed >>> to allow vendor to add random extensions in the best way with regard to >>> backward/forward compability. >> >> Disagree that this is random extension. The people that designed this >> stack 10y ago (Roland and Co.) looked very nicely forward and realized >> that not all the HW are the same nor can be put 101% under the same >> API with no way out, and hence they came up with udata. >> >> Please state how you see the role of the uverbs udata mechanism. > > Guys, still waiting to hear why you think it's wrong here to use the > mechanism which was built from day-1 for the purpose of allowing the > user-space driver library to communicate with the kernel driver and > pass values in both directions. Jason, ping, it's fair to require that if you made a review argument against the design done here and we've responded about a week ago, saying why this design is valid (e.g goes along the 10y old IB stack udata mechanism and such) -- you would comment on the response and not leave it in the air. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote: > Jason, ping, it's fair to require that if you made a review argument against > the design done here and we've responded about a week ago, saying why > this design is valid (e.g goes along the 10y old IB stack udata mechanism and > such) -- you would comment on the response and not leave it in the air. Was it not clear? Yann and I asked to see the user space side before reviewing this series further. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2015 at 7:06 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote: > >> Jason, ping, it's fair to require that if you made a review argument against >> the design done here and we've responded about a week ago, saying why >> this design is valid (e.g goes along the 10y old IB stack udata mechanism and >> such) -- you would comment on the response and not leave it in the air. > > Was it not clear? Yann and I asked to see the user space side before > reviewing this series further. Jason, you (U2 BTW) play really, really hard - refusing to say **one** word on your approach towards the built-in udata mechanism for uverbs which I asked you to comment on. On top of that, as happens **all** the **time** in netdev and possibly other subsystems, user space facing kernel patches were reviewed and accepted in this list over the last ten years with-out seeing their user-space counter parts @ the time of the kernel submission. There's no reason to impose this as hard requirement just b/c two reviewers ask that. You don't own this place. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2015 at 09:33:18PM +0300, Or Gerlitz wrote: > On Tue, May 26, 2015 at 7:06 PM, Jason Gunthorpe > <jgunthorpe@obsidianresearch.com> wrote: > > On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote: > > > >> Jason, ping, it's fair to require that if you made a review argument against > >> the design done here and we've responded about a week ago, saying why > >> this design is valid (e.g goes along the 10y old IB stack udata mechanism and > >> such) -- you would comment on the response and not leave it in the air. > > > > Was it not clear? Yann and I asked to see the user space side before > > reviewing this series further. > > Jason, you (U2 BTW) play really, really hard - refusing to say **one** > word on your approach towards the built-in udata mechanism for uverbs > which I asked you to comment on. And I asked to see the user space side and you have angrily refused every time. So I guess we are both playing hard. FWIW, your comments on udata seemed compelling, but I want to see the whole solution before saying I'm OK with it. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2015 at 9:53 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, May 26, 2015 at 09:33:18PM +0300, Or Gerlitz wrote: >> On Tue, May 26, 2015 at 7:06 PM, Jason Gunthorpe >> <jgunthorpe@obsidianresearch.com> wrote: >> > On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote: >> > >> >> Jason, ping, it's fair to require that if you made a review argument against >> >> the design done here and we've responded about a week ago, saying why >> >> this design is valid (e.g goes along the 10y old IB stack udata mechanism and >> >> such) -- you would comment on the response and not leave it in the air. >> > >> > Was it not clear? Yann and I asked to see the user space side before >> > reviewing this series further. >> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one** >> word on your approach towards the built-in udata mechanism for uverbs >> which I asked you to comment on. > And I asked to see the user space side and you have angrily refused > every time. AFAIR I never ever refused to show any piece of code which went under my hands towards Linux to any-one. > So I guess we are both playing hard. I disagree, you act as sort of being the boss here, stating every now and then your preferences and way of engineering things as the ultimate guidelines for Linux and/or RDMA engineering. > FWIW, your comments on udata seemed compelling Good to hear > but I want to see the whole solution before saying I'm OK with it. go look, not the final cut but should be close to what we'll submit https://github.com/matanb10/libibverbs timestamp-v0 https://github.com/matanb10/libmlx4 timestamp-v0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: > >> Jason, you (U2 BTW) play really, really hard - refusing to say **one** > >> word on your approach towards the built-in udata mechanism for uverbs > >> which I asked you to comment on. > > > And I asked to see the user space side and you have angrily refused > > every time. > > AFAIR I never ever refused to show any piece of code which went under > my hands towards Linux to any-one. For future reference, when someone asks a question and you go off on an tangental rant and ignore the question, then that process repeats, still without answering the question - most english speakers would call that refusing to answer the question. It is not looked upon kindly. I'm really confused why you didn't just post the github links last week, the patches are all a month old on there. Was it really so offensive to you that we wanted to review the kernel UAPI patches and verbs patches together? > > So I guess we are both playing hard. > > I disagree, you act as sort of being the boss here, stating every now > and then your preferences and way of engineering things as the > ultimate guidelines for Linux and/or RDMA engineering. Lets be clear Or, I have given you (and others) some very pointed comments and advice, privately and publicly. That is not 'being the boss' that is contributing to fix our community. When it comes to my patch comments, I give direction on what I want to see to provide my Reviewed-By. If you don't like it, then find someone else to review your code. I'm busy, and I don't work for you. If I don't want to review some patches because my questions have been ignored, then that is entirely my perogative. Pinging me *three times this week* on this stupid timestamp thing, is somewhere between annoying and offensive. Chill out Or. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: >> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one** >> >> word on your approach towards the built-in udata mechanism for uverbs >> >> which I asked you to comment on. >> >> > And I asked to see the user space side and you have angrily refused >> > every time. >> >> AFAIR I never ever refused to show any piece of code which went under >> my hands towards Linux to any-one. > For future reference, when someone asks a question and you go off on > an tangental rant and ignore the question, then that process repeats, > still without answering the question - most english speakers would > call that refusing to answer the question. It is not looked upon kindly. Jason, It's not that you asked to see the code ala "hey, do you happen to have a git with the user space code for people to inspect while doing the review on the kernel part", but rather U2 saying in a definitive manner that posting the user space code should be imposed as pre-requirement to acceptance of the kernel parts. In parallel, U2 totally rejected our usage of udata @ on the spot and when I mentioned that it's a feature which was designed for that purpose exactly and from day one, it took me three reminders to get a "you know what, maybe that can fly" comment from you. So here I started to realize that there's something in the attitude that goes beyond the details, and I made the you're not the boss comment. > I'm really confused why you didn't just post the github links last > week, the patches are all a month old on there. Was it really so > offensive to you that we wanted to review the kernel UAPI patches and > verbs patches together? > >> > So I guess we are both playing hard. >> >> I disagree, you act as sort of being the boss here, stating every now >> and then your preferences and way of engineering things as the >> ultimate guidelines for Linux and/or RDMA engineering. > > Lets be clear Or, I have given you (and others) some very pointed > comments and advice, privately and publicly. That is not 'being the > boss' that is contributing to fix our community. > > When it comes to my patch comments, I give direction on what I want to > see to provide my Reviewed-By. > > If you don't like it, then find someone else to review your code. > > I'm busy, and I don't work for you. If I don't want to review some > patches because my questions have been ignored, then that is entirely > my perogative. > > Pinging me *three times this week* on this stupid timestamp thing, is > somewhere between annoying and offensive. See above, why I made these pings. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote: > On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe > <jgunthorpe@obsidianresearch.com> wrote: > > On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: > >> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one** > >> >> word on your approach towards the built-in udata mechanism for uverbs > >> >> which I asked you to comment on. > >> > >> > And I asked to see the user space side and you have angrily refused > >> > every time. > >> > >> AFAIR I never ever refused to show any piece of code which went under > >> my hands towards Linux to any-one. > > > For future reference, when someone asks a question and you go off on > > an tangental rant and ignore the question, then that process repeats, > > still without answering the question - most english speakers would > > call that refusing to answer the question. It is not looked upon kindly. > > Jason, > > It's not that you asked to see the code ala "hey, do you happen to > have a git with the user space code for people to inspect while > doing the review on the kernel part", but rather U2 saying in a > definitive manner that posting the user space code should be imposed > as pre-requirement to acceptance of the kernel parts. I really didn't Or: First ask: 'We can't really look at the uapi changes here without also seeing the verbs side changes.' (I know others on the list feel the same, so I use 'we') Second ask: 'Lets see the verbs side and I'll let you know.' (.. to your questions based on my review comments ..) Third ask: 'I think we need to have the same policy.' (.. To Yann's point that other kernel communities have a mandatory UAPI policy) Fourth (exasperated) ask: 'Was it not clear? Yann and I asked to see the user space side before reviewing this series further.' I know you are ESL, and I cut you alot of slack, but *come on* - that is incredibly soft language, and certainly not bossing and imposing in a definitive manner a blanket requirement on all patches. Advice: You would be well served to spend a bit more time on your emails. I have no idea what 'but rather U2 saying' means, for instance. Sometimes I just guess at what you are trying to say :| That is the price we pay for an inclusive international community, but everyone needs to be careful before starting a flame war based on percived slight in the text and phrasing of a message. email is hard. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 27, 2015 at 9:48 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote: >> On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe >> <jgunthorpe@obsidianresearch.com> wrote: >> > On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: >> >> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one** >> >> >> word on your approach towards the built-in udata mechanism for uverbs >> >> >> which I asked you to comment on. >> >> >> >> > And I asked to see the user space side and you have angrily refused >> >> > every time. >> >> >> >> AFAIR I never ever refused to show any piece of code which went under >> >> my hands towards Linux to any-one. >> >> > For future reference, when someone asks a question and you go off on >> > an tangental rant and ignore the question, then that process repeats, >> > still without answering the question - most english speakers would >> > call that refusing to answer the question. It is not looked upon kindly. >> >> Jason, >> >> It's not that you asked to see the code ala "hey, do you happen to >> have a git with the user space code for people to inspect while >> doing the review on the kernel part", but rather U2 saying in a >> definitive manner that posting the user space code should be imposed >> as pre-requirement to acceptance of the kernel parts. > > I really didn't Or: > > First ask: > 'We can't really look at the uapi changes here without also seeing the > verbs side changes.' > (I know others on the list feel the same, so I use 'we') > > Second ask: > 'Lets see the verbs side and I'll let you know.' > (.. to your questions based on my review comments ..) > > Third ask: > 'I think we need to have the same policy.' > (.. To Yann's point that other kernel communities have a mandatory > UAPI policy) > > Fourth (exasperated) ask: > 'Was it not clear? Yann and I asked to see the user space side before > reviewing this series further.' > > I know you are ESL, and I cut you alot of slack, but *come on* - that > is incredibly soft language, and certainly not bossing and imposing in > a definitive manner a blanket requirement on all patches. Jason, ESL indeed am I, and in that respect, this clarification, even if being tedious to set or read, helps. Still, I'd like to further try and get you from where the bossing thing came: (1) "show me the user space code prior to acceptance of the kernel part" never was a requirement on this community since the day we were born (Q4/2004) (2) instantly rejecting a usage of a mechanism existing just for that use case since the first year of our life (2005) is something perceived by me as two people (that's the U2) that come and say, "game's over, the old, non-functioning boss is gone, new boss (== we) in town and forget about everything you knew before". So you say that is was wrong perception, I hope so. Let's see how Doug see your feedback, namely either as community reviewer feedback or as new rules being set overnight, waiting. This series is (1) simple compared to other stuff being reviewed here nowadays, and (2) has very nice value to latency sensitive applications, so two wins, lets get it done. > Advice: You would be well served to spend a bit more time on your > emails. I have no idea what 'but rather U2 saying' means, for > instance. Sometimes I just guess at what you are trying to say :| point taken > That is the price we pay for an inclusive international community, but > everyone needs to be careful before starting a flame war based on > percived slight in the text and phrasing of a message. email is hard. point taken. In this case (as you can see from my response above) I am not convinced yet that this was false positive. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 12:33:58AM +0300, Or Gerlitz wrote: > (2) instantly rejecting a usage of a mechanism existing just for that > use case since the first year of our life (2005) I didn't reject anything, I said I wanted to see the user side to continue my review. > This series is (1) simple compared to other stuff being reviewed here > nowadays, What? It is 11 patches, long, introduces several UAPI changes, does not implement a standardized feature, adds new uses of latent kernel functions and exists to support a unique feature of a single hardware vendor that few understand the usecase for. I would describe this as one of the toughest to review series on patchworks right now. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/28/2015 1:21 AM, Jason Gunthorpe wrote: > exists to support a unique feature of a single hardware vendor that few understand the use case for Responding in EIM (End In Mind) manner The use case is very clear, low latency applications using UD or RAW PACKET QPs that needs to know the time it takes for different HW/SW layers to get their packets through. The verbs version of SO_TIMESTAMP and friends (Documentation/networking/timestamping*)-- Christoph, can you add some info on common use-cases for this? I bet that > 20 upstream Eth NIC drivers supports time-stamping, so there's no reason that a modern HCA will not support it too. > It is 11 patches, long, introduces several UAPI changes, 1 IB/core: Change provider's API of create_cq to be extendible 2 IB/core: Change ib_create_cq to use struct ib_cq_init_attr 3 IB/core: Add CQ creation time-stamping flag 4 IB/core: Extend ib_uverbs_create_cq 5 IB/core: Add timestamp_mask and hca_core_clock to query_device 6 IB/core: Pass hardware specific data in query_device 7 IB/mlx4: Add mmap call to map the hardware clock 8 IB/mlx4: Support extended create_cq and query_device uverbs 9 IB/mlx4: Add support for timestamp in cq creation 10 IB/mlx4: Add timestamp_mask and hca_core_clock to query_device 11 IB/mlx4: Return hca core clock's offset in query_device vendor's data 01-02 just cosmetics that don't add any new functionality 03 adding CQ creation flag to the kernel verbs 04 new uverbs API to extend CQ creation 05 extending uverbs query device to return two more values 06 small fix to missing udata mechanics in uverbs query device 07-11 mlx4 provider side of the CQ setup and clock mmaping to user-space the core of the review should be around the 03-06 zone, and with experts such as Yann (and you) the uverbs part shouldn't be too complex to review and fix if needed. > does not implement a standardized feature, This is standard in Eth NIC, return the time-stamp of when the packet arrived/sent > adds new uses of latent kernel functions ESL I am Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 28 May 2015, Or Gerlitz wrote: > The use case is very clear, low latency applications using UD or RAW PACKET > QPs that needs to know the time it takes for different HW/SW layers to get > their packets through. The verbs version of SO_TIMESTAMP and friends > (Documentation/networking/timestamping*)-- Christoph, can you add some info on > common use-cases for this? Well timestamp information is widely in the financial industry to track latency and also for fairness. Exchanges/Brokers etc often must guarantee that the packet received first is not processed after packets received later. Packet timestamps in many ways affect processing and are also used for error checking etc. What we have to do without this is to use RDTSC to get a timestamp but the packet reception / sending time then is inaccurate due to the instructions that have to be executed before and after. And there is additional overhead due to that processing. > I bet that > 20 upstream Eth NIC drivers supports time-stamping, so there's no > reason that a modern HCA will not support it too. Right. Ethernet timekeeping support and timestamping has become fairly standard. Even the onboard one do that these days. > > does not implement a standardized feature, > > This is standard in Eth NIC, return the time-stamp of when the packet > arrived/sent It needs desperately to be in the standard. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 09:13:41AM -0500, Christoph Lameter wrote: > What we have to do without this is to use RDTSC to get a timestamp but the > packet reception / sending time then is inaccurate due to the instructions that > have to be executed before and after. And there is additional overhead due > to that processing. After a quick look through, the biggest question in my mind is what should the timestamp value in the wc be? Right now it is some coded thing in clock cycles. Should we require the driver to convert to ns before passing the wc back to the app? (Looks like the socket implementation uniformly uses us or ns) Should the app open code the conversion from clock cycles to ns, or vfunc down to the driver? Is the coding going to be OK for multiple HW vendors? These questions effect what the UAPI should be, if the answer is 'use ns everwhere' then some of this stuff being added does not belong in the general API. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 28 May 2015, Jason Gunthorpe wrote: > After a quick look through, the biggest question in my mind is what > should the timestamp value in the wc be? > > Right now it is some coded thing in clock cycles. This is sufficient since it can be converted to ns or whatever one wants. > Should we require the driver to convert to ns before passing the wc > back to the app? (Looks like the socket implementation uniformly uses > us or ns) But that requires additional processing. > Should the app open code the conversion from clock cycles to ns, or > vfunc down to the driver? The API provides the abilty to retrieve the clock freq which is sufficient for the user to convert the value to meaningful time values. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 12:14:15PM -0500, Christoph Lameter wrote: > On Thu, 28 May 2015, Jason Gunthorpe wrote: > > > After a quick look through, the biggest question in my mind is what > > should the timestamp value in the wc be? > > > > Right now it is some coded thing in clock cycles. > > This is sufficient since it can be converted to ns or whatever one wants. Sure it is sufficient, but is it a robust UAPI, will it support multiple hardware vendors? Is anyone else in ethernet using verbs to deliver IP packets? Having a conversion function, or doing it in the wc generation is more 'obviously safe' for future proofing the UAPI. > > Should we require the driver to convert to ns before passing the wc > > back to the app? (Looks like the socket implementation uniformly uses > > us or ns) > But that requires additional processing. Well, it is only additional if the app is going to ignore the time stamp or not convert it to ns right away. Is that the common use case? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 28 May 2015, Jason Gunthorpe wrote: > > This is sufficient since it can be converted to ns or whatever one wants. > > Sure it is sufficient, but is it a robust UAPI, will it support > multiple hardware vendors? What would prevent other hardware vendors from exporting their counters? > Is anyone else in ethernet using verbs to deliver IP packets? This is not only for Ethernet. Internally Infiniband is frequently used and there also timestamps are useful. > Having a conversion function, or doing it in the wc generation is more > 'obviously safe' for future proofing the UAPI. Well no. There has been a history of putting time corrections etc etc into these. Once you move from a raw counter to actual time various complications may need to be dealt with. For simple time differentials the counter is sufficient. If you really want proper "time" when something happens then you may want to scale and correct etc the value and have some sort of time sync approach. > > But that requires additional processing. > > Well, it is only additional if the app is going to ignore the time > stamp or not convert it to ns right away. Is that the common use case? The app may not need proper time but just a cycle count differential. A cycle count differential is often easier to handle than a ns value. And having ns values leads to the assumption that they are "correct" so various factors related to tuning the clock etc may have to be applied. You want to deal with these issues only if really necessary. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 01:30:52PM -0500, Christoph Lameter wrote: > On Thu, 28 May 2015, Jason Gunthorpe wrote: > > > > This is sufficient since it can be converted to ns or whatever one wants. > > > > Sure it is sufficient, but is it a robust UAPI, will it support > > multiple hardware vendors? > > What would prevent other hardware vendors from exporting their counters? Well, I'm not a HW vendor, so I don't know for sure, the proposal is: timestamp_mask - how many bits are valid in the timestamp. timestamp values could be 64bits the most. hca_core_clock - timestamp is given in HW cycles, hca_core_clock is the frequency of the HCA and is necessary in order to convert cycles to seconds. So presumably cycle difference is: (a-b) & mask And to ns is a*1E9/hca_core_clock So first, the above is a tricky bit of math to open code, integer overflow needs to be avoided - so at a minimum I'd like to see libibverbs provide this as a function call. I'd probably also say difference should be an inline too. If it is a function call then maybe we don't need to tell the app about hca_core_clock? Second: What about wrap around? Does it even make sense to expose less than 64 bits to userspace? Should the driver manage wrap around to create a flat 64 bit space? Otherwise, if the app has to do it, there is no event indicating wrap has occured. Without managing wrapping, cycle count difference is not a reliable calculation. > > Is anyone else in ethernet using verbs to deliver IP packets? > > This is not only for Ethernet. Internally Infiniband is frequently used > and there also timestamps are useful. Sure, but should any other vendors be commenting on this UAPI? > Well no. There has been a history of putting time corrections etc etc into > these. Once you move from a raw counter to actual time various > complications may need to be dealt with. For simple time differentials the > counter is sufficient. If you really want proper "time" when something > happens then you may want to scale and correct etc the value and have some > sort of time sync approach. Sure OK, I'm sold, export a cycle counter of some sort. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 28 May 2015, Jason Gunthorpe wrote: > Second: What about wrap around? Does it even make sense to expose less > than 64 bits to userspace? Should the driver manage wrap around to > create a flat 64 bit space? The wrap around is given by the mask. Cycle registers are often shorter than 64 bits. > Otherwise, if the app has to do it, there is no event indicating wrap > has occured. Well yes. You take the difference and then apply the mask. If you are outside of the cycle range then this will not give you a proper time. > Without managing wrapping, cycle count difference is not a reliable > calculation. Counters only works for the interval in which they do not go through a complete cycle. But that is the nature of these counters. And thats what you see when you look into the kernel timer subsystem for example. > > This is not only for Ethernet. Internally Infiniband is frequently used > > and there also timestamps are useful. > > Sure, but should any other vendors be commenting on this UAPI? I think they would have spoken up if they had any objections? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 03:34:00PM -0500, Christoph Lameter wrote: > On Thu, 28 May 2015, Jason Gunthorpe wrote: > > > Second: What about wrap around? Does it even make sense to expose less > > than 64 bits to userspace? Should the driver manage wrap around to > > create a flat 64 bit space? > > The wrap around is given by the mask. Cycle registers are often shorter > than 64 bits. I am aware of how cycle counters work. My point was exposing raw wrapping counters is a horrible UAPI. Shouldn't the driver software extend smaller counters to 64 bits? That would take a single or and an unlikely branch, so don't say 'performance' :) > through a complete cycle. But that is the nature of these counters. And > thats what you see when you look into the kernel timer subsystem for > example. Very little in the kernel is exposed to that wrapping, the timer subsystem takes care of it. Certainly, userspace never sees it. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Second: What about wrap around? Does it even make sense to expose less > > > than 64 bits to userspace? Should the driver manage wrap around to > > > create a flat 64 bit space? > > > > The wrap around is given by the mask. Cycle registers are often shorter > > than 64 bits. > > I am aware of how cycle counters work. > > My point was exposing raw wrapping counters is a horrible UAPI. > > Shouldn't the driver software extend smaller counters to 64 bits? > That would take a single or and an unlikely branch, so don't say > 'performance' :) It's one thing to time stamp a frame or packet. But this is assigning a time stamp to a work completion. I don't even know what that's supposed to mean when considering 2 GB (or larger) transfers, RDMA read operations, XRC, dynamic connections, out of order retransmissions, shared receive queues, and other exotic features. IMO, this is currently vendor specific functionality, and not obviously applicable as a generic feature. It is certainly poorly defined and exposed in a very implementation specific way. The use case given by Christoph only speaks of packet level time stamps. One could argue that such a use case would place the stamp near the packet (similar to the GRH), rather than embedded into a work completion. This would allow time stamps even in the absence of a work completion. IMO, vendors already have ways to expose vendor specific features to user space. I would mark this as vendor specific and keep it out of the core. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 28 May 2015, Jason Gunthorpe wrote: > My point was exposing raw wrapping counters is a horrible UAPI. Well this is a kernel bypass API and a lot of raw hardware issues will have to be handled since you do go directly to the device. > > through a complete cycle. But that is the nature of these counters. And > > thats what you see when you look into the kernel timer subsystem for > > example. > > Very little in the kernel is exposed to that wrapping, the timer > subsystem takes care of it. Certainly, userspace never sees it. Right but then we are not at the comfortable sockets API here but at the bare metal level. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 29 May 2015, Hefty, Sean wrote: > > Shouldn't the driver software extend smaller counters to 64 bits? > > That would take a single or and an unlikely branch, so don't say > > 'performance' :) > > It's one thing to time stamp a frame or packet. But this is assigning a ti= > me stamp to a work completion. I don't even know what that's supposed to m= > ean when considering 2 GB (or larger) transfers, RDMA read operations, XRC,= > dynamic connections, out of order retransmissions, shared receive queues, = > and other exotic features. IMO, this is currently vendor specific function= What is the issue here? The timestamp is created when the processing is finished by the nic and the completion entry becomes available. > ality, and not obviously applicable as a generic feature. It is certainly = > poorly defined and exposed in a very implementation specific way. It is exactly defined like any other cycle counters in hardware and it is exposed using an API that allows multiple vendors to use these cycle counters without regard to a particular vendors implementation. I sure wish that Intel would be supporting a feature like this. Please come up with a better alternative if there is one. This is likely going to be a differentiator for the vendor used in our industry. > The use case given by Christoph only speaks of packet level time stamps. O= > ne could argue that such a use case would place the stamp near the packet (= > similar to the GRH), rather than embedded into a work completion. This wou= > ld allow time stamps even in the absence of a work completion. > > IMO, vendors already have ways to expose vendor specific features to user s= > pace. I would mark this as vendor specific and keep it out of the core. How exactly would that work? How can we attach vendor specific extension to a completion structure? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-05-29 at 08:46 -0500, Christoph Lameter wrote: > On Thu, 28 May 2015, Jason Gunthorpe wrote: > > > My point was exposing raw wrapping counters is a horrible UAPI. > > Well this is a kernel bypass API and a lot of raw hardware issues will > have to be handled since you do go directly to the device. No, that's not entirely true, and it *certainly* is not the correct way to think about verbs extensions. Is it kernel bypass? Yes. Does it go direct to the hardware? Not as far as the user application is concerned. The direct hardware access is abstracted away in the verbs library. Because the verbs library is a hardware abstraction layer, any extensions to it need to be well thought out. And by that I mean if it is of general use, then it should be added in a general, abstract way that any hardware can implement. If it is specific to just one vendor's hardware, then it can be added in a means that is specific to that vendor's hardware. Now, as a general rule, I would call timestamps general. They should be added in a fashion that anyone can implement. They should also be well defined. Sean's questions raise a very valid point. Exactly what is being timestamped, and do we care about different timestamp options? Is it completion of message, start of message, transfer from HCA to main system memory completion, etc. The 00/10 header to this patch series was probably answering Sean's question, but just based on the name of the TIMESTAMP flag to the CQ creation attr struct it isn't clear that this is the case. > > > through a complete cycle. But that is the nature of these counters. And > > > thats what you see when you look into the kernel timer subsystem for > > > example. > > > > Very little in the kernel is exposed to that wrapping, the timer > > subsystem takes care of it. Certainly, userspace never sees it. > > Right but then we are not at the comfortable sockets API here but at the > bare metal level. That's not entirely true. We still hold to our abstrations, they are just intentionally kept very thin and high performing.
> What is the issue here? The timestamp is created when the processing is > finished by the nic and the completion entry becomes available. The timestamp is generated when a work completion entry is written. If there's a clear use case for this, it hasn't been described. The use case you mentioned only works if there is a 1:1 relationship between a packet and a work completion. That is not what is being defined here. > It is exactly defined like any other cycle counters in hardware and it is > exposed using an API that allows multiple vendors to use these cycle > counters without regard to a particular vendors implementation. I disagree. This is associated with a specific implementation. It assumes that the counter is part of a CQ entry, and that the counter is written when the completion is written. There's nothing that requires other vendors to follow that model, nor is it clear that this is a generic or useful enough operation that other vendors would want to follow this model. Why not have the time stamp record the start of the transaction? The end? Have two stamps, once for the first packet, and one for the last? Limit this to single packet operations only? > > The use case given by Christoph only speaks of packet level time stamps. > O= > > ne could argue that such a use case would place the stamp near the > packet (= > > similar to the GRH), rather than embedded into a work completion. This > wou= > > ld allow time stamps even in the absence of a work completion. > > > > IMO, vendors already have ways to expose vendor specific features to > user s= > > pace. I would mark this as vendor specific and keep it out of the core. > > How exactly would that work? How can we attach vendor specific extension > to a completion structure? I'm just stating that there is at other ways of exposing this sort of feature. A time stamp could just as easily be written with the data, similar to the grh. One of the points of defining the verbs extensions was exactly so that a vendor could export their own functionality. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 29 May 2015, Doug Ledford wrote: > > Well this is a kernel bypass API and a lot of raw hardware issues will > > have to be handled since you do go directly to the device. > > No, that's not entirely true, and it *certainly* is not the correct way > to think about verbs extensions. Is it kernel bypass? Yes. Does it go > direct to the hardware? Not as far as the user application is > concerned. The direct hardware access is abstracted away in the verbs There is a compromise here by using the kernel as a administrative function (setup and configuration of QPs) but using bare metal for the data path. The structures modified for send/receive are structures that are directly understood and handled by the hardware. That is the core benefit of the RDMA API which results in the wanted performance and latency. The administrative function / bare metal separation is also reflected in this patchset. The admin function allows the determination of the cycle counter freq and size. The bare metal cycle counter exists in the fastpath. > library. Because the verbs library is a hardware abstraction layer, any > extensions to it need to be well thought out. And by that I mean if it > is of general use, then it should be added in a general, abstract way > that any hardware can implement. If it is specific to just one vendor's > hardware, then it can be added in a means that is specific to that > vendor's hardware. What is particular here to the vendor? > Now, as a general rule, I would call timestamps general. They should be > added in a fashion that anyone can implement. They should also be well > defined. Sean's questions raise a very valid point. Exactly what is > being timestamped, and do we care about different timestamp options? Is > it completion of message, start of message, transfer from HCA to main > system memory completion, etc. The 00/10 header to this patch series > was probably answering Sean's question, but just based on the name of > the TIMESTAMP flag to the CQ creation attr struct it isn't clear that > this is the case. Ok then lets answer that. > > Right but then we are not at the comfortable sockets API here but at the > > bare metal level. > > That's not entirely true. We still hold to our abstrations, they are > just intentionally kept very thin and high performing. Well there is a distinction here. We provide the comfort of setup and administrative functions through the kernel API. We still try to isolate the application as much as possible when we go to the data paths but we need to hit bare metal in the fastpath in order to accomplish our mission of maximum performance and minimum latency. This cannot be accomplished with kernel calls and therefore it is an examplle of kernel bypass. We want this as comfy as possible of course and well defined. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-05-28 at 12:14 -0500, Christoph Lameter wrote: > On Thu, 28 May 2015, Jason Gunthorpe wrote: > > > After a quick look through, the biggest question in my mind is what > > should the timestamp value in the wc be? > > > > Right now it is some coded thing in clock cycles. > > This is sufficient since it can be converted to ns or whatever one wants. It is sufficient for your use. It is not, however, a good API. > > Should we require the driver to convert to ns before passing the wc > > back to the app? (Looks like the socket implementation uniformly uses > > us or ns) > > But that requires additional processing. Yes. > > Should the app open code the conversion from clock cycles to ns, or > > vfunc down to the driver? > > The API provides the abilty to retrieve the clock freq which is > sufficient for the user to convert the value to meaningful time values. I would prefer if the access to the timestamp were implemented via a function in libibverbs (I haven't looked at the git repo, too little time, I'll get to it). Something like ibv_get_cqe_timestamp(). That function should be general and return a suitable, normalized value (ns probably). If you just want a simple comparator without the overhead of normalizing to time, and are willing to accept the consequences of that, then I would expect you to use something like ibv_get_raw_cqe_timestamp() to get the unadulterated cycle counter. For the most part, the user space application should not know details like "we are using a cycle counter in the HCA processor for timestamping", that's below the level of abstraction we attempt to maintain at the verbs level. Libmlx4 should be the only thing aware of that fact, and it talks to the mlx4 driver in the kernel to get the details it needs. And by putting it into a function that libmlx4 implements, if libcxgb4 decides to implement timestamps and does it in a different way, the app doesn't care, it just uses the same call.
PiBOb3csIGFzIGEgZ2VuZXJhbCBydWxlLCBJIHdvdWxkIGNhbGwgdGltZXN0YW1wcyBnZW5lcmFs LiAgVGhleSBzaG91bGQgYmUNCj4gYWRkZWQgaW4gYSBmYXNoaW9uIHRoYXQgYW55b25lIGNhbiBp bXBsZW1lbnQuICBUaGV5IHNob3VsZCBhbHNvIGJlIHdlbGwNCj4gZGVmaW5lZC4gIFNlYW4ncyBx dWVzdGlvbnMgcmFpc2UgYSB2ZXJ5IHZhbGlkIHBvaW50LiAgRXhhY3RseSB3aGF0IGlzDQo+IGJl aW5nIHRpbWVzdGFtcGVkLCBhbmQgZG8gd2UgY2FyZSBhYm91dCBkaWZmZXJlbnQgdGltZXN0YW1w IG9wdGlvbnM/ICBJcw0KPiBpdCBjb21wbGV0aW9uIG9mIG1lc3NhZ2UsIHN0YXJ0IG9mIG1lc3Nh Z2UsIHRyYW5zZmVyIGZyb20gSENBIHRvIG1haW4NCj4gc3lzdGVtIG1lbW9yeSBjb21wbGV0aW9u LCBldGMuICBUaGUgMDAvMTAgaGVhZGVyIHRvIHRoaXMgcGF0Y2ggc2VyaWVzDQo+IHdhcyBwcm9i YWJseSBhbnN3ZXJpbmcgU2VhbidzIHF1ZXN0aW9uLCBidXQganVzdCBiYXNlZCBvbiB0aGUgbmFt ZSBvZg0KPiB0aGUgVElNRVNUQU1QIGZsYWcgdG8gdGhlIENRIGNyZWF0aW9uIGF0dHIgc3RydWN0 IGl0IGlzbid0IGNsZWFyIHRoYXQNCj4gdGhpcyBpcyB0aGUgY2FzZS4NCg0KSSBkaWRuJ3Qgc2Vl IHRoZSBpbmZvcm1hdGlvbiB0aGF0IEkgd2FzIGxvb2tpbmcgZm9yIGluIHRoZSBwYXRjaCBoZWFk ZXIgdG8gdGhpcyBzZXJpZXMuICBBcyBKYXNvbiBwb2ludGVkIG91dCwgdGhlIHVzZSBjYXNlIGlz IGxhY2tpbmcuDQoNCklNTywgaXQgY291bGQgbWFrZSBqdXN0IGFzIG11Y2ggc2Vuc2UgdG8gYXNz b2NpYXRlL2VuYWJsZSB0aW1lIHN0YW1waW5nIHdpdGggdGhlIFFQIGFzIHdpdGggdGhlIENRLCBv ciBldmVuIG1ha2UgaXQgY29uZmlndXJhYmxlIHBlciBvcGVyYXRpb24gb3Igb3BlcmF0aW9uIHR5 cGUuDQoNCklmIENocmlzdG9waCBoYXMgYSBjbGVhciB1c2UgY2FzZSBhbmQgd2FudHMgdG8gZ28g dG8gdGhlICdiYXJlIG1ldGFsJywgdGhlbiBhIHZlbmRvciBzcGVjaWZpYyBvcHRpb24gc2VlbXMg aWRlYWwuICBBdCBsZWFzdCB1bnRpbCB0aGVyZSBhcmUgb3RoZXIgaW1wbGVtZW50YXRpb25zIG9y IHRoZSBkcml2aW5nIHVzZSBjYXNlIGlzIGNsZWFyZXIuDQoNCi0gU2Vhbg0K -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-05-29 at 17:09 +0000, Hefty, Sean wrote: > > Now, as a general rule, I would call timestamps general. They should be > > added in a fashion that anyone can implement. They should also be well > > defined. Sean's questions raise a very valid point. Exactly what is > > being timestamped, and do we care about different timestamp options? Is > > it completion of message, start of message, transfer from HCA to main > > system memory completion, etc. The 00/10 header to this patch series > > was probably answering Sean's question, but just based on the name of > > the TIMESTAMP flag to the CQ creation attr struct it isn't clear that > > this is the case. > > I didn't see the information that I was looking for in the patch header to this series. As Jason pointed out, the use case is lacking. > > IMO, it could make just as much sense to associate/enable time stamping with the QP as with the CQ, or even make it configurable per operation or operation type. > > If Christoph has a clear use case and wants to go to the 'bare metal', then a vendor specific option seems ideal. At least until there are other implementations or the driving use case is clearer. The use case is clear IMO. It's for financial trading software. I don't think they really care about details like whether it's the start or end packet, or completion, or whatever. They need a tie breaker between when they have two different buy or sell orders on the same lot of stock. Any deterministic timing/ordering method will do as long as they consistently apply it I think. And the faster and lower overhead the process, the better. He doesn't really want a timestamp, he merely wants a sequence ordering. But a timestamp is what they are using to get him what he needs. Is that a fair guess Christoph?
> The use case is clear IMO. It's for financial trading software. I > don't think they really care about details like whether it's the start > or end packet, or completion, or whatever. They need a tie breaker > between when they have two different buy or sell orders on the same lot > of stock. Any deterministic timing/ordering method will do as long as > they consistently apply it I think. And the faster and lower overhead > the process, the better. He doesn't really want a timestamp, he merely > wants a sequence ordering. But a timestamp is what they are using to > get him what he needs. > > Is that a fair guess Christoph? I get Christoph's usage model. But AFAIK, he uses UD packets, often with multicast. So the timestamp ends up associated indirectly with an actual packet. That model makes sense to me. It’s the generic time stamping of a work completion (for, say, a 2 GB transfer over RC using an SRQ) that doesn't make sense to me. That's the part I'm struggling with.
On 5/29/2015 11:52 AM, Hefty, Sean wrote: >> What is the issue here? The timestamp is created when the processing is >> finished by the nic and the completion entry becomes available. > The timestamp is generated when a work completion entry is written. If there's a clear use case for this, it hasn't been described. The use case you mentioned only works if there is a 1:1 relationship between a packet and a work completion. That is not what is being defined here. > >> It is exactly defined like any other cycle counters in hardware and it is >> exposed using an API that allows multiple vendors to use these cycle >> counters without regard to a particular vendors implementation. > I disagree. This is associated with a specific implementation. It assumes that the counter is part of a CQ entry, and that the counter is written when the completion is written. There's nothing that requires other vendors to follow that model, nor is it clear that this is a generic or useful enough operation that other vendors would want to follow this model. Why not have the time stamp record the start of the transaction? The end? Have two stamps, once for the first packet, and one for the last? Limit this to single packet operations only? FWIW: cxgb4 hardware includes a hw timestamp in its CQE as well. It is used by SW for CQ overflow detection and debug timing analysis... >>> The use case given by Christoph only speaks of packet level time stamps. >> O= >>> ne could argue that such a use case would place the stamp near the >> packet (= >>> similar to the GRH), rather than embedded into a work completion. This >> wou= >>> ld allow time stamps even in the absence of a work completion. >>> >>> IMO, vendors already have ways to expose vendor specific features to >> user s= >>> pace. I would mark this as vendor specific and keep it out of the core. >> How exactly would that work? How can we attach vendor specific extension >> to a completion structure? > I'm just stating that there is at other ways of exposing this sort of feature. A time stamp could just as easily be written with the data, similar to the grh. One of the points of defining the verbs extensions was exactly so that a vendor could export their own functionality. > > - Sean > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 29 May 2015, Doug Ledford wrote: > The use case is clear IMO. It's for financial trading software. I > don't think they really care about details like whether it's the start > or end packet, or completion, or whatever. They need a tie breaker > between when they have two different buy or sell orders on the same lot > of stock. Any deterministic timing/ordering method will do as long as > they consistently apply it I think. And the faster and lower overhead > the process, the better. He doesn't really want a timestamp, he merely > wants a sequence ordering. But a timestamp is what they are using to > get him what he needs. > > Is that a fair guess Christoph? We want to have a time stamp when the action is complete and the data is available to the application or the send action is complete and the CQ entry can be reused. That is a well defined point and that is how the time stamps function in the existing implementation. This is an obvious understanding and its pretty clear. The time stamp needs to be at the end of the action because the timestamp is used to: 1. Assess the impact of network processing. This can be compared with packet timestamps from capture devices off the wire. 2. Delineate the start of packet processing in software. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > This is sufficient since it can be converted to ns or whatever one wants. > > It is sufficient for your use. It is not, however, a good API. I hate these foggy statements that you guys come up with. Why is it not a good API? Having a cycle counter without processiing overhead is a good thing and the way counters are handled is pretty well established. > > The API provides the abilty to retrieve the clock freq which is > > sufficient for the user to convert the value to meaningful time values. > > I would prefer if the access to the timestamp were implemented via a > function in libibverbs (I haven't looked at the git repo, too little > time, I'll get to it). Something like ibv_get_cqe_timestamp(). That > function should be general and return a suitable, normalized value (ns > probably). If you just want a simple comparator without the overhead of > normalizing to time, and are willing to accept the consequences of that, > then I would expect you to use something like That would introduce additional latencies and would make that feature no longer useful to us. The advantage of this approach is that the counter is in the same cacheline that is already used. It is very low overhead. > ibv_get_raw_cqe_timestamp() to get the unadulterated cycle counter. For > the most part, the user space application should not know details like > "we are using a cycle counter in the HCA processor for timestamping", Why not? A cycle counter is the general way of providing timestamps in hardware. RDTSC is such a cycle counter as well. There are numerous examples of counters like that. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 29 May 2015, Hefty, Sean wrote: > > What is the issue here? The timestamp is created when the processing is > > finished by the nic and the completion entry becomes available. > > The timestamp is generated when a work completion entry is written. If the= > re's a clear use case for this, it hasn't been described. The use case you= > mentioned only works if there is a 1:1 relationship between a packet and a= > work completion. That is not what is being defined here. It does make sense to have a timestamp when the work described by a CQ has been completed. For that you do not need a 1:1 correspondence. My use case is focused on single packet processing because that is what we do here. > > It is exactly defined like any other cycle counters in hardware and it is > > exposed using an API that allows multiple vendors to use these cycle > > counters without regard to a particular vendors implementation. > > I disagree. This is associated with a specific implementation. It assumes= > that the counter is part of a CQ entry, and that the counter is written wh= > en the completion is written. There's nothing that requires other vendors = > to follow that model, nor is it clear that this is a generic or useful enou= > gh operation that other vendors would want to follow this model. Why not h= > ave the time stamp record the start of the transaction? The end? Have two= > stamps, once for the first packet, and one for the last? Limit this to si= > ngle packet operations only? Why would you have a timestamp at the beginning of the transaction? That is useless because you can use packet capture devices to establish that timepoint. At that point you have not identified the CQ anyways. Having a timestamp when an action is complete makes sense, is generic and general. Adding the timestamp to the data means that the application now has to separate the metadata (timestamp) from the data stream. Not good. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> We want to have a time stamp when the action is complete and the data is > available to the application or the send action is complete and the CQ > entry can be reused. This is what polling the completion from the CQ tells you, independent of there being a time stamp. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 1 Jun 2015, Hefty, Sean wrote: > > We want to have a time stamp when the action is complete and the data is > > available to the application or the send action is complete and the CQ > > entry can be reused. > > This is what polling the completion from the CQ tells you, independent of t= > here being a time stamp. But you may not be polling that frequently. Polling threads may check multiple sources of events and may also currently executing code to handle an event. Also there is the problem of the OS interrupting you. All of these sources of inaccuracy are removed by the timestamp. That was for inbound. For outbound you do not get a timestamp without this feature. Typically reclaim of outbound work requeust is delayed quite a bit and getting a timestamp later does not reflect the actual time the message was sent. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index bb7d42c..b77dd77 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -233,6 +233,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * props->max_mcast_grp; props->max_map_per_fmr = dev->dev->caps.max_fmr_maps; + props->hca_core_clock = dev->dev->caps.hca_core_clock; + props->timestamp_mask = 0xFFFFFFFFFFFFULL; out: kfree(in_mad);