diff mbox

[for-next,09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

Message ID 1431869786-6308-10-git-send-email-ogerlitz@mellanox.com (mailing list archive)
State Superseded
Headers show

Commit Message

Or Gerlitz May 17, 2015, 1:36 p.m. UTC
From: Matan Barak <matanb@mellanox.com>

mlx4 needs to report the number of supported timestamp
bits (mask) and the hca_core_clock frequency.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

Comments

Jason Gunthorpe May 19, 2015, 6:58 p.m. UTC | #1
On Sun, May 17, 2015 at 04:36:25PM +0300, Or Gerlitz wrote:
> From: Matan Barak <matanb@mellanox.com>
> 
> mlx4 needs to report the number of supported timestamp
> bits (mask) and the hca_core_clock frequency.

This is critical information to parse the timestamp, why is it hidden
in vendor specific land?

We can't really look at the uapi changes here without also seeing the
verbs side changes.

I'd like to see Yann review all the uapi stuff as well.

I'm not sure how much I like this idea of appending vendor stuff to
these replies... Hiding uapi in a driver just seems like asking for
trouble.
	
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 19, 2015, 7 p.m. UTC | #2
On Tue, May 19, 2015 at 12:58:01PM -0600, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 04:36:25PM +0300, Or Gerlitz wrote:
> > From: Matan Barak <matanb@mellanox.com>
> > 
> > mlx4 needs to report the number of supported timestamp
> > bits (mask) and the hca_core_clock frequency.
> 
> This is critical information to parse the timestamp, why is it hidden
> in vendor specific land?

Sorry, direct this toward the page offset in the next patch.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 19, 2015, 7:11 p.m. UTC | #3
On Tue, May 19, 2015 at 10:00 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, May 19, 2015 at 12:58:01PM -0600, Jason Gunthorpe wrote:
>> On Sun, May 17, 2015 at 04:36:25PM +0300, Or Gerlitz wrote:
>> > From: Matan Barak <matanb@mellanox.com>
>> >
>> > mlx4 needs to report the number of supported timestamp
>> > bits (mask) and the hca_core_clock frequency.
>>
>> This is critical information to parse the timestamp, why is it hidden
>> in vendor specific land?
>
> Sorry, direct this toward the page offset in the next patch.

We see this as a piece of detail related to the specific vendor
driver. I don't see the point of overloading the user space verbs
library or app with an API to get the offset of the clock within the
mmap page and such, do you?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 19, 2015, 7:15 p.m. UTC | #4
On Tue, May 19, 2015 at 10:11:43PM +0300, Or Gerlitz wrote:
> > Sorry, direct this toward the page offset in the next patch.
> 
> We see this as a piece of detail related to the specific vendor
> driver. I don't see the point of overloading the user space verbs
> library or app with an API to get the offset of the clock within the
> mmap page and such, do you?

Can't you find a better place for this than overloading the
query_device call? I'd rather see you add some kind of driver-specific
command than this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 19, 2015, 7:30 p.m. UTC | #5
On Tue, May 19, 2015 at 10:15 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, May 19, 2015 at 10:11:43PM +0300, Or Gerlitz wrote:
>> > Sorry, direct this toward the page offset in the next patch.
>>
>> We see this as a piece of detail related to the specific vendor
>> driver. I don't see the point of overloading the user space verbs
>> library or app with an API to get the offset of the clock within the
>> mmap page and such, do you?
>
> Can't you find a better place for this than overloading the
> query_device call? I'd rather see you add some kind of driver-specific
> command than this.

Let's be precise --

Are you objecting adding the clock frequency and mask to the qeury device verb?
why?

If not, are objecting using the vendor specific track of the verb to
pass from the vendor driver to the vendor library this or that detail
which is needed for proper operation? why?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 20, 2015, 12:29 a.m. UTC | #6
On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote:
> Are you objecting adding the clock frequency and mask to the qeury device verb?
> why?

Lets see the verbs side and I'll let you know.

> If not, are objecting using the vendor specific track of the verb to
> pass from the vendor driver to the vendor library this or that detail
> which is needed for proper operation? why?

I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor
extensions.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 20, 2015, 2:40 p.m. UTC | #7
On 5/20/2015 3:29 AM, Jason Gunthorpe wrote:
> On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote:
>> >Are you objecting adding the clock frequency and mask to the qeury device verb?
>> >why?
> Lets see the verbs side and I'll let you know.

You mean the user series of libibverbs/libmlx4? I don't see why this 
should be a must for the review of the kernel bits. The user-space code 
is coming up soon, sure, but we should be able to review kernel patches 
without requiring to actually see the user-space code.

As the change-logs here explained, the clock frequency is needed for 
applications to convert the HCA clock delta (current time - timestamp on 
WC) into nano-secs and such.The mask is needed to realize how many bits
from the 64b time-stamp are supported by the HW.

>
>> >If not, are objecting using the vendor specific track of the verb to
>> >pass from the vendor driver to the vendor library this or that detail
>> >which is needed for proper operation? why?
> I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor
> extensions.

But this is whole purpose of the udata framework in uverbs, right? for 
each uverb command the vendor user-space library has a well defined 
channel to communicate directly with the low level vendor driver 
throughout the uverbs channels.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 20, 2015, 2:41 p.m. UTC | #8
On 5/20/2015 3:29 AM, Jason Gunthorpe wrote:
> On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote:
>> >Are you objecting adding the clock frequency and mask to the qeury device verb?
>> >why?
> Lets see the verbs side and I'll let you know.

You mean the user series of libibverbs/libmlx4? I don't see why this 
should be a must for the review of the kernel bits. The user-space code 
is coming up soon, sure, but we should be able to review kernel patches 
without requiring to actually see the user-space code.

As the change-logs here explained, the clock frequency is needed for 
applications to convert the HCA clock delta (current time - timestamp on 
WC) into nano-secs and such.The mask is needed to realize how many bits
from the 64b time-stamp are supported by the HW.

>
>> >If not, are objecting using the vendor specific track of the verb to
>> >pass from the vendor driver to the vendor library this or that detail
>> >which is needed for proper operation? why?
> I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor
> extensions.

But this is whole purpose of the udata framework in uverbs, right? for 
each uverb command the vendor user-space library has a well defined 
channel to communicate directly with the low level vendor driver 
throughout the uverbs channels.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yann Droneaud May 20, 2015, 3:11 p.m. UTC | #9
Hi,

Le mercredi 20 mai 2015 à 17:41 +0300, Or Gerlitz a écrit :
> On 5/20/2015 3:29 AM, Jason Gunthorpe wrote:
> > On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote:
> > > > Are you objecting adding the clock frequency and mask to the 
> > > > qeury device verb?
> > > > why?
> > Lets see the verbs side and I'll let you know.
> 
> You mean the user series of libibverbs/libmlx4? I don't see why this 
> should be a must for the review of the kernel bits. The user-space 
> code 
> is coming up soon, sure, but we should be able to review kernel 
> patches 
> without requiring to actually see the user-space code.
> 

In some other subsystems: no userspace code, no merge.

http://blog.ffwll.ch/2015/05/gfx-kernel-upstreaming-requirements.html


> As the change-logs here explained, the clock frequency is needed for 
> applications to convert the HCA clock delta (current time - timestamp 
> on 
> WC) into nano-secs and such.The mask is needed to realize how many 
> bits
> from the 64b time-stamp are supported by the HW.
> 
> > 
> > > > If not, are objecting using the vendor specific track of the 
> > > > verb to
> > > > pass from the vendor driver to the vendor library this or that 
> > > > detail
> > > > which is needed for proper operation? why?
> > I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor
> > extensions.
> 
> But this is whole purpose of the udata framework in uverbs, right? 
> for 
> each uverb command the vendor user-space library has a well defined 
> channel to communicate directly with the low level vendor driver 
> throughout the uverbs channels.
> 

Uverbs convey information between kernel and userspace drivers to
implement verbs for userspace application. I don't think it's designed
to allow vendor to add random extensions in the best way with regard to
backward/forward compability.


Anyway, please, we have to make drivers which are going to behave as
good citizens to the kernel *and* userspace. Adding a dedicated
extensions which is going to be replaced later by a generic, vendor
neutral, extension will be painful to maintain to ensure backward
compatibility.

So let's think how this timestamp extension can be made generic enough
to be future-proof (and at least present proof to address current use
cases).

Regards.
Jason Gunthorpe May 20, 2015, 5:37 p.m. UTC | #10
On Wed, May 20, 2015 at 05:11:17PM +0200, Yann Droneaud wrote:
> Hi,
> 
> Le mercredi 20 mai 2015 à 17:41 +0300, Or Gerlitz a écrit :
> > On 5/20/2015 3:29 AM, Jason Gunthorpe wrote:
> > > On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote:
> > > > > Are you objecting adding the clock frequency and mask to the 
> > > > > qeury device verb?
> > > > > why?
> > > Lets see the verbs side and I'll let you know.
> > 
> > You mean the user series of libibverbs/libmlx4? I don't see why this 
> > should be a must for the review of the kernel bits. The user-space 
> > code 
> > is coming up soon, sure, but we should be able to review kernel 
> > patches 
> > without requiring to actually see the user-space code.
> > 
> 
> In some other subsystems: no userspace code, no merge.
> 
> http://blog.ffwll.ch/2015/05/gfx-kernel-upstreaming-requirements.html

I think we need to have the same policy.

Like for the time base related values, if the uapi is some kind of
'ibv_get_wc_timestamp_ns' call (which would make sense) then those are
actually vendor values and we don't need them in the public structure.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 20, 2015, 5:53 p.m. UTC | #11
On Wed, May 20, 2015 at 6:11 PM, Yann Droneaud <ydroneaud@opteya.com> wrote:

>> But this is whole purpose of the udata framework in uverbs, right? for
>> each uverb command the vendor user-space library has a well defined
>> channel to communicate directly with the low level vendor driver
>> throughout the uverbs channels.

> Uverbs convey information between kernel and userspace drivers to
> implement verbs for userspace application. I don't think it's designed
> to allow vendor to add random extensions in the best way with regard to
> backward/forward compability.

Disagree that this is random extension. The people that designed this
stack 10y ago (Roland and Co.) looked very nicely forward and realized
that not all the HW are the same nor can be put 101% under the same
API with no way out, and hence they came up with udata.

Please state how you see the role of the uverbs udata mechanism.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 23, 2015, 4:26 a.m. UTC | #12
On Wed, May 20, 2015 at 8:53 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Wed, May 20, 2015 at 6:11 PM, Yann Droneaud <ydroneaud@opteya.com> wrote:
>
>>> But this is whole purpose of the udata framework in uverbs, right? for
>>> each uverb command the vendor user-space library has a well defined
>>> channel to communicate directly with the low level vendor driver
>>> throughout the uverbs channels.
>
>> Uverbs convey information between kernel and userspace drivers to
>> implement verbs for userspace application. I don't think it's designed
>> to allow vendor to add random extensions in the best way with regard to
>> backward/forward compability.
>
> Disagree that this is random extension. The people that designed this
> stack 10y ago (Roland and Co.) looked very nicely forward and realized
> that not all the HW are the same nor can be put 101% under the same
> API with no way out, and hence they came up with udata.
>
> Please state how you see the role of the uverbs udata mechanism.

Guys, still waiting to hear why you think it's wrong here to use the
mechanism which was built from day-1 for the purpose of allowing the
user-space driver library to communicate with the kernel driver and
pass values in both directions.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 26, 2015, 8:10 a.m. UTC | #13
On Sat, May 23, 2015 at 7:26 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Wed, May 20, 2015 at 8:53 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Wed, May 20, 2015 at 6:11 PM, Yann Droneaud <ydroneaud@opteya.com> wrote:
>>
>>>> But this is whole purpose of the udata framework in uverbs, right? for
>>>> each uverb command the vendor user-space library has a well defined
>>>> channel to communicate directly with the low level vendor driver
>>>> throughout the uverbs channels.
>>
>>> Uverbs convey information between kernel and userspace drivers to
>>> implement verbs for userspace application. I don't think it's designed
>>> to allow vendor to add random extensions in the best way with regard to
>>> backward/forward compability.
>>
>> Disagree that this is random extension. The people that designed this
>> stack 10y ago (Roland and Co.) looked very nicely forward and realized
>> that not all the HW are the same nor can be put 101% under the same
>> API with no way out, and hence they came up with udata.
>>
>> Please state how you see the role of the uverbs udata mechanism.
>
> Guys, still waiting to hear why you think it's wrong here to use the
> mechanism which was built from day-1 for the purpose of allowing the
> user-space driver library to communicate with the kernel driver and
> pass values in both directions.

Jason, ping, it's fair to require that if you made a review argument against
the design done here and we've responded about a week ago, saying why
this design is valid (e.g goes along the 10y old IB stack udata mechanism and
such) -- you would comment on the response and not  leave it in the air.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 26, 2015, 4:06 p.m. UTC | #14
On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote:

> Jason, ping, it's fair to require that if you made a review argument against
> the design done here and we've responded about a week ago, saying why
> this design is valid (e.g goes along the 10y old IB stack udata mechanism and
> such) -- you would comment on the response and not  leave it in the air.

Was it not clear? Yann and I asked to see the user space side before
reviewing this series further.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 26, 2015, 6:33 p.m. UTC | #15
On Tue, May 26, 2015 at 7:06 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote:
>
>> Jason, ping, it's fair to require that if you made a review argument against
>> the design done here and we've responded about a week ago, saying why
>> this design is valid (e.g goes along the 10y old IB stack udata mechanism and
>> such) -- you would comment on the response and not  leave it in the air.
>
> Was it not clear? Yann and I asked to see the user space side before
> reviewing this series further.

Jason, you (U2 BTW) play really, really hard - refusing to say **one**
word on your approach towards the built-in udata mechanism for uverbs
which I asked you to comment on.

On top of that, as happens **all** the **time** in netdev and possibly
other subsystems, user space facing kernel patches were reviewed and
accepted in this list over the last ten years with-out seeing their
user-space counter parts @ the time of the kernel submission. There's
no reason to impose this as hard requirement just b/c two reviewers
ask that. You don't own this place.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 26, 2015, 6:53 p.m. UTC | #16
On Tue, May 26, 2015 at 09:33:18PM +0300, Or Gerlitz wrote:
> On Tue, May 26, 2015 at 7:06 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote:
> >
> >> Jason, ping, it's fair to require that if you made a review argument against
> >> the design done here and we've responded about a week ago, saying why
> >> this design is valid (e.g goes along the 10y old IB stack udata mechanism and
> >> such) -- you would comment on the response and not  leave it in the air.
> >
> > Was it not clear? Yann and I asked to see the user space side before
> > reviewing this series further.
> 
> Jason, you (U2 BTW) play really, really hard - refusing to say **one**
> word on your approach towards the built-in udata mechanism for uverbs
> which I asked you to comment on.

And I asked to see the user space side and you have angrily refused
every time.

So I guess we are both playing hard.

FWIW, your comments on udata seemed compelling, but I want to see the
whole solution before saying I'm OK with it.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 26, 2015, 8:39 p.m. UTC | #17
On Tue, May 26, 2015 at 9:53 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, May 26, 2015 at 09:33:18PM +0300, Or Gerlitz wrote:
>> On Tue, May 26, 2015 at 7:06 PM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> > On Tue, May 26, 2015 at 11:10:45AM +0300, Or Gerlitz wrote:
>> >
>> >> Jason, ping, it's fair to require that if you made a review argument against
>> >> the design done here and we've responded about a week ago, saying why
>> >> this design is valid (e.g goes along the 10y old IB stack udata mechanism and
>> >> such) -- you would comment on the response and not  leave it in the air.
>> >
>> > Was it not clear? Yann and I asked to see the user space side before
>> > reviewing this series further.
>>
>> Jason, you (U2 BTW) play really, really hard - refusing to say **one**
>> word on your approach towards the built-in udata mechanism for uverbs
>> which I asked you to comment on.

> And I asked to see the user space side and you have angrily refused
> every time.

AFAIR I never ever refused to show any piece of code which went under
my hands towards Linux to any-one.

> So I guess we are both playing hard.

I disagree, you act as sort of being the boss here, stating every now
and then your preferences and way of engineering things as the
ultimate guidelines for Linux and/or RDMA engineering.

> FWIW, your comments on udata seemed compelling

Good to hear

> but I want to see the whole solution before saying I'm OK with it.

go look, not the final cut but should be close to what we'll submit

https://github.com/matanb10/libibverbs timestamp-v0
https://github.com/matanb10/libmlx4 timestamp-v0
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 26, 2015, 10:07 p.m. UTC | #18
On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one**
> >> word on your approach towards the built-in udata mechanism for uverbs
> >> which I asked you to comment on.
> 
> > And I asked to see the user space side and you have angrily refused
> > every time.
> 
> AFAIR I never ever refused to show any piece of code which went under
> my hands towards Linux to any-one.

For future reference, when someone asks a question and you go off on
an tangental rant and ignore the question, then that process repeats,
still without answering the question - most english speakers would
call that refusing to answer the question. It is not looked upon
kindly.

I'm really confused why you didn't just post the github links last
week, the patches are all a month old on there. Was it really so
offensive to you that we wanted to review the kernel UAPI patches and
verbs patches together?

> > So I guess we are both playing hard.
> 
> I disagree, you act as sort of being the boss here, stating every now
> and then your preferences and way of engineering things as the
> ultimate guidelines for Linux and/or RDMA engineering.

Lets be clear Or, I have given you (and others) some very pointed
comments and advice, privately and publicly. That is not 'being the
boss' that is contributing to fix our community.

When it comes to my patch comments, I give direction on what I want to
see to provide my Reviewed-By.

If you don't like it, then find someone else to review your code.

I'm busy, and I don't work for you. If I don't want to review some
patches because my questions have been ignored, then that is entirely
my perogative.

Pinging me *three times this week* on this stupid timestamp thing, is
somewhere between annoying and offensive.

Chill out Or.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 27, 2015, 11:54 a.m. UTC | #19
On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
>> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one**
>> >> word on your approach towards the built-in udata mechanism for uverbs
>> >> which I asked you to comment on.
>>
>> > And I asked to see the user space side and you have angrily refused
>> > every time.
>>
>> AFAIR I never ever refused to show any piece of code which went under
>> my hands towards Linux to any-one.

> For future reference, when someone asks a question and you go off on
> an tangental rant and ignore the question, then that process repeats,
> still without answering the question - most english speakers would
> call that refusing to answer the question. It is not looked upon kindly.

Jason,

It's not that you asked to see the code ala "hey, do you happen to
have a git with
the user space code for people to inspect while doing the review on
the kernel part", but
rather U2 saying in a definitive manner that posting the user space
code should be
imposed as pre-requirement to acceptance of the kernel parts.

In parallel, U2 totally rejected our usage of udata @ on the spot and
when I mentioned
that it's a feature which was designed for that purpose exactly and
from day one, it took
me three reminders to get a "you know what, maybe that can fly"
comment from you.

So here I started to realize that there's something in the attitude
that goes beyond
the details, and I made the you're not the boss comment.


> I'm really confused why you didn't just post the github links last
> week, the patches are all a month old on there. Was it really so
> offensive to you that we wanted to review the kernel UAPI patches and
> verbs patches together?
>
>> > So I guess we are both playing hard.
>>
>> I disagree, you act as sort of being the boss here, stating every now
>> and then your preferences and way of engineering things as the
>> ultimate guidelines for Linux and/or RDMA engineering.
>
> Lets be clear Or, I have given you (and others) some very pointed
> comments and advice, privately and publicly. That is not 'being the
> boss' that is contributing to fix our community.
>
> When it comes to my patch comments, I give direction on what I want to
> see to provide my Reviewed-By.
>
> If you don't like it, then find someone else to review your code.
>
> I'm busy, and I don't work for you. If I don't want to review some
> patches because my questions have been ignored, then that is entirely
> my perogative.
>
> Pinging me *three times this week* on this stupid timestamp thing, is
> somewhere between annoying and offensive.

See above, why I made these pings.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 27, 2015, 6:48 p.m. UTC | #20
On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote:
> On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
> >> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one**
> >> >> word on your approach towards the built-in udata mechanism for uverbs
> >> >> which I asked you to comment on.
> >>
> >> > And I asked to see the user space side and you have angrily refused
> >> > every time.
> >>
> >> AFAIR I never ever refused to show any piece of code which went under
> >> my hands towards Linux to any-one.
> 
> > For future reference, when someone asks a question and you go off on
> > an tangental rant and ignore the question, then that process repeats,
> > still without answering the question - most english speakers would
> > call that refusing to answer the question. It is not looked upon kindly.
> 
> Jason,
> 
> It's not that you asked to see the code ala "hey, do you happen to
> have a git with the user space code for people to inspect while
> doing the review on the kernel part", but rather U2 saying in a
> definitive manner that posting the user space code should be imposed
> as pre-requirement to acceptance of the kernel parts.

I really didn't Or:

First ask:
 'We can't really look at the uapi changes here without also seeing the
  verbs side changes.'
 (I know others on the list feel the same, so I use 'we')

Second ask:
 'Lets see the verbs side and I'll let you know.'
(.. to your questions based on my review comments ..)

Third ask:
 'I think we need to have the same policy.'
(.. To Yann's point that other kernel communities have a mandatory
  UAPI policy)

Fourth (exasperated) ask:
 'Was it not clear? Yann and I asked to see the user space side before
  reviewing this series further.'

I know you are ESL, and I cut you alot of slack, but *come on* - that
is incredibly soft language, and certainly not bossing and imposing in
a definitive manner a blanket requirement on all patches.

Advice: You would be well served to spend a bit more time on your
emails. I have no idea what 'but rather U2 saying' means, for
instance. Sometimes I just guess at what you are trying to say :|

That is the price we pay for an inclusive international community, but
everyone needs to be careful before starting a flame war based on
percived slight in the text and phrasing of a message. email is hard.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 27, 2015, 9:33 p.m. UTC | #21
On Wed, May 27, 2015 at 9:48 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote:
>> On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> > On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
>> >> >> Jason, you (U2 BTW) play really, really hard - refusing to say **one**
>> >> >> word on your approach towards the built-in udata mechanism for uverbs
>> >> >> which I asked you to comment on.
>> >>
>> >> > And I asked to see the user space side and you have angrily refused
>> >> > every time.
>> >>
>> >> AFAIR I never ever refused to show any piece of code which went under
>> >> my hands towards Linux to any-one.
>>
>> > For future reference, when someone asks a question and you go off on
>> > an tangental rant and ignore the question, then that process repeats,
>> > still without answering the question - most english speakers would
>> > call that refusing to answer the question. It is not looked upon kindly.
>>
>> Jason,
>>
>> It's not that you asked to see the code ala "hey, do you happen to
>> have a git with the user space code for people to inspect while
>> doing the review on the kernel part", but rather U2 saying in a
>> definitive manner that posting the user space code should be imposed
>> as pre-requirement to acceptance of the kernel parts.
>
> I really didn't Or:
>
> First ask:
>  'We can't really look at the uapi changes here without also seeing the
>   verbs side changes.'
>  (I know others on the list feel the same, so I use 'we')
>
> Second ask:
>  'Lets see the verbs side and I'll let you know.'
> (.. to your questions based on my review comments ..)
>
> Third ask:
>  'I think we need to have the same policy.'
> (.. To Yann's point that other kernel communities have a mandatory
>   UAPI policy)
>
> Fourth (exasperated) ask:
>  'Was it not clear? Yann and I asked to see the user space side before
>   reviewing this series further.'
>
> I know you are ESL, and I cut you alot of slack, but *come on* - that
> is incredibly soft language, and certainly not bossing and imposing in
> a definitive manner a blanket requirement on all patches.

Jason,

ESL indeed am I, and in that respect, this clarification, even if
being tedious to set or read, helps.

Still, I'd like to further try and get you from where the bossing thing came:

(1) "show me the user space code prior to acceptance of the kernel
part" never was a requirement on this community since the day we were
born (Q4/2004)

(2) instantly rejecting a usage of a mechanism existing just for that
use case since the first year of our life (2005)

is something perceived by me as two people (that's the U2) that come
and say, "game's over, the old, non-functioning boss is gone, new boss
(== we) in town and forget about everything you knew before".

So you say that is was wrong perception, I hope so. Let's see how Doug
see your feedback, namely either as community reviewer feedback or as
new rules being set overnight, waiting.

This series is (1) simple compared to other stuff being reviewed here
nowadays, and (2) has very nice value to latency sensitive
applications, so two wins, lets get it done.

> Advice: You would be well served to spend a bit more time on your
> emails. I have no idea what 'but rather U2 saying' means, for
> instance. Sometimes I just guess at what you are trying to say :|

point taken

> That is the price we pay for an inclusive international community, but
> everyone needs to be careful before starting a flame war based on
> percived slight in the text and phrasing of a message. email is hard.

point taken. In this case (as you can see from my response above) I am
not convinced yet that this was false positive.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 27, 2015, 10:21 p.m. UTC | #22
On Thu, May 28, 2015 at 12:33:58AM +0300, Or Gerlitz wrote:

> (2) instantly rejecting a usage of a mechanism existing just for that
> use case since the first year of our life (2005)

I didn't reject anything, I said I wanted to see the user side to
continue my review.

> This series is (1) simple compared to other stuff being reviewed here
> nowadays,

What? It is 11 patches, long, introduces several UAPI changes, does
not implement a standardized feature, adds new uses of latent kernel
functions and exists to support a unique feature of a single hardware
vendor that few understand the usecase for.

I would describe this as one of the toughest to review series on
patchworks right now.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 28, 2015, 7:04 a.m. UTC | #23
On 5/28/2015 1:21 AM, Jason Gunthorpe wrote:
> exists to support a unique feature of a single hardware vendor that few understand the use case for

Responding in EIM (End In Mind) manner

The use case is very clear, low latency applications using UD or RAW 
PACKET QPs that needs to know the time it takes for different HW/SW 
layers to get their packets through. The verbs version of SO_TIMESTAMP 
and friends (Documentation/networking/timestamping*)-- Christoph, can 
you add some info on common use-cases for this?

I bet that > 20 upstream Eth NIC drivers supports time-stamping, so 
there's no reason that a modern HCA will not support it too.

> It is 11 patches, long, introduces several UAPI changes,

      1  IB/core: Change provider's API of create_cq to be extendible
      2  IB/core: Change ib_create_cq to use struct ib_cq_init_attr
      3  IB/core: Add CQ creation time-stamping flag
      4  IB/core: Extend ib_uverbs_create_cq
      5  IB/core: Add timestamp_mask and hca_core_clock to query_device
      6  IB/core: Pass hardware specific data in query_device
      7  IB/mlx4: Add mmap call to map the hardware clock
      8  IB/mlx4: Support extended create_cq and query_device uverbs
      9  IB/mlx4: Add support for timestamp in cq creation
     10  IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
     11  IB/mlx4: Return hca core clock's offset in query_device 
vendor's data

01-02 just cosmetics that don't add any new functionality
03    adding CQ creation flag to the kernel verbs
04    new uverbs API to extend CQ creation
05    extending uverbs query device to return two more values
06    small fix to missing udata mechanics in uverbs query device
07-11 mlx4 provider side of the CQ setup and clock mmaping to user-space

the core of the review should be around the 03-06 zone, and with experts 
such as
Yann (and you) the uverbs part shouldn't be too complex to review and 
fix if needed.

> does not implement a standardized feature,

This is standard in Eth NIC, return the time-stamp of when the packet 
arrived/sent


> adds new uses of latent kernel functions

ESL I am

Or.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 28, 2015, 2:13 p.m. UTC | #24
On Thu, 28 May 2015, Or Gerlitz wrote:

> The use case is very clear, low latency applications using UD or RAW PACKET
> QPs that needs to know the time it takes for different HW/SW layers to get
> their packets through. The verbs version of SO_TIMESTAMP and friends
> (Documentation/networking/timestamping*)-- Christoph, can you add some info on
> common use-cases for this?

Well timestamp information is widely in the financial industry to track
latency and also for fairness. Exchanges/Brokers etc often must guarantee
that the packet received first is not processed after packets received
later. Packet timestamps in many ways affect processing and are also used
for error checking etc.

What we have to do without this is to use RDTSC to get a timestamp but the
packet reception / sending time then is inaccurate due to the instructions that
have to be executed before and after. And there is additional overhead due
to that processing.

> I bet that > 20 upstream Eth NIC drivers supports time-stamping, so there's no
> reason that a modern HCA will not support it too.

Right. Ethernet timekeeping support and timestamping has become fairly
standard. Even the onboard one do that these days.

> > does not implement a standardized feature,
>
> This is standard in Eth NIC, return the time-stamp of when the packet
> arrived/sent

It needs desperately to be in the standard.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 28, 2015, 4:24 p.m. UTC | #25
On Thu, May 28, 2015 at 09:13:41AM -0500, Christoph Lameter wrote:
> What we have to do without this is to use RDTSC to get a timestamp but the
> packet reception / sending time then is inaccurate due to the instructions that
> have to be executed before and after. And there is additional overhead due
> to that processing.

After a quick look through, the biggest question in my mind is what
should the timestamp value in the wc be?

Right now it is some coded thing in clock cycles.

Should we require the driver to convert to ns before passing the wc
back to the app? (Looks like the socket implementation uniformly uses
us or ns)

Should the app open code the conversion from clock cycles to ns, or
vfunc down to the driver?

Is the coding going to be OK for multiple HW vendors?

These questions effect what the UAPI should be, if the answer is 'use
ns everwhere' then some of this stuff being added does not belong in
the general API.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 28, 2015, 5:14 p.m. UTC | #26
On Thu, 28 May 2015, Jason Gunthorpe wrote:

> After a quick look through, the biggest question in my mind is what
> should the timestamp value in the wc be?
>
> Right now it is some coded thing in clock cycles.

This is sufficient since it can be converted to ns or whatever one wants.

> Should we require the driver to convert to ns before passing the wc
> back to the app? (Looks like the socket implementation uniformly uses
> us or ns)

But that requires additional processing.

> Should the app open code the conversion from clock cycles to ns, or
> vfunc down to the driver?

The API provides the abilty to retrieve the clock freq which is
sufficient for the user to convert the value to meaningful time values.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 28, 2015, 5:50 p.m. UTC | #27
On Thu, May 28, 2015 at 12:14:15PM -0500, Christoph Lameter wrote:
> On Thu, 28 May 2015, Jason Gunthorpe wrote:
> 
> > After a quick look through, the biggest question in my mind is what
> > should the timestamp value in the wc be?
> >
> > Right now it is some coded thing in clock cycles.
> 
> This is sufficient since it can be converted to ns or whatever one wants.

Sure it is sufficient, but is it a robust UAPI, will it support
multiple hardware vendors?

Is anyone else in ethernet using verbs to deliver IP packets?

Having a conversion function, or doing it in the wc generation is more
'obviously safe' for future proofing the UAPI.

> > Should we require the driver to convert to ns before passing the wc
> > back to the app? (Looks like the socket implementation uniformly uses
> > us or ns)

> But that requires additional processing.

Well, it is only additional if the app is going to ignore the time
stamp or not convert it to ns right away. Is that the common use case?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 28, 2015, 6:30 p.m. UTC | #28
On Thu, 28 May 2015, Jason Gunthorpe wrote:

> > This is sufficient since it can be converted to ns or whatever one wants.
>
> Sure it is sufficient, but is it a robust UAPI, will it support
> multiple hardware vendors?

What would prevent other hardware vendors from exporting their counters?

> Is anyone else in ethernet using verbs to deliver IP packets?

This is not only for Ethernet. Internally Infiniband is frequently used
and there also timestamps are useful.

> Having a conversion function, or doing it in the wc generation is more
> 'obviously safe' for future proofing the UAPI.

Well no. There has been a history of putting time corrections etc etc into
these. Once you move from a raw counter to actual time various
complications may need to be dealt with. For simple time differentials the
counter is sufficient. If you really want proper "time" when something
happens then you may want to scale and correct etc the value and have some
sort of time sync approach.

> > But that requires additional processing.
>
> Well, it is only additional if the app is going to ignore the time
> stamp or not convert it to ns right away. Is that the common use case?

The app may not need proper time but just a cycle count differential. A
cycle count differential is often easier to handle than a ns value. And
having ns values leads to the assumption that they are "correct" so
various factors related to tuning the clock etc may have to be applied.
You want to deal with these issues only if really necessary.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 28, 2015, 7:50 p.m. UTC | #29
On Thu, May 28, 2015 at 01:30:52PM -0500, Christoph Lameter wrote:
> On Thu, 28 May 2015, Jason Gunthorpe wrote:
> 
> > > This is sufficient since it can be converted to ns or whatever one wants.
> >
> > Sure it is sufficient, but is it a robust UAPI, will it support
> > multiple hardware vendors?
> 
> What would prevent other hardware vendors from exporting their counters?

Well, I'm not a HW vendor, so I don't know for sure, the proposal is:

timestamp_mask - how many bits are valid in the timestamp.
		 timestamp values could be 64bits the most.
hca_core_clock - timestamp is given in HW cycles, hca_core_clock
                 is the frequency of the HCA and is necessary in
		 order to convert cycles to seconds.

So presumably cycle difference is:
  (a-b) & mask
And to ns is
 a*1E9/hca_core_clock

So first, the above is a tricky bit of math to open code, integer
overflow needs to be avoided - so at a minimum I'd like to see
libibverbs provide this as a function call. I'd probably also say
difference should be an inline too.

If it is a function call then maybe we don't need to tell the app
about hca_core_clock?

Second: What about wrap around? Does it even make sense to expose less
than 64 bits to userspace? Should the driver manage wrap around to
create a flat 64 bit space?

Otherwise, if the app has to do it, there is no event indicating wrap
has occured.

Without managing wrapping, cycle count difference is not a reliable
calculation.

> > Is anyone else in ethernet using verbs to deliver IP packets?
> 
> This is not only for Ethernet. Internally Infiniband is frequently used
> and there also timestamps are useful.

Sure, but should any other vendors be commenting on this UAPI?

> Well no. There has been a history of putting time corrections etc etc into
> these. Once you move from a raw counter to actual time various
> complications may need to be dealt with. For simple time differentials the
> counter is sufficient. If you really want proper "time" when something
> happens then you may want to scale and correct etc the value and have some
> sort of time sync approach.

Sure OK, I'm sold, export a cycle counter of some sort.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 28, 2015, 8:34 p.m. UTC | #30
On Thu, 28 May 2015, Jason Gunthorpe wrote:

> Second: What about wrap around? Does it even make sense to expose less
> than 64 bits to userspace? Should the driver manage wrap around to
> create a flat 64 bit space?

The wrap around is given by the mask. Cycle registers are often shorter
than 64 bits.

> Otherwise, if the app has to do it, there is no event indicating wrap
> has occured.

Well yes. You take the difference and then apply the mask. If you are
outside of the cycle range then this will not give you a proper time.

> Without managing wrapping, cycle count difference is not a reliable
> calculation.

Counters only works for the interval in which they do not go
through a complete cycle. But that is the nature of these counters. And
thats what you see when you look into the kernel timer subsystem for
example.

> > This is not only for Ethernet. Internally Infiniband is frequently used
> > and there also timestamps are useful.
>
> Sure, but should any other vendors be commenting on this UAPI?

I think they would have spoken up if they had any objections?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 28, 2015, 8:47 p.m. UTC | #31
On Thu, May 28, 2015 at 03:34:00PM -0500, Christoph Lameter wrote:
> On Thu, 28 May 2015, Jason Gunthorpe wrote:
> 
> > Second: What about wrap around? Does it even make sense to expose less
> > than 64 bits to userspace? Should the driver manage wrap around to
> > create a flat 64 bit space?
> 
> The wrap around is given by the mask. Cycle registers are often shorter
> than 64 bits.

I am aware of how cycle counters work.

My point was exposing raw wrapping counters is a horrible UAPI.

Shouldn't the driver software extend smaller counters to 64 bits?
That would take a single or and an unlikely branch, so don't say
'performance' :)

> through a complete cycle. But that is the nature of these counters. And
> thats what you see when you look into the kernel timer subsystem for
> example.

Very little in the kernel is exposed to that wrapping, the timer
subsystem takes care of it. Certainly, userspace never sees it.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean May 29, 2015, 7:59 a.m. UTC | #32
> > > Second: What about wrap around? Does it even make sense to expose less
> > > than 64 bits to userspace? Should the driver manage wrap around to
> > > create a flat 64 bit space?
> >
> > The wrap around is given by the mask. Cycle registers are often shorter
> > than 64 bits.
> 
> I am aware of how cycle counters work.
> 
> My point was exposing raw wrapping counters is a horrible UAPI.
> 
> Shouldn't the driver software extend smaller counters to 64 bits?
> That would take a single or and an unlikely branch, so don't say
> 'performance' :)

It's one thing to time stamp a frame or packet.  But this is assigning a time stamp to a work completion.  I don't even know what that's supposed to mean when considering 2 GB (or larger) transfers, RDMA read operations, XRC, dynamic connections, out of order retransmissions, shared receive queues, and other exotic features.  IMO, this is currently vendor specific functionality, and not obviously applicable as a generic feature.  It is certainly poorly defined and exposed in a very implementation specific way.

The use case given by Christoph only speaks of packet level time stamps.  One could argue that such a use case would place the stamp near the packet (similar to the GRH), rather than embedded into a work completion.  This would allow time stamps even in the absence of a work completion.

IMO, vendors already have ways to expose vendor specific features to user space.  I would mark this as vendor specific and keep it out of the core.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 29, 2015, 1:46 p.m. UTC | #33
On Thu, 28 May 2015, Jason Gunthorpe wrote:

> My point was exposing raw wrapping counters is a horrible UAPI.

Well this is a kernel bypass API and a lot of raw hardware issues will
have to be handled since you do go directly to the device.

> > through a complete cycle. But that is the nature of these counters. And
> > thats what you see when you look into the kernel timer subsystem for
> > example.
>
> Very little in the kernel is exposed to that wrapping, the timer
> subsystem takes care of it. Certainly, userspace never sees it.

Right but then we are not at the comfortable sockets API here but at the
bare metal level.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 29, 2015, 1:55 p.m. UTC | #34
On Fri, 29 May 2015, Hefty, Sean wrote:

> > Shouldn't the driver software extend smaller counters to 64 bits?
> > That would take a single or and an unlikely branch, so don't say
> > 'performance' :)
>
> It's one thing to time stamp a frame or packet.  But this is assigning a ti=
> me stamp to a work completion.  I don't even know what that's supposed to m=
> ean when considering 2 GB (or larger) transfers, RDMA read operations, XRC,=
>  dynamic connections, out of order retransmissions, shared receive queues, =
> and other exotic features.  IMO, this is currently vendor specific function=

What is the issue here? The timestamp is created when the processing is
finished by the nic and the completion entry becomes available.

> ality, and not obviously applicable as a generic feature.  It is certainly =
> poorly defined and exposed in a very implementation specific way.

It is exactly defined like any other cycle counters in hardware and it is
exposed using an API that allows multiple vendors to use these cycle
counters without regard to a particular vendors implementation.

I sure wish that Intel would be supporting a feature like this. Please
come up with a better alternative if there is one. This is likely going to
be a differentiator for the vendor used in our industry.

> The use case given by Christoph only speaks of packet level time stamps.  O=
> ne could argue that such a use case would place the stamp near the packet (=
> similar to the GRH), rather than embedded into a work completion.  This wou=
> ld allow time stamps even in the absence of a work completion.
>
> IMO, vendors already have ways to expose vendor specific features to user s=
> pace.  I would mark this as vendor specific and keep it out of the core.

How exactly would that work? How can we attach vendor specific extension
to a completion structure?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford May 29, 2015, 4:49 p.m. UTC | #35
On Fri, 2015-05-29 at 08:46 -0500, Christoph Lameter wrote:
> On Thu, 28 May 2015, Jason Gunthorpe wrote:
> 
> > My point was exposing raw wrapping counters is a horrible UAPI.
> 
> Well this is a kernel bypass API and a lot of raw hardware issues will
> have to be handled since you do go directly to the device.

No, that's not entirely true, and it *certainly* is not the correct way
to think about verbs extensions.  Is it kernel bypass?  Yes.  Does it go
direct to the hardware?  Not as far as the user application is
concerned.  The direct hardware access is abstracted away in the verbs
library.  Because the verbs library is a hardware abstraction layer, any
extensions to it need to be well thought out.  And by that I mean if it
is of general use, then it should be added in a general, abstract way
that any hardware can implement.  If it is specific to just one vendor's
hardware, then it can be added in a means that is specific to that
vendor's hardware.

Now, as a general rule, I would call timestamps general.  They should be
added in a fashion that anyone can implement.  They should also be well
defined.  Sean's questions raise a very valid point.  Exactly what is
being timestamped, and do we care about different timestamp options?  Is
it completion of message, start of message, transfer from HCA to main
system memory completion, etc.  The 00/10 header to this patch series
was probably answering Sean's question, but just based on the name of
the TIMESTAMP flag to the CQ creation attr struct it isn't clear that
this is the case.

> > > through a complete cycle. But that is the nature of these counters. And
> > > thats what you see when you look into the kernel timer subsystem for
> > > example.
> >
> > Very little in the kernel is exposed to that wrapping, the timer
> > subsystem takes care of it. Certainly, userspace never sees it.
> 
> Right but then we are not at the comfortable sockets API here but at the
> bare metal level.

That's not entirely true.  We still hold to our abstrations, they are
just intentionally kept very thin and high performing.
Hefty, Sean May 29, 2015, 4:52 p.m. UTC | #36
> What is the issue here? The timestamp is created when the processing is
> finished by the nic and the completion entry becomes available.

The timestamp is generated when a work completion entry is written.  If there's a clear use case for this, it hasn't been described.  The use case you mentioned only works if there is a 1:1 relationship between a packet and a work completion.  That is not what is being defined here.

> It is exactly defined like any other cycle counters in hardware and it is
> exposed using an API that allows multiple vendors to use these cycle
> counters without regard to a particular vendors implementation.

I disagree.  This is associated with a specific implementation.  It assumes that the counter is part of a CQ entry, and that the counter is written when the completion is written.  There's nothing that requires other vendors to follow that model, nor is it clear that this is a generic or useful enough operation that other vendors would want to follow this model.  Why not have the time stamp record the start of the transaction?  The end?  Have two stamps, once for the first packet, and one for the last?  Limit this to single packet operations only?

> > The use case given by Christoph only speaks of packet level time stamps.
> O=
> > ne could argue that such a use case would place the stamp near the
> packet (=
> > similar to the GRH), rather than embedded into a work completion.  This
> wou=
> > ld allow time stamps even in the absence of a work completion.
> >
> > IMO, vendors already have ways to expose vendor specific features to
> user s=
> > pace.  I would mark this as vendor specific and keep it out of the core.
> 
> How exactly would that work? How can we attach vendor specific extension
> to a completion structure?

I'm just stating that there is at other ways of exposing this sort of feature.  A time stamp could just as easily be written with the data, similar to the grh.  One of the points of defining the verbs extensions was exactly so that a vendor could export their own functionality.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) May 29, 2015, 4:59 p.m. UTC | #37
On Fri, 29 May 2015, Doug Ledford wrote:

> > Well this is a kernel bypass API and a lot of raw hardware issues will
> > have to be handled since you do go directly to the device.
>
> No, that's not entirely true, and it *certainly* is not the correct way
> to think about verbs extensions.  Is it kernel bypass?  Yes.  Does it go
> direct to the hardware?  Not as far as the user application is
> concerned.  The direct hardware access is abstracted away in the verbs

There is a compromise here by using the kernel as a administrative
function (setup and configuration of QPs) but using bare metal for the
data path. The structures modified for send/receive are structures that
are directly understood and handled by the hardware. That is the core
benefit of the RDMA API which results in the wanted performance and
latency.

The administrative function / bare metal separation is also reflected in
this patchset. The admin function allows the determination of the cycle
counter freq and size. The bare metal cycle counter exists in the
fastpath.

> library.  Because the verbs library is a hardware abstraction layer, any
> extensions to it need to be well thought out.  And by that I mean if it
> is of general use, then it should be added in a general, abstract way
> that any hardware can implement.  If it is specific to just one vendor's
> hardware, then it can be added in a means that is specific to that
> vendor's hardware.

What is particular here to the vendor?

> Now, as a general rule, I would call timestamps general.  They should be
> added in a fashion that anyone can implement.  They should also be well
> defined.  Sean's questions raise a very valid point.  Exactly what is
> being timestamped, and do we care about different timestamp options?  Is
> it completion of message, start of message, transfer from HCA to main
> system memory completion, etc.  The 00/10 header to this patch series
> was probably answering Sean's question, but just based on the name of
> the TIMESTAMP flag to the CQ creation attr struct it isn't clear that
> this is the case.

Ok then lets answer that.

> > Right but then we are not at the comfortable sockets API here but at the
> > bare metal level.
>
> That's not entirely true.  We still hold to our abstrations, they are
> just intentionally kept very thin and high performing.

Well there is a distinction here.

We provide the comfort of setup and administrative functions through the
kernel API. We still try to isolate the application as much as possible
when we go to the data paths but we need to hit bare metal in the fastpath
in order to accomplish our mission of maximum performance and minimum
latency. This cannot be accomplished with kernel calls and therefore it is
an examplle of kernel bypass. We want this as comfy as possible of course
and well defined.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford May 29, 2015, 4:59 p.m. UTC | #38
On Thu, 2015-05-28 at 12:14 -0500, Christoph Lameter wrote:
> On Thu, 28 May 2015, Jason Gunthorpe wrote:
> 
> > After a quick look through, the biggest question in my mind is what
> > should the timestamp value in the wc be?
> >
> > Right now it is some coded thing in clock cycles.
> 
> This is sufficient since it can be converted to ns or whatever one wants.

It is sufficient for your use.  It is not, however, a good API.

> > Should we require the driver to convert to ns before passing the wc
> > back to the app? (Looks like the socket implementation uniformly uses
> > us or ns)
> 
> But that requires additional processing.

Yes.

> > Should the app open code the conversion from clock cycles to ns, or
> > vfunc down to the driver?
> 
> The API provides the abilty to retrieve the clock freq which is
> sufficient for the user to convert the value to meaningful time values.

I would prefer if the access to the timestamp were implemented via a
function in libibverbs (I haven't looked at the git repo, too little
time, I'll get to it).  Something like ibv_get_cqe_timestamp().  That
function should be general and return a suitable, normalized value (ns
probably).  If you just want a simple comparator without the overhead of
normalizing to time, and are willing to accept the consequences of that,
then I would expect you to use something like
ibv_get_raw_cqe_timestamp() to get the unadulterated cycle counter.  For
the most part, the user space application should not know details like
"we are using a cycle counter in the HCA processor for timestamping",
that's below the level of abstraction we attempt to maintain at the
verbs level.  Libmlx4 should be the only thing aware of that fact, and
it talks to the mlx4 driver in the kernel to get the details it needs.
And by putting it into a function that libmlx4 implements, if libcxgb4
decides to implement timestamps and does it in a different way, the app
doesn't care, it just uses the same call.
Hefty, Sean May 29, 2015, 5:09 p.m. UTC | #39
PiBOb3csIGFzIGEgZ2VuZXJhbCBydWxlLCBJIHdvdWxkIGNhbGwgdGltZXN0YW1wcyBnZW5lcmFs
LiAgVGhleSBzaG91bGQgYmUNCj4gYWRkZWQgaW4gYSBmYXNoaW9uIHRoYXQgYW55b25lIGNhbiBp
bXBsZW1lbnQuICBUaGV5IHNob3VsZCBhbHNvIGJlIHdlbGwNCj4gZGVmaW5lZC4gIFNlYW4ncyBx
dWVzdGlvbnMgcmFpc2UgYSB2ZXJ5IHZhbGlkIHBvaW50LiAgRXhhY3RseSB3aGF0IGlzDQo+IGJl
aW5nIHRpbWVzdGFtcGVkLCBhbmQgZG8gd2UgY2FyZSBhYm91dCBkaWZmZXJlbnQgdGltZXN0YW1w
IG9wdGlvbnM/ICBJcw0KPiBpdCBjb21wbGV0aW9uIG9mIG1lc3NhZ2UsIHN0YXJ0IG9mIG1lc3Nh
Z2UsIHRyYW5zZmVyIGZyb20gSENBIHRvIG1haW4NCj4gc3lzdGVtIG1lbW9yeSBjb21wbGV0aW9u
LCBldGMuICBUaGUgMDAvMTAgaGVhZGVyIHRvIHRoaXMgcGF0Y2ggc2VyaWVzDQo+IHdhcyBwcm9i
YWJseSBhbnN3ZXJpbmcgU2VhbidzIHF1ZXN0aW9uLCBidXQganVzdCBiYXNlZCBvbiB0aGUgbmFt
ZSBvZg0KPiB0aGUgVElNRVNUQU1QIGZsYWcgdG8gdGhlIENRIGNyZWF0aW9uIGF0dHIgc3RydWN0
IGl0IGlzbid0IGNsZWFyIHRoYXQNCj4gdGhpcyBpcyB0aGUgY2FzZS4NCg0KSSBkaWRuJ3Qgc2Vl
IHRoZSBpbmZvcm1hdGlvbiB0aGF0IEkgd2FzIGxvb2tpbmcgZm9yIGluIHRoZSBwYXRjaCBoZWFk
ZXIgdG8gdGhpcyBzZXJpZXMuICBBcyBKYXNvbiBwb2ludGVkIG91dCwgdGhlIHVzZSBjYXNlIGlz
IGxhY2tpbmcuDQoNCklNTywgaXQgY291bGQgbWFrZSBqdXN0IGFzIG11Y2ggc2Vuc2UgdG8gYXNz
b2NpYXRlL2VuYWJsZSB0aW1lIHN0YW1waW5nIHdpdGggdGhlIFFQIGFzIHdpdGggdGhlIENRLCBv
ciBldmVuIG1ha2UgaXQgY29uZmlndXJhYmxlIHBlciBvcGVyYXRpb24gb3Igb3BlcmF0aW9uIHR5
cGUuDQoNCklmIENocmlzdG9waCBoYXMgYSBjbGVhciB1c2UgY2FzZSBhbmQgd2FudHMgdG8gZ28g
dG8gdGhlICdiYXJlIG1ldGFsJywgdGhlbiBhIHZlbmRvciBzcGVjaWZpYyBvcHRpb24gc2VlbXMg
aWRlYWwuICBBdCBsZWFzdCB1bnRpbCB0aGVyZSBhcmUgb3RoZXIgaW1wbGVtZW50YXRpb25zIG9y
IHRoZSBkcml2aW5nIHVzZSBjYXNlIGlzIGNsZWFyZXIuDQoNCi0gU2Vhbg0K
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford May 29, 2015, 5:14 p.m. UTC | #40
On Fri, 2015-05-29 at 17:09 +0000, Hefty, Sean wrote:
> > Now, as a general rule, I would call timestamps general.  They should be
> > added in a fashion that anyone can implement.  They should also be well
> > defined.  Sean's questions raise a very valid point.  Exactly what is
> > being timestamped, and do we care about different timestamp options?  Is
> > it completion of message, start of message, transfer from HCA to main
> > system memory completion, etc.  The 00/10 header to this patch series
> > was probably answering Sean's question, but just based on the name of
> > the TIMESTAMP flag to the CQ creation attr struct it isn't clear that
> > this is the case.
> 
> I didn't see the information that I was looking for in the patch header to this series.  As Jason pointed out, the use case is lacking.
> 
> IMO, it could make just as much sense to associate/enable time stamping with the QP as with the CQ, or even make it configurable per operation or operation type.
> 
> If Christoph has a clear use case and wants to go to the 'bare metal', then a vendor specific option seems ideal.  At least until there are other implementations or the driving use case is clearer.

The use case is clear IMO.  It's for financial trading software.  I
don't think they really care about details like whether it's the start
or end packet, or completion, or whatever.  They need a tie breaker
between when they have two different buy or sell orders on the same lot
of stock.  Any deterministic timing/ordering method will do as long as
they consistently apply it I think.  And the faster and lower overhead
the process, the better.  He doesn't really want a timestamp, he merely
wants a sequence ordering.  But a timestamp is what they are using to
get him what he needs.

Is that a fair guess Christoph?
Hefty, Sean May 29, 2015, 5:18 p.m. UTC | #41
> The use case is clear IMO.  It's for financial trading software.  I

> don't think they really care about details like whether it's the start

> or end packet, or completion, or whatever.  They need a tie breaker

> between when they have two different buy or sell orders on the same lot

> of stock.  Any deterministic timing/ordering method will do as long as

> they consistently apply it I think.  And the faster and lower overhead

> the process, the better.  He doesn't really want a timestamp, he merely

> wants a sequence ordering.  But a timestamp is what they are using to

> get him what he needs.

> 

> Is that a fair guess Christoph?


I get Christoph's usage model.  But AFAIK, he uses UD packets, often with multicast.  So the timestamp ends up associated indirectly with an actual packet.  That model makes sense to me.  It’s the generic time stamping of a work completion (for, say, a 2 GB transfer over RC using an SRQ) that doesn't make sense to me.  That's the part I'm struggling with.
Steve Wise May 29, 2015, 7:21 p.m. UTC | #42
On 5/29/2015 11:52 AM, Hefty, Sean wrote:
>> What is the issue here? The timestamp is created when the processing is
>> finished by the nic and the completion entry becomes available.
> The timestamp is generated when a work completion entry is written.  If there's a clear use case for this, it hasn't been described.  The use case you mentioned only works if there is a 1:1 relationship between a packet and a work completion.  That is not what is being defined here.
>
>> It is exactly defined like any other cycle counters in hardware and it is
>> exposed using an API that allows multiple vendors to use these cycle
>> counters without regard to a particular vendors implementation.
> I disagree.  This is associated with a specific implementation.  It assumes that the counter is part of a CQ entry, and that the counter is written when the completion is written.  There's nothing that requires other vendors to follow that model, nor is it clear that this is a generic or useful enough operation that other vendors would want to follow this model.  Why not have the time stamp record the start of the transaction?  The end?  Have two stamps, once for the first packet, and one for the last?  Limit this to single packet operations only?

FWIW: cxgb4 hardware includes a hw timestamp in its CQE as well. It is 
used by SW for CQ overflow detection and debug timing analysis...

>>> The use case given by Christoph only speaks of packet level time stamps.
>> O=
>>> ne could argue that such a use case would place the stamp near the
>> packet (=
>>> similar to the GRH), rather than embedded into a work completion.  This
>> wou=
>>> ld allow time stamps even in the absence of a work completion.
>>>
>>> IMO, vendors already have ways to expose vendor specific features to
>> user s=
>>> pace.  I would mark this as vendor specific and keep it out of the core.
>> How exactly would that work? How can we attach vendor specific extension
>> to a completion structure?
> I'm just stating that there is at other ways of exposing this sort of feature.  A time stamp could just as easily be written with the data, similar to the grh.  One of the points of defining the verbs extensions was exactly so that a vendor could export their own functionality.
>
> - Sean
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) June 1, 2015, 11:39 a.m. UTC | #43
On Fri, 29 May 2015, Doug Ledford wrote:

> The use case is clear IMO.  It's for financial trading software.  I
> don't think they really care about details like whether it's the start
> or end packet, or completion, or whatever.  They need a tie breaker
> between when they have two different buy or sell orders on the same lot
> of stock.  Any deterministic timing/ordering method will do as long as
> they consistently apply it I think.  And the faster and lower overhead
> the process, the better.  He doesn't really want a timestamp, he merely
> wants a sequence ordering.  But a timestamp is what they are using to
> get him what he needs.
>
> Is that a fair guess Christoph?

We want to have a time stamp when the action is complete and the data is
available to the application or the send action is complete and the CQ
entry can be reused. That is a well defined point and that is how the time
stamps function in the existing implementation. This is an obvious
understanding and its pretty clear.

The time stamp needs to be at the end of the action because the timestamp
is used to:

1. Assess the impact of network processing. This can be compared with
   packet timestamps from capture devices off the wire.
2. Delineate the start of packet processing in software.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) June 1, 2015, 11:44 a.m. UTC | #44
> > This is sufficient since it can be converted to ns or whatever one wants.
>
> It is sufficient for your use.  It is not, however, a good API.

I hate these foggy statements that you guys come up with. Why is it not a
good API? Having a cycle counter without processiing overhead is a good
thing and the way counters are handled is pretty well established.

> > The API provides the abilty to retrieve the clock freq which is
> > sufficient for the user to convert the value to meaningful time values.
>
> I would prefer if the access to the timestamp were implemented via a
> function in libibverbs (I haven't looked at the git repo, too little
> time, I'll get to it).  Something like ibv_get_cqe_timestamp().  That
> function should be general and return a suitable, normalized value (ns
> probably).  If you just want a simple comparator without the overhead of
> normalizing to time, and are willing to accept the consequences of that,
> then I would expect you to use something like

That would introduce additional latencies and would make that feature no
longer useful to us. The advantage of this approach is that the counter is
in the same cacheline that is already used. It is very low overhead.

> ibv_get_raw_cqe_timestamp() to get the unadulterated cycle counter.  For
> the most part, the user space application should not know details like
> "we are using a cycle counter in the HCA processor for timestamping",

Why not? A cycle counter is the general way of providing
timestamps in hardware. RDTSC is such a cycle counter as well. There are
numerous examples of counters like that.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) June 1, 2015, 11:50 a.m. UTC | #45
On Fri, 29 May 2015, Hefty, Sean wrote:

> > What is the issue here? The timestamp is created when the processing is
> > finished by the nic and the completion entry becomes available.
>
> The timestamp is generated when a work completion entry is written.  If the=
> re's a clear use case for this, it hasn't been described.  The use case you=
>  mentioned only works if there is a 1:1 relationship between a packet and a=
>  work completion.  That is not what is being defined here.

It does make sense to have a timestamp when the work described by a CQ has
been completed. For that you do not need a 1:1 correspondence.

My use case is focused on single packet processing because that is what we
do here.

> > It is exactly defined like any other cycle counters in hardware and it is
> > exposed using an API that allows multiple vendors to use these cycle
> > counters without regard to a particular vendors implementation.
>
> I disagree.  This is associated with a specific implementation.  It assumes=
>  that the counter is part of a CQ entry, and that the counter is written wh=
> en the completion is written.  There's nothing that requires other vendors =
> to follow that model, nor is it clear that this is a generic or useful enou=
> gh operation that other vendors would want to follow this model.  Why not h=
> ave the time stamp record the start of the transaction?  The end?  Have two=
>  stamps, once for the first packet, and one for the last?  Limit this to si=
> ngle packet operations only?

Why would you have a timestamp at the beginning of the transaction? That
is useless because you can use packet capture devices to establish that
timepoint. At that point you have not identified the CQ anyways.

Having a timestamp when an action is complete makes sense, is generic
and general.

Adding the timestamp to the data means that the application now has to
separate the metadata (timestamp) from the data stream. Not good.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hefty, Sean June 1, 2015, 2:54 p.m. UTC | #46
> We want to have a time stamp when the action is complete and the data is
> available to the application or the send action is complete and the CQ
> entry can be reused.

This is what polling the completion from the CQ tells you, independent of there being a time stamp.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) June 4, 2015, 12:58 a.m. UTC | #47
On Mon, 1 Jun 2015, Hefty, Sean wrote:

> > We want to have a time stamp when the action is complete and the data is
> > available to the application or the send action is complete and the CQ
> > entry can be reused.
>
> This is what polling the completion from the CQ tells you, independent of t=
> here being a time stamp.

But you may not be polling that frequently. Polling threads may check
multiple sources of events and may also currently executing code to handle
an event. Also there is the problem of the OS interrupting you. All of
these sources of inaccuracy are removed by the timestamp.

That was for inbound. For outbound you do not get a timestamp without this
feature. Typically reclaim of outbound work requeust is delayed quite a
bit and getting a timestamp later does not reflect the actual time the
message was sent.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index bb7d42c..b77dd77 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -233,6 +233,8 @@  static int mlx4_ib_query_device(struct ib_device *ibdev,
 	props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
 					   props->max_mcast_grp;
 	props->max_map_per_fmr = dev->dev->caps.max_fmr_maps;
+	props->hca_core_clock = dev->dev->caps.hca_core_clock;
+	props->timestamp_mask = 0xFFFFFFFFFFFFULL;
 
 out:
 	kfree(in_mad);