
[V1,libibverbs,1/8] Add ibv_poll_cq_ex verb

Message ID 1456306924-31298-2-git-send-email-yishaih@mellanox.com (mailing list archive)
State Superseded

Commit Message

Yishai Hadas Feb. 24, 2016, 9:41 a.m. UTC
From: Matan Barak <matanb@mellanox.com>

This is an extension verb for ibv_poll_cq. It allows the user to poll
the cq for specific wc fields only, while allowing to extend the wc.
The verb calls the provider in order to fill the WC with the required
information.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
---
 include/infiniband/verbs.h | 107 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)
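
For illustration, a minimal sketch (not part of the submitted patch) of the simpler, flag-checking way to consume the new verb. It assumes a CQ 'cq' that was created - via the extended CQ creation verb elsewhere in this series - to report only byte_len and qp_num, and it elides error handling:

	/* Storage for one extended WC plus its two requested 32-bit
	 * fields; uint64_t elements keep struct ibv_wc_ex aligned.
	 */
	uint64_t buf[(sizeof(struct ibv_wc_ex) + 2 * sizeof(uint32_t) + 7) / 8];
	struct ibv_wc_ex *wc = (struct ibv_wc_ex *)buf;
	struct ibv_poll_cq_ex_attr attr = { .max_entries = 1, .comp_mask = 0 };

	while (ibv_poll_cq_ex(cq, wc, &attr) > 0) {
		uint8_t *p = wc->buffer;

		if (wc->status != IBV_WC_SUCCESS)
			break;
		/* Present fields appear in buffer[] in the documented
		 * size-sorted order: byte_len first, then qp_num.
		 */
		if (wc->wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
			uint32_t byte_len = *(uint32_t *)p;
			p += sizeof(uint32_t);
			/* ... consume byte_len ... */
		}
		if (wc->wc_flags & IBV_WC_EX_WITH_QP_NUM) {
			uint32_t qp_num = *(uint32_t *)p;
			/* ... consume qp_num ... */
		}
	}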

Comments

Jason Gunthorpe Feb. 24, 2016, 7:02 p.m. UTC | #1
On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:

> +enum {
> +	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
> +				     IBV_WC_EX_WITH_DLID_PATH_BITS
> +};
> +
> +struct ibv_wc_ex {
> +	uint64_t		wr_id;
> +	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
> +	 * flags dynamically define the valid fields in buffer[0].
> +	 */
> +	uint64_t		wc_flags;
> +	uint32_t		status;
> +	uint32_t		opcode;
> +	uint32_t		vendor_err;
> +	uint32_t		reserved;
> +	uint8_t			buffer[0];
> +};

Um, maybe you should give an example of how on earth anyone is
supposed to use this, all of this looks like a *really bad idea* to
me.

Jason
Yishai Hadas Feb. 25, 2016, 8:01 a.m. UTC | #2
On 2/24/2016 9:02 PM, Jason Gunthorpe wrote:
> On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:
>
>> +enum {
>> +	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
>> +				     IBV_WC_EX_WITH_DLID_PATH_BITS
>> +};
>> +
>> +struct ibv_wc_ex {
>> +	uint64_t		wr_id;
>> +	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
>> +	 * flags dynamically define the valid fields in buffer[0].
>> +	 */
>> +	uint64_t		wc_flags;
>> +	uint32_t		status;
>> +	uint32_t		opcode;
>> +	uint32_t		vendor_err;
>> +	uint32_t		reserved;
>> +	uint8_t			buffer[0];
>> +};
>
> Um, maybe you should give an example of how on earth anyone is
> supposed to use this, all of this looks like a *really bad idea* to
> me.
>

The last patch in this series is a clear example of typical usage.
It was added as part of rc_pingpong, please take a look.

In addition, there are detailed man pages that describe the idea/usage
of the new verbs; see patch #7.


Jason Gunthorpe Feb. 25, 2016, 5:05 p.m. UTC | #3
On Thu, Feb 25, 2016 at 10:01:11AM +0200, Yishai Hadas wrote:
> On 2/24/2016 9:02 PM, Jason Gunthorpe wrote:
> >On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:
> >
> >>+enum {
> >>+	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
> >>+				     IBV_WC_EX_WITH_DLID_PATH_BITS
> >>+};
> >>+
> >>+struct ibv_wc_ex {
> >>+	uint64_t		wr_id;
> >>+	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
> >>+	 * flags dynamically define the valid fields in buffer[0].
> >>+	 */
> >>+	uint64_t		wc_flags;
> >>+	uint32_t		status;
> >>+	uint32_t		opcode;
> >>+	uint32_t		vendor_err;
> >>+	uint32_t		reserved;
> >>+	uint8_t			buffer[0];
> >>+};
> >
> >Um, maybe you should give an example of how on earth anyone is
> >supposed to use this, all of this looks like a *really bad idea* to
> >me.
> >
> 
> The last patch in this series is a clear example of typical usage.
> It was added as part of rc_pingpong, please take a look.
> 
> In addition, there are detailed man pages that describe the idea/usage of
> the new verbs; see patch #7.

The manual page and rc_pingpong do different things.

This still looks like a horrible user API.

Jason
Matan Barak Feb. 28, 2016, 4:03 p.m. UTC | #4
On 25/02/2016 19:05, Jason Gunthorpe wrote:
> On Thu, Feb 25, 2016 at 10:01:11AM +0200, Yishai Hadas wrote:
>> On 2/24/2016 9:02 PM, Jason Gunthorpe wrote:
>>> On Wed, Feb 24, 2016 at 11:41:57AM +0200, Yishai Hadas wrote:
>>>
>>>> +enum {
>>>> +	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
>>>> +				     IBV_WC_EX_WITH_DLID_PATH_BITS
>>>> +};
>>>> +
>>>> +struct ibv_wc_ex {
>>>> +	uint64_t		wr_id;
>>>> +	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
>>>> +	 * flags dynamically define the valid fields in buffer[0].
>>>> +	 */
>>>> +	uint64_t		wc_flags;
>>>> +	uint32_t		status;
>>>> +	uint32_t		opcode;
>>>> +	uint32_t		vendor_err;
>>>> +	uint32_t		reserved;
>>>> +	uint8_t			buffer[0];
>>>> +};
>>>
>>> Um, maybe you should give an example of how on earth anyone is
>>> supposed to use this, all of this looks like a *really bad idea* to
>>> me.
>>>
>>
>> The last patch in this series is a clear example of typical usage.
>> It was added as part of rc_pingpong, please take a look.
>>
>> In addition, there are detailed man pages that describe the idea/usage of
>> the new verbs; see patch #7.
>
> The manual page and rc_pingpong do different things.
>

There are two options to use the API. They are both described well in 
the man page. This example uses the more trivial and easy-to-use way.

> This still looks like a horrible user API.
>

Why do you think so?

We want to present a completion structure that can be extended,
without adding the overhead of setting unnecessary fields and without
risking that future attributes will make the completion bigger than a
cache line (and create a great penalty). This also came up in the
earlier libibverbs API discussion.

We could introduce a verb for every new field (poll_cq_ts), but what 
will a user do if he wants new_feature_a and new_feature_b from the same 
completion? In addition, polluting libibverbs with so many polling verbs 
is (to say the least) awkward.

The second option we could think of is to give the user a way to define
which fields he would like to get - not wasting time, space and cache
lines on fields he doesn't want. Doing that automatically would of
course be better (for example, a smart JIT, or a compiler that compiles
the user-space application along with the vendor libraries, might be
able to do it; even then, in low-level languages it won't optimize the
data layout), but since I don't see a way to do that in the way we
currently work, I think it's best to add it to the API.

The user defines which fields interest him via create_cq_ex and then gets
a compact struct that consists only of the fields he needs, sorted in a
way that avoids holes where possible (alignment-wise). This API is
guaranteed to be extensible without penalty and to be backward
compatible. With respect to the discussion we had on the list when the
kernel part was introduced, we believe this API represents a valid
choice that fills all the requirements.
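
For illustration, a rough sketch (mine, not code from the series) of how
an application could locate a field inside buffer[], given a completion's
wc_flags and the size-sorted field order documented in the patch below:

/* Canonical field order and sizes, mirroring the patch's comment. */
static const struct { uint64_t flag; size_t size; } wc_ex_fields[] = {
	{ IBV_WC_EX_WITH_BYTE_LEN,       sizeof(uint32_t) },
	{ IBV_WC_EX_WITH_IMM,            sizeof(uint32_t) },
	{ IBV_WC_EX_WITH_QP_NUM,         sizeof(uint32_t) },
	{ IBV_WC_EX_WITH_SRC_QP,         sizeof(uint32_t) },
	{ IBV_WC_EX_WITH_PKEY_INDEX,     sizeof(uint16_t) },
	{ IBV_WC_EX_WITH_SLID,           sizeof(uint16_t) },
	{ IBV_WC_EX_WITH_SL,             sizeof(uint8_t)  },
	{ IBV_WC_EX_WITH_DLID_PATH_BITS, sizeof(uint8_t)  },
};

/* Byte offset of 'field' within ibv_wc_ex.buffer[], or -1 if this
 * completion does not carry that field.
 */
static ptrdiff_t wc_ex_field_offset(uint64_t wc_flags, uint64_t field)
{
	ptrdiff_t off = 0;
	size_t i;

	if (!(wc_flags & field))
		return -1;
	for (i = 0; i < sizeof(wc_ex_fields) / sizeof(wc_ex_fields[0]); i++) {
		if (wc_ex_fields[i].flag == field)
			return off;
		if (wc_flags & wc_ex_fields[i].flag)
			off += wc_ex_fields[i].size;
	}
	return -1;
}

E.g. wc_ex_field_offset(wc->wc_flags, IBV_WC_EX_WITH_QP_NUM) says where
to read the 32-bit qp_num from.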

> Jason
>

Matan
Jason Gunthorpe Feb. 29, 2016, 7:17 p.m. UTC | #5
On Sun, Feb 28, 2016 at 06:03:36PM +0200, Matan Barak (External) wrote:

> >The manual page and rc_pingpong do different things.
> 
> There are two options to use the API. They are both described well in the
> man page. This example uses the more trivial and easy-to-use way.

Gross.

> >This still looks like a horrible user API.
> >
> 
> Why do you think so?

By my count it scores about a 2 on the Rusty API scale, which is
pretty bad.

> We want to present a completion structure that can be extended, without
> adding the overhead of setting unnecessary fields and without risking that
> future attributes will make the completion bigger than a cache
> line (and create a great penalty). This also came up in the earlier
> libibverbs API discussion.

This series trades cache line efficiency for computation efficiency,
and it isn't at all clear to me that it is going to be a win. Better
include some good benchmarks.

Hint: Cacheline size is much less important if the cache lines are
resident and dirty, and the driver writes make them dirty. The driver
should be dirtying them in a way that avoids an RMW penalty.

> We could introduce a verb for every new field (poll_cq_ts), but what will a
> user do if he wants new_feature_a and new_feature_b from the same
> completion? In addition, polluting libibverbs with so many polling verbs is
> (to say the least) awkward.

As opposed to asking the user to hand craft structures and use ugly
awkward macros?

I'd say give it another think and try and make the user facing API
saner.

Jason
Matan Barak March 1, 2016, 8:52 a.m. UTC | #6
On 29/02/2016 21:17, Jason Gunthorpe wrote:
> On Sun, Feb 28, 2016 at 06:03:36PM +0200, Matan Barak (External) wrote:
>
>>> The manual page and rc_pingpong do different things.
>>
>> There are two options to use the API. They are both described well in the
>> man page. This example uses the more trivial and easy-to-use way.
>
> Gross.
>
>>> This still looks like a horrible user API.
>>>
>>
>> Why do you think so?
>
> By my count it scores about a 2 on the Rusty API scale, which is
> pretty bad.
>

This is like writing something that means nothing....
Opinions are always appreciated when they contain valid arguments, so please elaborate.

>> We want to present a completion structure that can be extended, without
>> adding the overhead of setting unnecessary fields and without risking that
>> future attributes will make the completion bigger than a cache
>> line (and create a great penalty). This also came up in the earlier
>> libibverbs API discussion.
>
> This series trades cache line efficiency for computation efficiency,
> and it isn't at all clear to me that it is going to be a win. Better
> include some good benchmarks.
>

WRONG. Which computational overhead are you talking about?
The user-space driver could optimize the common cases and eliminate
*almost* all extra computational overhead (this is done in the libmlx4
and libmlx5 user-space drivers).
Even if there were such an overhead, it is not related to the API.
Future hardware could do this entirely in hardware, eliminating it
altogether.

> Hint: Cacheline size is much less important if the cache lines are
> resident and dirty, and the driver writes make them dirty. The driver
> should be dirtying them in a way that avoids a RMW penalty.
>

The user-space driver writes one completion at a time, which (depending 
on the user configuration) is probably less than a cache line. Why do 
you think it's worse than how ibv_poll_cq behaves? The way the 
user-space driver optimizes/dirties the cache line is unrelated to the API.

>> We could introduce a verb for every new field (poll_cq_ts), but what will a
>> user do if he wants new_feature_a and new_feature_b from the same
>> completion? In addition, polluting libibverbs with so many polling verbs is
>> (to say the least) awkward.
>
> As opposed to asking the user to hand craft structures and use ugly
> awkward macros?
>

A function for every permutation of completion fields is uglier for sure
(there are currently 9 optional completion fields in this proposal; you
can easily do the math to see how many combinations that gives).

> I'd say give it another think and try and make the user facing API
> saner.
>

Let's think of the main requirement here - an extensible way of getting
completions with only the data the user requires. So either you give an
explicit function for every permutation - which is pretty awful (to say
the least) - or you have a way to say "this is what I want" and "if the
vendor reported this field, please give it to me".

Since it could be possible to tell future hardware "I'm only interested
in these fields in a CQ", the second approach seems better.

> Jason
>

Matan
Christoph Lameter (Ampere) March 1, 2016, 4:10 p.m. UTC | #7
On Tue, 1 Mar 2016, Matan Barak (External) wrote:

> A function for every permutation of completion fields is uglier for sure
> (there are currently 9 optional completion fields in this proposal; you can
> easily do the math to see how many combinations that gives).
>
> > I'd say give it another think and try and make the user facing API
> > saner.
> >
>
> Let's think of the main requirement here - an extensible way of getting
> completions with only the data the user requires. So either you give an
> explicit function for every permutation - which is pretty awful (to say the
> least) - or you have a way to say "this is what I want" and "if the vendor
> reported this field, please give it to me".
>
> Since it could be possible to tell future hardware "I'm only interested in
> these fields in a CQ", the second approach seems better.

I like the approach in this patch and the example shows me how to
integrate that easily into our code base. Please let's get this merged.

Being able to customize the CQ contents so that they fit into a cacheline
is pretty important. We have high throughput packet capture devices that
are affected by slight changes to the cache footprint. We have seen
significant regressions in the past because fields were added that we did
not consider important. I like it.

Jason Gunthorpe March 1, 2016, 5:24 p.m. UTC | #8
On Tue, Mar 01, 2016 at 10:52:33AM +0200, Matan Barak (External) wrote:
> >By my count it scores about a 2 on the Rusty API scale, which is
> >pretty bad.
> 
> This is like writing something that means nothing....

If you don't agree with the points on the Rusty scale as what
constitutes good API design then we really have no common ground on
this point at all.

> >This series trades cache line efficiency for computation efficiency,
> >and it isn't at all clear to me that it is going to be a win. Better
> >include some good benchmarks.
> >
> 
> WRONG. Which computational overhead are you talking about?

I don't see the driver side posted, but obviously there must be
indexing maths and possibly branching, depending on the design, that
is overhead.

> The user-space driver writes one completion at a time, which (depending on
> the user configuration) is probably less than a cache line. Why do you think
> it's worse than how ibv_poll_cq behaves? The way the user-space driver
> optimizes/dirties the cache line is unrelated to the API.

I didn't say it was worse; I questioned whether this should be the
overarching goal when no data to support it has been posted, and it
doesn't even make intuitive sense that worrying about the size of
*dirty cache lines* is very important (well, within limits).

Christoph may have some data, but I do wonder if his results are
polluted by the current ibv_poll_cq driver implementations which do
trigger RMW overheads.

> A function for every permutation of completion fields is uglier for sure
> (there are currently 9 optional completion fields in this proposal; you can
> easily do the math to see how many combinations that gives).

Why would you do every permutation?

Try a non-memcpy API design. Provide an opaque 'cursor' and functions
to return certain data. This has more unconditional branches, but may
avoid branching within the driver, and certainly avoids memcpys and
pretty much all cache line dirtying. Sean had some results that
were positive in this direction.

Justify that this horrible API is *necessary* with something concrete,
not just random hand waving and mumbling about cache lines. That means
benchmarking several options.

Jason
Matan Barak March 2, 2016, 7:34 a.m. UTC | #9
On 01/03/2016 19:24, Jason Gunthorpe wrote:
> On Tue, Mar 01, 2016 at 10:52:33AM +0200, Matan Barak (External) wrote:
>>> By my count it scores about a 2 on the Rusty API scale, which is
>>> pretty bad.
>>
>> This is like writing something that means nothing....
>
> If you don't agree with the points on the Rusty scale as what
> constitutes good API design then we really have no common ground on
> this point at all.
>

We can discuss actual points. That's how things can really progress.

>>> This series trades cache line efficiency for computation efficiency,
>>> and it isn't at all clear to me that it is going to be a win. Better
>>> include some good benchmarks.
>>>
>>
>> WRONG. Which computational overhead are you talking about?
>
> I don't see the driver side posted, but obviously there must be
> indexing maths and possibly branching, depending on the design, that
> is overhead.
>

libmlx4 patches were posted on the mailing list a while ago [1].
One of the patches there optimized ibv_poll_cq_ex for common cases with
almost zero overhead. Anyway, as I wrote before - this could be
delegated to future hardware, so I'm not sure it's relevant here.

>> The user-space driver writes one completion at a time, which (depending on
>> the user configuration) is probably less than a cache line. Why do you think
>> it's worse than how ibv_poll_cq behaves? The way the user-space driver
>> optimizes/dirties the cache line is unrelated to the API.
>
> I didn't say it was worse; I questioned whether this should be the
> overarching goal when no data to support it has been posted, and it
> doesn't even make intuitive sense that worrying about the size of
> *dirty cache lines* is very important (well, within limits).
>

We ran ib_send_bw (perftest package) when developing this - one time 
using the regular ibv_poll_cq and another time with ibv_poll_cq_ex and 
got identical results (the difference was tiny and not conclusive), but 
I don't have the actual results right now. This was one of the common 
cases the user-space driver optimized. We could re-run them and post 
results if you want.

> Christoph may have some data, but I do wonder if his results are
> polluted by the current ibv_poll_cq driver implementations which do
> trigger RMW overheads.
>
>> A function for every permutation of completion fields is uglier for sure
>> (there are currently 9 optional completion fields in this proposal; you
>> can easily do the math to see how many combinations that gives).
>
> Why would you do every permutation?
>

Because user-space applications might want to get completion 
time-stamping and size, but not the rest of the fields. They could 
pretty much decide which data matters to them and get only that.

> Try a non-memcpy API design. Provide an opaque 'cursor' and functions
> to return certain data. This has more unconditional branches, but may
> avoid branching within the driver, and certainly avoids memcpys and
> pretty much all cache line dirtying. Sean had some results that
> were positive in this direction.
>

Does the opaque pointer guarantee aligned access? Who allocates the
space for the vendor's CQE? Any concrete example?
One of the problems here is that CQEs could be formatted as -
"if QP type is y, then copy field z from o". Doing it this way may
result in doing the same "if" multiple times. The current approach could
still avoid memcpy and write straight to the user's buffer.

> Justify that this horrible API is *necessary* with something concrete,
> not just random hand waving and mumbling about cache lines. That means
> benchmarking several options.
>
> Jason
>

Matan
[1] https://www.spinics.net/lists/linux-rdma/msg29837.html
Jason Gunthorpe March 2, 2016, 6:28 p.m. UTC | #10
On Wed, Mar 02, 2016 at 09:34:55AM +0200, Matan Barak (External) wrote:
> >I didn't say it was worse; I questioned whether this should be the
> >overarching goal when no data to support it has been posted, and it
> >doesn't even make intuitive sense that worrying about the size of
> >*dirty cache lines* is very important (well, within limits).
> >
> 
> We ran ib_send_bw (perftest package) when developing this - one time using
> the regular ibv_poll_cq and another time with ibv_poll_cq_ex and got
> identical results (the difference was tiny and not conclusive), but I don't
> have the actual results right now. This was one of the common cases the
> user-space driver optimized. We could re-run them and post results if you
> want.

So all this ugly API to minimize cache line usage has no measured
performance gain?

You see why I'm concerned ..

> >Try a non-memcpy API design. Provide an opaque 'cursor' and functions
> >to return certain data. This has more unconditional branches, but may
> >avoid branching within the driver, and certainly avoids memcpys and
> >pretty much all cache line dirtying. Sean had some results that
> >were positive in this direction.
> >
> 
> Does the opaque pointer guarantee aligned access? Who allocates the
> space for the vendor's CQE? Any concrete example?
> One of the problems here is that CQEs could be formatted as -
> "if QP type is y, then copy field z from o". Doing it this way may
> result in doing the same "if" multiple times. The current approach could still
> avoid memcpy and write straight to the user's buffer.

No, none of that...

struct ibv_cq
{
    int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
    int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
    uint64_t (*read_timestamp)(struct ibv_cq *cq);
    uint32_t (*read_immediate_data)(struct ibv_cq *cq);
    void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
};

struct common_wr wc;

while (cq->read_next_cq(cq, &wc)) // Advance the cursor
{
    // Read the cursor
    uint64_t ts = cq->read_timestamp(cq);
    if (wc.opcode == ..) {
        uint32_t imm = cq->read_immediate_data(cq);
    }
}

There is lots of variety possible in the syntax; I've chosen the above for clarity.

1) The function pointers provided by the driver read directly from the
   CQ memory written by the adaptor. No memcpy. The pointers vary
   on a CQ by CQ basis so they can be optimal.
   Alignment is a driver/hardware problem and is exactly the same as
   in any other scheme
2) The driver fills in the pointers in a way that is optimal for the
   specific CQ - if there are multiple formats then the driver
   provides multiple function implementations and chooses the
   combination that matches the CQ when it is created.
   The intent is to avoid all conditional branching in the driver on
   these callbacks. The functions should be small enough to avoid
   a frame setup (e.g. build them with -fomit-frame-pointer and other
   function call related optimizations to achieve this)
2a) If the CQ does not support an extension like read_something_else
    then the driver fills the pointer with something that calls
    abort(), no conditional branching.
    The application is responsible for only calling allowed accessors
    based on how the CQ is created. No branching is needed.
3) read_next_cq goes entry by entry. If signaling the hardware
   that a CQE is now available is expensive then the driver will need
   some kind of batching scheme for that, i.e. signal on empty, signal
   every N, etc. An additional entry point may be required to manage
   this.
3a) This could also follow the new kAPI pattern and use a callback
    scheme which makes the batching and budgeting a lot simpler for
    the application.
4) A basic analysis says this trades cache line dirtying of the wc
   array for unconditional branches.
   It eliminates at least 1 conditional branch per CQE iteration by
   using only 1 loop.
   For some things it may entirely eliminate memory traffic in favor of
   register returns (e.g. my read_immediate_data example).
   For some things this may entirely eliminate work, e.g. the
   address and immediate data rarely need to be parsed for some apps.

   Compared to the poll_ex proposal this eliminates a huge
   number of conditional branches as 'wc_flags' and related no longer
   exist at all.

   Only careful benchmarking will tell if the unconditional branch
   overhead is greater than the savings from dropping conditional
   branches and cache line dirtying.

5) If the vendors could get together, it might be possible to replace
   some function pointers with inlined pointer math eg:

   struct ibv_cq
   {
        const void *cur_cqe;
        size_t immdata_offset;
   };

   static inline uint32_t ibv_cqe_read_immediate_data(struct ibv_cq *cq)
   {
      return *(const uint32_t *)((const char *)cq->cur_cqe +
                                 cq->immdata_offset);
   }

   Obviously this requires most vendors to use a HW CQE format that
   can support the above expression.

   Further, touching on an idea Christoph brought up a long time
   ago - it would be easy to use this API to build a vendor-optimized
   libibverbs that would support a single driver's card and would
   inline the CQE parse using something like the above, tuned for the
   single driver. It would be interesting to see if that is a win or
   not.

I hope you see how this sort of API is a lot saner than what is
proposed: it is type safe, misuses not caught by the compiler blow up
100% of the time at runtime, and the usage is intuitive and matches
typical C conventions.

The question that must be answered is the performance of the various
options, and that is what I mean by benchmarking. I would probably
approach this with the perf tool and look at the various CPU counters
for each option and then get someone like Christoph to chime in with
his specialized benchmark environment when I think the best option has
been found and has already been carefully micro-optimized.

Jason
Christoph Lameter (Ampere) March 2, 2016, 7:08 p.m. UTC | #11
On Wed, 2 Mar 2016, Jason Gunthorpe wrote:

> So all this ugly API to minimize cache line usage has no measured
> performance gain?

We have seen an increased cacheline footprint adding ~100-200ns to receive
latencies during loops. This does not show up in synthetic loads that do
not do much processing since their cache footprint is minimal.

> > Does the opaque pointer guarantee aligned access? Who allocates the
> > space for the vendor's CQE? Any concrete example?
> > One of the problems here is that CQEs could be formatted as -
> > "if QP type is y, then copy field z from o". Doing it this way may
> > result in doing the same "if" multiple times. The current approach could still
> > avoid memcpy and write straight to the user's buffer.
>
> No, none of that...
>
> struct ibv_cq
> {
>     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
>     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
>     uint64_t (*read_timestamp)(struct ibv_cq *cq);
>     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
>     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
> };

Argh. You are requiring multiple indirect function calls
to retrieve the same information and therefore significantly increase
latency. This is going to cause lots of problems for processing at high
speed, where we have to use the processor caches as carefully as possible
to squeeze out all we can get.

> 4) A basic analysis says this trades cache line dirtying of the wc
>    array for unconditional branches.
>    It eliminates at least 1 conditional branch per CQE iteration by
>    using only 1 loop.

This does none of that stuff at all if the device directly follows the
programmed format. There will be no need to do any driver formatting at
all.

>    Compared to the poll_ex proposal this eliminates a huge
>    number of conditional branches as 'wc_flags' and related no longer
>    exist at all.

wc_flags may be something bothersome. You do not want to check inside the
loop. All cqe's should come with the fields requested and the
layout of the data must be fixed when in the receive loop. No additional
branches in the loop.
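
For example (an illustrative sketch, not code from the series): an
application that requested only byte_len and qp_num at CQ creation time
knows the layout up front and can overlay a matching struct, with no flag
tests anywhere in the loop (process_packet is a stand-in for the
application's handler):

	/* Mirrors the fixed ibv_wc_ex header followed by the two requested
	 * fields, in the documented size-sorted order.
	 */
	struct app_wc {
		uint64_t	wr_id;
		uint64_t	wc_flags;
		uint32_t	status;
		uint32_t	opcode;
		uint32_t	vendor_err;
		uint32_t	reserved;
		uint32_t	byte_len;	/* first requested field */
		uint32_t	qp_num;		/* second requested field */
	};

	struct app_wc *wc = (struct app_wc *)wc_ex;	/* the polled ibv_wc_ex */

	process_packet(wc->wr_id, wc->byte_len, wc->qp_num); /* no branches */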

Jason Gunthorpe March 2, 2016, 7:51 p.m. UTC | #12
On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
> 
> > So all this ugly API to minimize cache line usage has no measured
> > performance gain?
> 
> We have seen an increased cacheline footprint adding ~100-200ns to receive
> latencies during loops. This does not show up in synthetic loads that do
> not do much processing since their cache footprint is minimal.

I know you've seen effects..

But apparently the patch authors haven't.

> > > Does the opaque pointer guarantee aligned access? Who allocates the
> > > space for the vendor's CQE? Any concrete example?
> > > One of the problems here is that CQEs could be formatted as -
> > > "if QP type is y, then copy field z from o". Doing it this way may
> > > result in doing the same "if" multiple times. The current approach could still
> > > avoid memcpy and write straight to the user's buffer.
> >
> > No, none of that...
> >
> > struct ibv_cq
> > {
> >     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
> >     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
> >     uint64_t (*read_timestamp)(struct ibv_cq *cq);
> >     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
> >     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
> > };
> 
> Argh. You are requiring multiple indirect function calls
to retrieve the same information and therefore significantly increase
> latency.

It depends. These are unconditional branches and they elide a lot of
other code and conditional branches by their design.

The mlx4 driver does something like this on every CQE to parse the
immediate data:

+	if (is_send) {
+		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
+		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

With this additional overhead for all the parse paths that have no
immediate:

+			if (wc_flags & IBV_WC_EX_WITH_IMM)
+				wc_buffer.b32++; // No imm to set

Whereas my suggestion would look more like:

static inline uint32_t mlx4_read_immediate_data(struct mlx4_cq *cq) { return cq->cur_cqe->immed_rss_invalid; }

[and even that can be micro-optimized further]

And the setting of WITH_IMM during the common parse is much less branchy:

opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
                                            opcode == MLX4_RECV_OPCODE_SEND_IMM ...))

I bet that is a lot less work going on, even including the indirect
jump overhead.

> This is going to cause lots of problems for processing at high
> speed, where we have to use the processor caches as carefully as possible
> to squeeze out all we can get.

The driver should be built so the branch targets are all in close
cache lines, with function prologues deliberately elided so the branch
targets are small. There may even be further techniques to micro
optimize the jump, if one wanted to spend the effort.

Further, since we are only running the code the caller actually needs
we are not wasting icache lines trundling through the driver CQE parse
that has to handle all cases, even ones the caller doesn't care about.

It is possible this uses fewer icache lines than what we have today,
it depends how big the calls end up..

> > 4) A basic analysis says this trades cache line dirtying of the wc
> >    array for unconditional branches.
> >    It eliminates at least 1 conditional branch per CQE iteration by
> >    using only 1 loop.
> 
> This does none of that stuff at all if the device directly follows the
> programmed format. There will be no need to do any driver formatting at
> all.

Maybe, but no such device exists? I'm not sure I want to see a
miserable API on the faint hope of hardware support..

Even if hardware appears, this basic API pattern can still support
that optimally by using the #5 technique - and still avoids the memcpy
from the DMA buffer to the user memory.

> >    Compared to the poll_ex proposal this eliminates a huge
> >    number of conditional branches as 'wc_flags' and related no longer
> >    exist at all.
> 
> wc_flags may be something bothersome. You do not want to check inside the
> loop.

Did you look at the mlx4 driver patch? It checks wc_flags constantly
when memcpying the HW CQE to the user memory - in the per CQE loop:

+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

Every single user CQE write (or even not write) is guarded by a
conditional branch on the flags, and use of post-increment like that
creates critical path data dependencies and kills ILP. All this extra
code eats icache lines.

Sure, the scheme saves dirtying cache lines, but at the cost of a huge
number of these conditional branches and more code. I'm deeply
skeptical that is an overall win compared to dirtying more cache
lines - mispredicted branches are expensive.

What I've suggested avoids all the cache line dirtying and avoids all
the above conditional branching and data dependencies at the cost of
a few indirect unconditional jumps. (and if #5 is possible they aren't
even jumps)

I say there is no way to tell which scheme performs better without
careful benchmarking - it is not obvious any of these trade offs are
winners.

> All cqe's should come with the fields requested and the
> layout of the data must be fixed when in the receive loop. No additional
> branches in the loop.

Sure, for the caller, I'm looking at this from the whole system
perspective. The driver is doing a wack more crap to produce this
formatting and it certainly isn't free.

Jason
Matan Barak March 21, 2016, 3:24 p.m. UTC | #13
On 02/03/2016 21:51, Jason Gunthorpe wrote:
> On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
>> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
>>
>>> So all this ugly API to minimize cache line usage has no measured
>>> performance gain?
>>
>> We have seen an increased cacheline footprint adding ~100-200ns to receive
>> latencies during loops. This does not show up in synthetic loads that do
>> not do much processing since their cache footprint is minimal.
>
> I know you've seen effects..
>
> But apparently the patch authors haven't.
>
>>>> Does the opaque pointer guarantee aligned access? Who allocates the
>>>> space for the vendor's CQE? Any concrete example?
>>>> One of the problems here is that CQEs could be formatted as -
>>>> "if QP type is y, then copy field z from o". Doing it this way may
>>>> result in doing the same "if" multiple times. The current approach could still
>>>> avoid memcpy and write straight to the user's buffer.
>>>
>>> No, none of that...
>>>
>>> struct ibv_cq
>>> {
>>>      int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
>>>      int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
>>>      uint64_t (*read_timestamp)(struct ibv_cq *cq);
>>>      uint32_t (*read_immediate_data)(struct ibv_cq *cq);
>>>      void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
>>> };
>>
>> Argh. You are requiring multiple indirect function calls
>> to retrieve the same information and therefore significantly increase
>> latency.
>
> It depends. These are unconditional branches and they elide a lot of
> other code and conditional branches by their design.
>

Our performance team and I measured the new ibv_poll_cq_ex vs the old 
poll_cq implementation. We measured the number of cycles around the 
poll_cq call. We saw a ~15-cycle increase for every CQE, which is about
2% over the regular poll_cq (which is not extensible). We used
ib_write_lat in order to do these measurements.

Pinpointing the source of this increase, we found that the function
pointer call to our internal poll_one routine is responsible. Our
poll_cq_ex implementation uses a per-CQ poll_one_ex function, which is
*almost* tailored to the fields the user requires. Using one static,
monolithic poll_one instead would cause a lot of conditional branches
and would decrease performance.
Meaning: calling through a function pointer, versus calling a static
poll_one function (that the compiler is free to inline), causes this
effect, but a monolithic poll_one function would incur substantial
computational overhead.
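
Roughly, the dispatch looks like this (a simplified sketch with
illustrative names - the real libmlx4/libmlx5 code differs):

	/* Each variant writes only the fields it was compiled for, with
	 * no wc_flags tests at poll time.
	 */
	static int poll_one_ex_std(struct mlx_cq *cq, struct ibv_wc_ex *wc);
	static int poll_one_ex_minimal(struct mlx_cq *cq, struct ibv_wc_ex *wc);
	static int poll_one_ex_generic(struct mlx_cq *cq, struct ibv_wc_ex *wc);

	/* Chosen once, when the CQ is created with its wc_flags. */
	static void select_poll_one(struct mlx_cq *cq, uint64_t wc_flags)
	{
		if (wc_flags == IBV_WC_STANDARD_FLAGS)
			cq->poll_one = poll_one_ex_std;
		else if (wc_flags == (IBV_WC_EX_WITH_BYTE_LEN |
				      IBV_WC_EX_WITH_QP_NUM))
			cq->poll_one = poll_one_ex_minimal;
		else
			cq->poll_one = poll_one_ex_generic; /* slow path */
	}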

Using a per-field getter introduces such a call (an unconditional jump)
for every field. If we were using static linking (+ whole-program
linkage), I agree this route would be better. However, based on the
results we saw earlier, we are worried this might incur tremendous overhead.
In addition, the user has to call read_next_cq for every completion 
entry (which is another unconditional jump).

> The mlx4 driver does something like this on every CQE to parse the
> immediate data:
>
> +	if (is_send) {
> +		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
> +		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
> +			if (wc_flags & IBV_WC_EX_WITH_IMM) {
> +				*wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> With this additional overhead for all the parse paths that have no
> immediate:
>
> +			if (wc_flags & IBV_WC_EX_WITH_IMM)
> +				wc_buffer.b32++; // No imm to set
>
> Whereas my suggestion would look more like:
>
> static inline uint32_t mlx4_read_immediate_data(struct mlx4_cq *cq) { return cq->cur_cqe->immed_rss_invalid; }
>
> [and even that can be micro-optimized further]
>
> And the setting of WITH_IMM during the common parse is much less branchy:
>
> opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
> wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
>                                              opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
>
> I bet that is a lot less work going on, even including the indirect
> jump overhead.
>

So we would have to read the owner_sr_opcode several times (for almost
every related wc field) and parse it accordingly, or we'll have to store
the pre-processed owner_sr_opcode in the CQ itself and incur the
overhead of "copying" it. In addition, we'll need another API (end_cq)
in order to indicate returning a CQ element to hardware ownership (and
to release the vendor's per-poll_cq locks, if any exist). So, as far as
we understand, you propose something like:

struct ibv_cq
{
     /* 'more' argument is new */
     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res, bool more);
     /* function is new */
     int (*end_cq)(struct ibv_cq *cq);
     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
     uint64_t (*read_timestamp)(struct ibv_cq *cq);
     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
};

Again, every call here introduces an unconditional jump and makes the CQ
structure larger. So, the CQ struct itself might not fit in a cache line.

In addition, ibv_cq isn't extensible as it is today. Adding a new
ibv_cq_ex will change all APIs that use it.

>> This is going to cause lots of problems for processing at high
>> speed, where we have to use the processor caches as carefully as possible
>> to squeeze out all we can get.
>
> The driver should be built so the branch targets are all in close
> cache lines, with function prologues deliberately elided so the branch
> targets are small. There may even be further techniques to micro
> optimize the jump, if one wanted to spend the effort.
>
> Further, since we are only running the code the caller actually needs
> we are not wasting icache lines trundling through the driver CQE parse
> that has to handle all cases, even ones the caller doesn't care about.
>

The current optimization we implement in our user-space drivers is to 
introduce multiple poll_one_ex functions, where every function assigns 
only some of the WC fields. Each function is tailored for a different 
use case and common scenario. By doing this, we don't assign fields that 
the user doesn't care about and we can avoid all conditional branches.

> It is possible this uses fewer icache lines than what we have today,
> it depends how big the calls end up..
>

The current proposal offers several poll_one_ex function - a function 
per common case. That's true we have more functions, but the function we 
actually used is *almost* tailored to the user's requirements.
All checks and conditions are done once. We trade-off the function 
pointer + re-checking/re-formatting some fields over and over again by 
copying the fields to ibv_wc_ex.

>>> 4) A basic analysis says this trades cache line dirtying of the wc
>>>     array for unconditional branches.
>>>     It eliminates at least 1 conditional branch per CQE iteration by
>>>     using only 1 loop.
>>
>> This does none of that stuff at all if the device directly follows the
>> programmed format. There will be no need to do any driver formatting at
>> all.
>
> Maybe, but no such device exists? I'm not sure I want to see a
> miserable API on the faint hope of hardware support..
>
> Even if hardware appears, this basic API pattern can still support
> that optimally by using the #5 technique - and still avoids the memcpy
> from the DMA buffer to the user memory.
>
>>>     Compared to the poll_ex proposal this eliminates a huge
>>>     number of conditional branches as 'wc_flags' and related no longer
>>>     exist at all.
>>
>> wc_flags may be something bothersome. You do not want to check inside the
>> loop.
>
> Did you look at the mlx4 driver patch? It checks wc_flags constantly
> when memcpying the HW CQE to the user memory - in the per CQE loop:
>
> +			if (wc_flags & IBV_WC_EX_WITH_IMM) {
> +				*wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> Every single user CQE write (or even not write) is guarded by a
> conditional branch on the flags, and use of post-increment like that
> creates critical path data dependencies and kills ILP. All this extra
> code eats icache lines.
>
 >

Look at the new code for libmlx5. This code is optimized *at compile
time*. We create a poll_one_ex function per common case. This has zero
overhead.
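
That is, the pattern is roughly (a sketch with illustrative names, not
the actual libmlx5 code; cqe_byte_len() stands in for the real CQE
parsing): one inline worker takes wc_flags as a compile-time constant,
and each exported variant instantiates it, so the compiler drops the
untaken branches and dead stores:

	static inline int poll_one_ex_common(struct mlx5_cq *cq,
					     struct ibv_wc_ex *wc,
					     uint64_t wc_flags)
	{
		uint8_t *p = wc->buffer;

		/* wc_flags is a literal constant in every caller, so each
		 * test below is resolved at compile time.
		 */
		if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
			*(uint32_t *)p = cqe_byte_len(cq);
			p += sizeof(uint32_t);
		}
		/* ... remaining optional fields, in the documented order ... */
		wc->wc_flags = wc_flags;
		return 0;
	}

	/* One small exported variant per common case. */
	static int poll_one_ex_std(struct mlx5_cq *cq, struct ibv_wc_ex *wc)
	{
		return poll_one_ex_common(cq, wc, IBV_WC_STANDARD_FLAGS);
	}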

Regarding the user-space check of wc_flags, I agree on eliminating this
check. If you created a CQ with a set of attributes, they are guaranteed
to exist for known values of the opcode.

> Sure, the scheme saves dirtying cache lines, but at the cost of a huge
> number of these conditional branches and more code. I'm deeply
> skeptical that is an overall win compared to dirtying more cache
> lines - mispredicted branches are expensive.
>
> What I've suggested avoids all the cache line dirtying and avoids all
> the above conditional branching and data dependencies at the cost of
> a few indirect unconditional jumps. (and if #5 is possible they aren't
> even jumps)
>

When eliminating "if (wc_flags & IBV_WC_EX_WITH_IMM)" in the user-space
application, conditional branching is eliminated in this proposal
as well.

The question becomes - what costs more - copying the data or using an 
unconditional branch.

> I say there is no way to tell which scheme performs better without
> careful benchmarking - it is not obvious any of these trade offs are
> winners.
>

Agree. It may depend on the user application as well.

>> All cqe's should come with the fields requested and the
>> layout of the data must be fixed when in the receive loop. No additional
>> branches in the loop.
>
> Sure, for the caller, I'm looking at this from the whole system
> perspective. The driver is doing a wack more crap to produce this
> formatting and it certainly isn't free.
>

Dynamic WC format (or getter functions) isn't going to be free, at least 
not without linking the vendor's user-space driver to the application 
itself or making all vendors behave the same.

> Jason
>

Yishai and Matan.
Jason Gunthorpe March 21, 2016, 5:09 p.m. UTC | #14
On Mon, Mar 21, 2016 at 05:24:23PM +0200, Matan Barak wrote:

> Our performance team and I measured the new ibv_poll_cq_ex vs the old
> poll_cq implementation. We measured the number of cycles around the poll_cq
> call. We saw a ~15-cycle increase for every CQE, which is about 2% over the
> regular poll_cq (which is not extensible). We used ib_write_lat in order
> to do these measurements.

So you understand my concern, this really ugly interface is
ostensibly designed this way for performance and yet it is 2% slower
than what we have today. :(

> Using a per-field getter introduces such a call (an unconditional jump) for
> every field.

As I keep saying, there is an overall trade off that could reduce the
number of jumps and code being run.

> In addition, the user has to call read_next_cq for every completion entry
> (which is another unconditional jump).

This is just replacing the loop inside the existing ibv_poll_cq_ex;
the number of overall jumps doesn't change.

> >opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
> >wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
> >                                             opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
> >
> >I bet that is a lot less work going on, even including the indirect
> >jump overhead.
> >
> 
> So we would have to read the owner_sr_opcode several times (for almost every
> related wc field)

No, I think you've missed the point: the accessors pretty much just
return fixed offsets in the HW WC. From what I've seen the HW WC looks
well structured, so, e.g., the offset of the timestamp field is fixed.
No need to check owner_sr_opcode again.

> "copying" it. In addition, we'll need another API (end_cq) in order to
> indicate return-cq-element-to-hardware-ownership (and release vendor's per
> poll_cq locks if exist). So, as far as we understand, you propose
> something

Yes, I mentioned this; some kind of completion batching scheme would
be needed.

This could actually prove to be a net win, depending on how carefully
it is designed and on typical application needs. The current
batch-per-poll_cq-loop scheme is not necessarily ideal.

> Again, every call here introduces an unconditional jump and makes the CQ
> structure larger. So, the CQ struct itself might not fit in a cache line.

Don't obsess so much about one or two cache lines in this sort of
context. Burning a large number of stack lines for the wc array is
bad of course, but an extra line isn't so bad, especially if it
is offset by an icache line.

> In addition, ibv_cq isn't extensible as it is today. Adding a new ibv_cq_ex
> will change all APIs that use it.

Not really; it just makes the structure longer, which is trivially
done.

> The current optimization we implement in our user-space drivers is to
> introduce multiple poll_one_ex functions, where every function assigns only
> some of the WC fields. Each function is tailored for a different use case
> and common scenario. By doing this, we don't assign fields that the user
> doesn't care about and we can avoid all conditional branches.

But the provider can't cover all cases - so uncommon cases will run
slow? Do you understand why I keep calling this a horrible API?

> Look at the new code for libmlx5. This code is optimized *at compile time*.
> We create a poll_one_ex function per common case. This has zero overhead.

I didn't see this on patchwork. Do you have a link?

Even if the wc_flags is constant and thus optimized, there is still
a lot of opcode-related branching and ++'s running around that
code. That is 'free' in the sense that it is similar to what is in the
existing implementation, but I'm suggesting a direction that largely
merges those branches with the mandatory user branches as well.

> >What I've suggested avoids all the cache line dirtying and avoids all
> >the above conditional branching and data dependencies at the cost of
> >a few indirect unconditional jumps. (and if #5 is possible they aren't
> >even jumps)
> >
> 
> When eliminating "if (wc_flags & IBV_WC_EX_WITH_IMM)" in the user-space
> application, conditional branching is eliminated in this proposal as well.

Just the wc_flags branching is gone; there is still excess branching
around the opcode that could be eliminated.

> The question becomes - what costs more - copying the data or using an
> unconditional branch.

This will depend very much on how many completions are being
processed in one go. Copying is almost certainly better for a small
number of completions - but if there is a small number then why care
so much about dcache lines?

> Dynamic WC format (or getter functions) isn't going to be free, at
> least not without linking the vendor's user-space driver to the
> application itself or making all vendors behave the same.

I don't think this should be so quickly discounted - clearly this API
allows for the accessors to be fully inlined if the user desires, and
you can't beat that for performance.

At the very least it, or some other better API option, deserves study
and benchmarking before committing to the ugly API.

Jason

Patch

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 6451d0f..99d6265 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -386,6 +386,79 @@  struct ibv_wc {
 	uint8_t			dlid_path_bits;
 };
 
+enum ibv_wc_flags_ex {
+	IBV_WC_EX_GRH			= 1 << 0,
+	IBV_WC_EX_IMM			= 1 << 1,
+	IBV_WC_EX_CSUM_OK		= 1 << 2,
+	IBV_WC_EX_INV			= 1 << 3,
+	IBV_WC_EX_WITH_BYTE_LEN		= 1 << 4,
+	IBV_WC_EX_WITH_IMM		= 1 << 5,
+	IBV_WC_EX_WITH_QP_NUM		= 1 << 6,
+	IBV_WC_EX_WITH_SRC_QP		= 1 << 7,
+	IBV_WC_EX_WITH_PKEY_INDEX	= 1 << 8,
+	IBV_WC_EX_WITH_SLID		= 1 << 9,
+	IBV_WC_EX_WITH_SL		= 1 << 10,
+	IBV_WC_EX_WITH_DLID_PATH_BITS	= 1 << 11,
+};
+
+enum {
+	IBV_WC_STANDARD_FLAGS = IBV_WC_EX_WITH_BYTE_LEN		|
+				 IBV_WC_EX_WITH_IMM		|
+				 IBV_WC_EX_WITH_QP_NUM		|
+				 IBV_WC_EX_WITH_SRC_QP		|
+				 IBV_WC_EX_WITH_PKEY_INDEX	|
+				 IBV_WC_EX_WITH_SLID		|
+				 IBV_WC_EX_WITH_SL		|
+				 IBV_WC_EX_WITH_DLID_PATH_BITS
+};
+
+/* fields order in wc_ex
+ * Note: To guarantee compatibility a new field will be added after fields
+ * of the same size.
+ *	uint32_t		byte_len,
+ *	uint32_t		imm_data;
+ *	uint32_t		qp_num;
+ *	uint32_t		src_qp;
+ *	uint16_t		pkey_index;
+ *	uint16_t		slid;
+ *	uint8_t			sl;
+ *	uint8_t			dlid_path_bits;
+ */
+
+enum {
+	IBV_WC_EX_WITH_64BIT_FIELDS = 0
+};
+
+enum {
+	IBV_WC_EX_WITH_32BIT_FIELDS = IBV_WC_EX_WITH_BYTE_LEN		|
+				      IBV_WC_EX_WITH_IMM		|
+				      IBV_WC_EX_WITH_QP_NUM		|
+				      IBV_WC_EX_WITH_SRC_QP
+};
+
+enum {
+	IBV_WC_EX_WITH_16BIT_FIELDS = IBV_WC_EX_WITH_PKEY_INDEX		|
+				      IBV_WC_EX_WITH_SLID
+};
+
+enum {
+	IBV_WC_EX_WITH_8BIT_FIELDS = IBV_WC_EX_WITH_SL			|
+				     IBV_WC_EX_WITH_DLID_PATH_BITS
+};
+
+struct ibv_wc_ex {
+	uint64_t		wr_id;
+	/* wc_flags is a combination of ibv_wc_flags_ex flags. The IBV_WC_EX_WITH_XXX
+	 * flags dynamically define the valid fields in buffer[0].
+	 */
+	uint64_t		wc_flags;
+	uint32_t		status;
+	uint32_t		opcode;
+	uint32_t		vendor_err;
+	uint32_t		reserved;
+	uint8_t			buffer[0];
+};
+
 enum ibv_access_flags {
 	IBV_ACCESS_LOCAL_WRITE		= 1,
 	IBV_ACCESS_REMOTE_WRITE		= (1<<1),
@@ -1052,8 +1125,16 @@  enum verbs_context_mask {
 	VERBS_CONTEXT_RESERVED	= 1 << 5
 };
 
+struct ibv_poll_cq_ex_attr {
+	unsigned int	max_entries;
+	uint32_t	comp_mask;
+};
+
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*poll_cq_ex)(struct ibv_cq *ibcq,
+			  struct ibv_wc_ex *wc,
+			  struct ibv_poll_cq_ex_attr *attr);
 	int (*query_device_ex)(struct ibv_context *context,
 			       const struct ibv_query_device_ex_input *input,
 			       struct ibv_device_attr_ex *attr,
@@ -1454,6 +1535,32 @@  ibv_create_srq_ex(struct ibv_context *context,
 }
 
 /**
+ * ibv_poll_cq_ex - Poll a CQ for work completions
+ * @cq:the CQ being polled
+ * @wc:array of @max_entries of &struct ibv_wc_ex where completions
+ *   will be returned
+ * @attr: Poll cq attributes
+ *
+ * Poll a CQ for (possibly multiple) completions.  If the return value
+ * is < 0, an error occurred.  If the return value is >= 0, it is the
+ * number of completions returned.  If the return value is
+ * non-negative and strictly less than max_entries, then the CQ was
+ * emptied.
+ */
+
+static inline int ibv_poll_cq_ex(struct ibv_cq *ibcq,
+				 struct ibv_wc_ex *wc,
+				 struct ibv_poll_cq_ex_attr *attr)
+{
+	struct verbs_context *vctx = verbs_get_ctx_op(ibcq->context,
+						      poll_cq_ex);
+	if (!vctx)
+		return ENOSYS;
+
+	return vctx->poll_cq_ex(ibcq, wc, attr);
+}
+
+/**
  * ibv_modify_srq - Modifies the attributes for the specified SRQ.
  * @srq: The SRQ to modify.
  * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,