
[V2,libibverbs,2/7] Add member functions to poll an extended CQ

Message ID 1463658702-9975-3-git-send-email-yishaih@mellanox.com (mailing list archive)
State Superseded

Commit Message

Yishai Hadas May 19, 2016, 11:51 a.m. UTC
From: Matan Barak <matanb@mellanox.com>

Currently, ibv_poll_cq is a one-stop shop for polling all of a
completion's attributes. The current implementation has a few
drawbacks:
1. The vendor's completion format is transformed to an agnostic IB
   format (copying data which could otherwise have been used directly).
2. Adding new completion attributes means wasting more time on copies.
3. Extensible functions require wasting cycles on logic which
   determines whether a function is known and implemented in the
   provider.

In order to solve these problems, we introduce a new poll_cq mechanism
for extensible CQs. Polling is done in batches, where every batch may
contain more than one completion. A batch is started by calling the
start_poll_ex function. This already fetches a completion that the user
can then query using the various query functions. Advancing to the next
completion is done using next_poll_ex. After querying all completions
(or when the user wants to stop fetching completions in the current
batch), end_poll_ex is called, as in the sketch below.
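
A minimal usage sketch, based only on the wrappers added in this patch
(process_wc() is an illustrative application callback, not part of the
API):

	#include <errno.h>
	#include <infiniband/verbs.h>

	static void process_wc(uint64_t wr_id, enum ibv_wc_status status);

	static void drain_cq(struct ibv_cq_ex *cq)
	{
		struct ibv_poll_cq_ex_attr attr = { .comp_mask = 0 };

		/* ENOENT here just means the CQ is empty; on any error
		 * from start_poll_ex, end_poll_ex must not be called. */
		if (ibv_start_poll_ex(cq, &attr))
			return;

		do {
			/* Read only the fields the application requested
			 * via wc_flags at CQ creation time. */
			process_wc(ibv_wc_read_wr_id(cq),
				   ibv_wc_read_status(cq));
		} while (ibv_next_poll_ex(cq) == 0);

		/* Once next_poll_ex fails, the batch must still be ended. */
		ibv_end_poll_ex(cq);
	}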

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
---
 include/infiniband/verbs.h | 102 +++++++++++++++++++++++++++++++++++++++++++++
 man/ibv_create_cq_ex.3     |  77 ++++++++++++++++++++++++++++++++++
 2 files changed, 179 insertions(+)

Comments

Jason Gunthorpe May 19, 2016, 6:19 p.m. UTC | #1
On Thu, May 19, 2016 at 02:51:37PM +0300, Yishai Hadas wrote:
> +	int (*start_poll_ex)(struct ibv_cq_ex *current,
> +			     struct ibv_poll_cq_ex_attr *attr);
> +	int (*next_poll_ex)(struct ibv_cq_ex *current);
> +	void (*end_poll_ex)(struct ibv_cq_ex *current);

Probably don't need the _ex here..

> +	uint64_t (*read_wr_id)(struct ibv_cq_ex *current);
> +	enum ibv_wc_status (*read_status)(struct ibv_cq_ex *current);
> +	enum ibv_wc_opcode (*read_opcode)(struct ibv_cq_ex *current);

opcode, status and wr_id are always available in every WC, so I think
a faster way to return them should be devised. Perhaps have the driver
write them to ibv_cq_ex and have the inline functions read the struct
directly?
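
Something along these lines, say (all field names made up, just to show
the idea):

	/* Hypothetical: the provider writes the hot fields into the CQ
	 * struct during start_poll/next_poll, so reading them is a plain
	 * load instead of an indirect call. */
	struct ibv_cq_ex {
		/* ... existing fields as in the patch ... */
		uint64_t		cur_wr_id;
		enum ibv_wc_status	cur_status;
		enum ibv_wc_opcode	cur_opcode;
	};

	static inline uint64_t ibv_wc_read_wr_id(struct ibv_cq_ex *cq)
	{
		return cq->cur_wr_id;
	}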

How does this work multi-threaded? Is the locking done in
start_poll/end_poll?

> +	uint32_t (*read_vendor_err)(struct ibv_cq_ex *current);
> +	uint32_t (*read_byte_len)(struct ibv_cq_ex *current);
> +	uint32_t (*read_imm_data)(struct ibv_cq_ex *current);
> +	uint32_t (*read_qp_num)(struct ibv_cq_ex *current);
> +	uint32_t (*read_src_qp)(struct ibv_cq_ex *current);
> +	int (*read_wc_flags)(struct ibv_cq_ex *current);
> +	uint16_t (*read_pkey_index)(struct ibv_cq_ex *current);
> +	uint16_t (*read_slid)(struct ibv_cq_ex *current);
> +	uint8_t (*read_sl)(struct ibv_cq_ex *current);
> +	uint8_t (*read_dlid_path_bits)(struct ibv_cq_ex *current);

Maybe some of these should be grouped into a struct? What kind of
performance hit did you measure on these calls? Reading all the ud
parameters is not very expensive...

read_ud_info ?
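
i.e. something like (struct and names made up):

	/* Hypothetical grouped read for the UD-related fields, returned
	 * in one call instead of five. Not part of this patch. */
	struct ibv_wc_ud_info {
		uint32_t src_qp;
		uint16_t pkey_index;
		uint16_t slid;
		uint8_t	 sl;
		uint8_t	 dlid_path_bits;
	};

	void (*read_ud_info)(struct ibv_cq_ex *current,
			     struct ibv_wc_ud_info *info);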

What does the driver look like? I'm assuming it turned out as short
and jump free as I outlined?

Finally, these extended APIs really need to be always available - so
the core code should always provide a compatibility wrapper for
providers that do not implement this API. The compat wrapper should
use the existing provider wc interface, roughly as sketched below.
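
Illustrative shim only - the struct and function names are made up:

	#include <errno.h>
	#include <infiniband/verbs.h>

	/* The core would install these ops for providers that lack the
	 * new interface: poll one WC via the legacy ibv_poll_cq() and
	 * serve the read_xxx calls from the cached ibv_wc. */
	struct compat_cq {
		struct ibv_cq_ex cq_ex;		/* must stay first */
		struct ibv_wc	 cur_wc;
	};

	static int compat_start_poll(struct ibv_cq_ex *cq,
				     struct ibv_poll_cq_ex_attr *attr)
	{
		struct compat_cq *ccq = (struct compat_cq *)cq;
		int n = ibv_poll_cq(ibv_cq_ex_to_cq(cq), 1, &ccq->cur_wc);

		if (n < 0)
			return EINVAL;		/* provider error */
		return n ? 0 : ENOENT;		/* empty CQ -> ENOENT */
	}

	static uint64_t compat_read_wr_id(struct ibv_cq_ex *cq)
	{
		return ((struct compat_cq *)cq)->cur_wc.wr_id;
	}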

We should not have situations like in your rc_pingpong example, where
totally different code paths are required for something like wc
processing.

Jason
Yishai Hadas May 23, 2016, 3:31 p.m. UTC | #2
On 5/19/2016 9:19 PM, Jason Gunthorpe wrote:
> On Thu, May 19, 2016 at 02:51:37PM +0300, Yishai Hadas wrote:
>> +	int (*start_poll_ex)(struct ibv_cq_ex *current,
>> +			     struct ibv_poll_cq_ex_attr *attr);
>> +	int (*next_poll_ex)(struct ibv_cq_ex *current);
>> +	void (*end_poll_ex)(struct ibv_cq_ex *current);
>
> Probably don't need the _ex here..

The '_ex' suffix indicates that those functions support extension
(ibv_poll_cq_ex_attr, ibv_cq_ex). Their matching ibv_xxx wrappers also
have _ex (e.g. ibv_start_poll_ex).
Makes sense?

>
>> +	uint64_t (*read_wr_id)(struct ibv_cq_ex *current);
>> +	enum ibv_wc_status (*read_status)(struct ibv_cq_ex *current);
>> +	enum ibv_wc_opcode (*read_opcode)(struct ibv_cq_ex *current);
>
> opcode, status and wr_id are always available in every WC, so I think
> a faster way to return them should be devised. Perhaps have the driver
> write them to ibv_cq_ex and have the inline functions read the struct
> directly?

Speaking specifically about 'opcode', the fact that it always exists
doesn't mean that the application will read it. Furthermore, in many
cases the application doesn't need it at all (e.g. only SEND is used,
there is a CQ per operation, etc.). In addition, in some cases it can
be derived from the wr_id that is returned.

Re 'status' and 'wr_id':
From our benchmark testing we found that copying the data as part of
the read_xxx is the major cost, not the function call, so we are not
sure that copying the status/wr_id to ibv_cq_ex directly will really
give a boost here.

However, as it makes sense, we will run some benchmark tests on that
suggestion and see whether it brings an improvement.

> How does this work multi-threaded? Is the locking done in
> start_poll.end_poll?

Yes. However, since the vendor controls the start/end calls, it can
hook functions with less locking when the CQ is created for
single-threaded access (see the downstream patch "Create a single
threaded CQ" in this series), e.g. as sketched below.
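
Purely illustrative - the provider function names are made up:

	/* At CQ creation time the provider picks the entry points, so
	 * single-threaded CQs get variants that skip the lock: */
	cq->start_poll_ex = single_threaded ?
		provider_start_poll_nolock : provider_start_poll_locked;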

>
>> +	uint32_t (*read_vendor_err)(struct ibv_cq_ex *current);
>> +	uint32_t (*read_byte_len)(struct ibv_cq_ex *current);
>> +	uint32_t (*read_imm_data)(struct ibv_cq_ex *current);
>> +	uint32_t (*read_qp_num)(struct ibv_cq_ex *current);
>> +	uint32_t (*read_src_qp)(struct ibv_cq_ex *current);
>> +	int (*read_wc_flags)(struct ibv_cq_ex *current);
>> +	uint16_t (*read_pkey_index)(struct ibv_cq_ex *current);
>> +	uint16_t (*read_slid)(struct ibv_cq_ex *current);
>> +	uint8_t (*read_sl)(struct ibv_cq_ex *current);
>> +	uint8_t (*read_dlid_path_bits)(struct ibv_cq_ex *current);
>
> Maybe some of these should be grouped into a struct? What kind of
> performance hit did you measure on these calls? Reading all the ud
> parameters is not very expensive...
>
> read_ud_info ?

As part of our benchmark testing we grouped status & wr_id, which are
usually needed, into one read call, but didn't see a real impact. Even
in a UD flow, the application may not need all fields, so supplying a
read_ud_info which would copy all candidate fields might not be useful.
As the API is extensible, we may add further extensions in the future
if they prove useful.

> What does the driver look like? I'm assuming it turned out as short
> and jump free as I outlined?
Yes, the read_xxx functions are short and really are jump-free.

> Finally, these extended APIs really need to be always available - so
> the core code should always provide a compatability wrapper for
> providers that do not implement this API. The compat wrapper should
> use the existing provider wc interface.

The compat layer will work in a non-optimal way, as it can't support
batching and there will be 2 copies: one from the HW CQ to ibv_wc, even
for fields that are not required, and a second from the ibv_wc to the
read_xxx output.

However, if required, it can be posted as some extra patches in another
series.

> We should not have stuff like in your rc_pingpong example where
> totally different code paths are required on something like wc
> processing.

The example code shares the parsing and the logic between the 2 modes
and demonstrates the usage of the new APIs. I agree that having a compat
layer would prevent the need for 2 code paths.


Jason Gunthorpe May 24, 2016, 5:34 p.m. UTC | #3
On Mon, May 23, 2016 at 06:31:52PM +0300, Yishai Hadas wrote:
> On 5/19/2016 9:19 PM, Jason Gunthorpe wrote:
> >On Thu, May 19, 2016 at 02:51:37PM +0300, Yishai Hadas wrote:
> >>+	int (*start_poll_ex)(struct ibv_cq_ex *current,
> >>+			     struct ibv_poll_cq_ex_attr *attr);
> >>+	int (*next_poll_ex)(struct ibv_cq_ex *current);
> >>+	void (*end_poll_ex)(struct ibv_cq_ex *current);
> >
> >Probably don't need the _ex here..
> 
> The '_ex' suffix indicates that those functions support extension
> (ibv_poll_cq_ex_attr, ibv_cq_ex). Their matching ibv_xxx wrappers also
> have _ex (e.g. ibv_start_poll_ex).
> Makes sense?

No, don't use _ex unless there is a non-_ex variant.

> Speaking specifically about 'opcode', the fact that it always exists
> doesn't mean that the application will read it. Furthermore, in many
> cases the application doesn't need it at all (e.g. only SEND is used,
> there is a CQ per operation, etc.). In addition, in some cases it can
> be derived from the wr_id that is returned.
> 
> Re 'status' and 'wr_id':
> From our benchmark testing we found that copying the data as part of
> the read_xxx is the major cost, not the function call, so we are not
> sure that copying the status/wr_id to ibv_cq_ex directly will really
> give a boost here.

Interesting result, OK.

> As part of our benchmark testing we grouped status & wr_id, which are
> usually needed, into one read call, but didn't see a real impact. Even
> in a UD flow, the application may not need all fields, so supplying a
> read_ud_info which would copy all candidate fields might not be useful.
> As the API is extensible, we may add further extensions in the future
> if they prove useful.

Okay, if benchmarks support this..

> >Finally, these extended APIs really need to be always available - so
> >the core code should always provide a compatability wrapper for
> >providers that do not implement this API. The compat wrapper should
> >use the existing provider wc interface.
> 
> The compat layer will work in a non-optimal way, as it can't support
> batching and there will be 2 copies: one from the HW CQ to ibv_wc, even
> for fields that are not required, and a second from the ibv_wc to the
> read_xxx output.

Yes. We should be encouraging all our drivers to implement the new API
and all our users to use the new API.

> However, if required it can be posted as some extra patches in another
> series.

Yes, this avoids the friction of moving the users.

Jason

Patch

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index b76b09f..b391353 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -856,6 +856,10 @@  struct ibv_cq {
 	uint32_t		async_events_completed;
 };
 
+struct ibv_poll_cq_ex_attr {
+	uint32_t comp_mask;
+};
+
 struct ibv_cq_ex {
 	struct ibv_context     *context;
 	struct ibv_comp_channel *channel;
@@ -869,6 +873,23 @@  struct ibv_cq_ex {
 	uint32_t		async_events_completed;
 
 	uint32_t		comp_mask;
+	int (*start_poll_ex)(struct ibv_cq_ex *current,
+			     struct ibv_poll_cq_ex_attr *attr);
+	int (*next_poll_ex)(struct ibv_cq_ex *current);
+	void (*end_poll_ex)(struct ibv_cq_ex *current);
+	uint64_t (*read_wr_id)(struct ibv_cq_ex *current);
+	enum ibv_wc_status (*read_status)(struct ibv_cq_ex *current);
+	enum ibv_wc_opcode (*read_opcode)(struct ibv_cq_ex *current);
+	uint32_t (*read_vendor_err)(struct ibv_cq_ex *current);
+	uint32_t (*read_byte_len)(struct ibv_cq_ex *current);
+	uint32_t (*read_imm_data)(struct ibv_cq_ex *current);
+	uint32_t (*read_qp_num)(struct ibv_cq_ex *current);
+	uint32_t (*read_src_qp)(struct ibv_cq_ex *current);
+	int (*read_wc_flags)(struct ibv_cq_ex *current);
+	uint16_t (*read_pkey_index)(struct ibv_cq_ex *current);
+	uint16_t (*read_slid)(struct ibv_cq_ex *current);
+	uint8_t (*read_sl)(struct ibv_cq_ex *current);
+	uint8_t (*read_dlid_path_bits)(struct ibv_cq_ex *current);
 };
 
 static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq)
@@ -876,6 +897,87 @@  static inline struct ibv_cq *ibv_cq_ex_to_cq(struct ibv_cq_ex *cq)
 	return (struct ibv_cq *)cq;
 }
 
+static inline int ibv_start_poll_ex(struct ibv_cq_ex *cq,
+				    struct ibv_poll_cq_ex_attr *attr)
+{
+	return cq->start_poll_ex(cq, attr);
+}
+
+static inline int ibv_next_poll_ex(struct ibv_cq_ex *cq)
+{
+	return cq->next_poll_ex(cq);
+}
+
+static inline void ibv_end_poll_ex(struct ibv_cq_ex *cq)
+{
+	cq->end_poll_ex(cq);
+}
+
+static inline uint64_t ibv_wc_read_wr_id(struct ibv_cq_ex *cq)
+{
+	return cq->read_wr_id(cq);
+}
+
+static inline enum ibv_wc_status ibv_wc_read_status(struct ibv_cq_ex *cq)
+{
+	return cq->read_status(cq);
+}
+
+static inline enum ibv_wc_opcode ibv_wc_read_opcode(struct ibv_cq_ex *cq)
+{
+	return cq->read_opcode(cq);
+}
+
+static inline uint32_t ibv_wc_read_vendor_err(struct ibv_cq_ex *cq)
+{
+	return cq->read_vendor_err(cq);
+}
+
+static inline uint32_t ibv_wc_read_byte_len(struct ibv_cq_ex *cq)
+{
+	return cq->read_byte_len(cq);
+}
+
+static inline uint32_t ibv_wc_read_imm_data(struct ibv_cq_ex *cq)
+{
+	return cq->read_imm_data(cq);
+}
+
+static inline uint32_t ibv_wc_read_qp_num(struct ibv_cq_ex *cq)
+{
+	return cq->read_qp_num(cq);
+}
+
+static inline uint32_t ibv_wc_read_src_qp(struct ibv_cq_ex *cq)
+{
+	return cq->read_src_qp(cq);
+}
+
+static inline int ibv_wc_read_wc_flags(struct ibv_cq_ex *cq)
+{
+	return cq->read_wc_flags(cq);
+}
+
+static inline uint16_t ibv_wc_read_pkey_index(struct ibv_cq_ex *cq)
+{
+	return cq->read_pkey_index(cq);
+}
+
+static inline uint16_t ibv_wc_read_slid(struct ibv_cq_ex *cq)
+{
+	return cq->read_slid(cq);
+}
+
+static inline uint8_t ibv_wc_read_sl(struct ibv_cq_ex *cq)
+{
+	return cq->read_sl(cq);
+}
+
+static inline uint8_t ibv_wc_read_dlid_path_bits(struct ibv_cq_ex *cq)
+{
+	return cq->read_dlid_path_bits(cq);
+}
+
 struct ibv_ah {
 	struct ibv_context     *context;
 	struct ibv_pd	       *pd;
diff --git a/man/ibv_create_cq_ex.3 b/man/ibv_create_cq_ex.3
index cf43784..1dc3e43 100644
--- a/man/ibv_create_cq_ex.3
+++ b/man/ibv_create_cq_ex.3
@@ -41,6 +41,83 @@  enum ibv_wc_flags_ex {
         IBV_WC_EX_WITH_DLID_PATH_BITS        = 1 << 7,  /* Require dlid path bits in WC */
 };
 
+.SH "Polling an extended CQ"
+In order to poll an extended CQ efficiently, a user can use the following functions.
+
+.TP
+.B Completion iterator functions
+
+.BI "int ibv_start_poll_ex(struct ibv_cq_ex " "*cq" ", struct ibv_poll_cq_ex_attr " "*attr")
+.br
+Start polling a batch of work completions.
+.I attr
+is given in order to make this function
+easily extensible in the future. This function returns 0 on success, or an error code
+otherwise. When no completions are available on the CQ, ENOENT is returned, but the CQ remains
+in a valid state. On success, the completion's attributes can be queried using the query
+functions described below. If an error code is returned, end_poll_ex shouldn't be called.
+
+.BI "int ibv_next_poll_ex(struct ibv_cq_ex " "*cq")
+.br
+This function is called in order to get the next work completion. It has to be called after
+.I start_poll_ex
+and before
+.I end_poll_ex
+are called. This function returns 0 on success, or an error code
+otherwise. When no completions are available on the CQ, ENOENT is returned, but the CQ remains
+in a valid state. On success, the completion's attributes can be queried using the query
+functions described below. If an error code is returned, end_poll_ex should still be called,
+indicating this is the end of the polled batch.
+
+.BI "void ibv_end_poll_ex(struct ibv_cq_ex " "*cq")
+.br
+This function indicates the end of the current batch of polled work completions. After calling this function, the user should start a new batch
+by calling
+.I start_poll_ex.
+
+.TP
+.B Polling fields in the completion
+These functions are used in order to query the current completion. The current completion is the completion which the iterator points to (start_poll_ex and next_poll_ex advance this iterator). Only fields that the user requested via wc_flags in ibv_create_cq_ex can be queried. In addition, some fields are only valid for certain opcodes and status codes.
+
+.BI "uint64_t ibv_wc_read_wr_id(struct ibv_cq_ex " "*cq"); \c
+ Get the wr_id from the current completion.
+
+.BI "enum ibv_wc_status ibv_wc_read_status(struct ibv_cq_ex " "*cq"); \c
+ Get the status from the current completion.
+
+.BI "enum ibv_wc_opcode ibv_wc_read_opcode(struct ibv_cq_ex " "*cq"); \c
+ Get the opcode from the current completion.
+
+.BI "uint32_t ibv_wc_read_vendor_err(struct ibv_cq_ex " "*cq"); \c
+ Get the vendor error from the current completion.
+
+.BI "uint32_t ibv_wc_read_byte_len(struct ibv_cq_ex " "*cq"); \c
+ Get the payload byte length from the current completion.
+
+.BI "uint32_t ibv_wc_read_imm_data(struct ibv_cq_ex " "*cq"); \c
+ Get the immediate data field from the current completion.
+
+.BI "uint32_t ibv_wc_read_qp_num(struct ibv_cq_ex " "*cq"); \c
+ Get the QP number field from the current completion.
+
+.BI "uint32_t ibv_wc_read_src_qp(struct ibv_cq_ex " "*cq"); \c
+ Get the source QP number field from the current completion.
+
+.BI "int ibv_wc_read_wc_flags(struct ibv_cq_ex " "*cq"); \c
+ Get the work completion flags from the current completion.
+
+.BI "uint16_t ibv_wc_read_pkey_index(struct ibv_cq_ex " "*cq"); \c
+ Get the pkey index field from the current completion.
+
+.BI "uint16_t ibv_wc_read_slid(struct ibv_cq_ex " "*cq"); \c
+ Get the slid field from the current completion.
+
+.BI "uint8_t ibv_wc_read_sl(struct ibv_cq_ex " "*cq"); \c
+ Get the sl field from the current completion.
+
+.BI "uint8_t ibv_wc_read_dlid_path_bits(struct ibv_cq_ex " "*cq"); \c
+ Get the dlid_path_bits field from the current completion.
+
 .SH "RETURN VALUE"
 .B ibv_create_cq_ex()
 returns a pointer to the CQ, or NULL if the request fails.