[RFC,rdma-core] Verbs: Introduce resource domain

Message ID 1505648922-21346-1-git-send-email-yishaih@mellanox.com
State RFC

Commit Message

Yishai Hadas Sept. 17, 2017, 11:48 a.m. UTC
A resource domain represents a collection of IB resources (such as QPs,
CQs, etc.) which are accessed by an application from the same context
(e.g. the same thread) and as such can share hardware and software
resources, allowing smarter resource management by the provider library
and better performance.

Today, without resource domains, provider libraries ensure thread
safety for all resources and implement some arbitrary logic for
hardware resource management (e.g. pre-allocated limited pools, dynamic
allocation, round-robin vs. LRU, etc.).

We would like to allow the application to explicitly specify which
resources are going to be used by which resource domains, thereby
allowing the libraries to improve HW and SW resource utilization.

When an application creates several resources on the same resource
domain index, the provider library can share the related hardware
resources between them and thus better utilize what is available.
On the other hand, if the application creates each resource on a
dedicated resource domain index, each one will use a dedicated hardware
resource.

For example, if a few QPs are used by the application from the same
thread, the hardware doorbell can be shared between them. This removes
the need to allocate a dedicated doorbell per QP, or to share doorbells
that are handled by different threads.

On the other hand, if a few QPs are not logically related, the
application can get a dedicated resource per QP and avoid taking a lock
in its data path flow to protect the doorbell update operations.

This RFC patch introduces the following verbs extensions to support that:
1) Extend ibv_query_device_ex to return max_resource_domains.
This attribute represents the range of valid resource domain indices
[0..max_resource_domains] that resources can be part of. Each driver
that supports this feature will return a positive value, declaring the
maximum number of unique resource domains that can be used.

2) Extend ibv_create_qp_ex to accept a resource_domain attribute,
enabling association of a QP with a specific resource domain. Each
library will use the input value to manage the related hardware
resources and reduce the number of locks required.

The application is required to use a single context when accessing any
of the IB objects within each resource domain.
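
For illustration, a minimal usage sketch of the proposed extension
(error handling and the other mandatory QP attributes are omitted; the
field and mask names follow the verbs.h changes in the patch below):

        struct ibv_device_attr_ex dev_attr = {};
        struct ibv_qp_init_attr_ex qp_attr = {};
        struct ibv_qp *qp;

        ibv_query_device_ex(ctx, NULL, &dev_attr);

        /* Place this QP in resource domain 0; other resources created
         * by the same thread would pass the same index so the provider
         * can share HW resources (e.g. a doorbell) between them. */
        qp_attr.comp_mask |= IBV_QP_INIT_ATTR_RES_DOMAIN;
        qp_attr.resource_domain = 0; /* <= dev_attr.max_resource_domains */

        qp = ibv_create_qp_ex(ctx, &qp_attr);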

Specifically, we plan to use this API in the mlx5 driver to better
manage the doorbell (UAR), which is a limited hardware resource, by
exposing an option from the kernel to allocate it on demand and either
share it among QPs or attach it to a specific QP, based on the
application needs expressed by the resource domain property.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
---
 libibverbs/verbs.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Jason Gunthorpe Sept. 18, 2017, 8:54 p.m. UTC | #1
On Sun, Sep 17, 2017 at 02:48:42PM +0300, Yishai Hadas wrote:

> A resource domain represents a collection of IB resources (such as QPs,
> CQs, etc.) which are accessed by an application from the same context
> (e.g. the same thread) and as such can share hardware and software
> resources, allowing smarter resource management by the provider library
> and better performance.

This sounds exactly like a PD to me. Why do we need another construct?
What is wrong with a 'thread-unsafe' flag during PD creation, and then
containing the shared resources in the PD?

>  	uint32_t		raw_packet_caps; /* Use ibv_raw_packet_caps */
> +	uint32_t		max_resource_domains;

Even with this approach, I'm not sure a max makes much sense; this
value should just be hashed into whatever range the provider has on a
resource-by-resource basis.

Jason
Alex Rosenbaum Sept. 19, 2017, 2:05 p.m. UTC | #2
On Mon, Sep 18, 2017 at 11:54 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Sun, Sep 17, 2017 at 02:48:42PM +0300, Yishai Hadas wrote:
>
>> A resource domain represents a collection of IB resources (such as QPs,
>> CQs, etc.) which are accessed by an application from the same context
>> (e.g. the same thread) and as such can share hardware and software
>> resources, allowing smarter resource management by the provider library
>> and better performance.
>
> This sounds exactly like a PD to me. Why do we need another construct?
> What is wrong with a 'thread-unsafe' flag during PD creation, and then
> containing the shared resources in the PD?

PD does not fit well enough. It does not cover CQs in the allocation
flow, but is limited to QPs, SRQs and WQs.
Also, using multiple PDs would require a dedicated MR per PD, instead
of sharing a single MR for the entire memory used across multiple
threads / QPs.
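
To illustrate (MR registration is PD-scoped today, so splitting
resources across PDs forces a duplicate registration of the same
buffer for every PD):

        /* one registration of the *same* buffer per PD */
        struct ibv_mr *mr1 = ibv_reg_mr(pd1, buf, len, IBV_ACCESS_LOCAL_WRITE);
        struct ibv_mr *mr2 = ibv_reg_mr(pd2, buf, len, IBV_ACCESS_LOCAL_WRITE);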

>
>>       uint32_t                raw_packet_caps; /* Use ibv_raw_packet_caps */
>> +     uint32_t                max_resource_domains;
>
> Even with this approach, I'm not sure a max makes much sense; this
> value should just be hashed into whatever range the provider has on a
> resource-by-resource basis.

By exposing max_resource_domains we allow the application to decide
which element will get hurt if it reaches that limit.
If we leave the logic hidden inside the libraries, the application
loses control over the trade-off between best performance and better
average performance.

Alex
Jason Gunthorpe Sept. 21, 2017, 1:41 a.m. UTC | #3
On Tue, Sep 19, 2017 at 05:05:11PM +0300, Alex Rosenbaum wrote:
> On Mon, Sep 18, 2017 at 11:54 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Sun, Sep 17, 2017 at 02:48:42PM +0300, Yishai Hadas wrote:
> >
> >> A resource domain represents a collection of IB resources (such as QPs,
> >> CQs, etc.) which are accessed by an application from the same context
> >> (e.g. the same thread) and as such can share hardware and software
> >> resources, allowing smarter resource management by the provider library
> >> and better performance.
> >
> > This sounds exactly like a PD to me. Why do we need another construct?
> > What is wrong with a 'thread-unsafe' flag during PD creation, and then
> > containing the shared resources in the PD?
> 
> PD does not fit well enough. It does not cover CQs in the allocation
> flow, but is limited to QPs, SRQs and WQs.

Well, CQ could gain a PD parameter, so that isn't a big deal..

> Also, using multiple PDs would require a dedicated MR per PD, instead
> of sharing a single MR for the entire memory used across multiple
> threads / QPs.

This is a bigger problem, but it could be solved, though perhaps a bit
messily..

Your concept patch requires adding a new input to every single object
creation function, which is a pretty big API change..

On the other hand, the PD is already an input to almost all of the
required creation functions, so it is much less API churn.

But, as you say, you'd have to solve the MR sharing problem at least
to make PD workable. Perhaps some kind of 'child' PD that was only for
threading..

> >>       uint32_t                raw_packet_caps; /* Use ibv_raw_packet_caps */
> >> +     uint32_t                max_resource_domains;
> >
> > Even with this approach, I'm not sure a max makes much sense; this
> > value should just be hashed into whatever range the provider has on a
> > resource-by-resource basis.
>
> By exposing max_resource_domains we allow the application to decide
> which element will get hurt if it reaches that limit.
> If we leave the logic hidden inside the libraries, the application
> loses control over the trade-off between best performance and better
> average performance.

That goes both ways: a single 'max' does not make sense if there are
multiple different sorts of objects being controlled (eg UAR and
something else). It only works well in your context of having a single
constrained UAR.

Probably the better option is to introduce something like a 'struct
verbs_locking_domain *' input which is a proper object and can have a
range of operators.

I would say each created locking domain object is just strided
across the various limited resources in the device.

'dv' accessors, eg mlxdv_set_locking_uar() can be provided to directly
assign specific limited resources under the control of the
application. 'dv' query functions can be used to directly learn the
specific limits for each specific thing.

This way we don't have to revisit things. This would also be a good
place to centralize all the crazy no-threading flags we have
accumulated in various places and ways (including the stupid
environment variables). A 'struct verbs_locking_domain' option that
switches off all internal locking would make sense too.
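
A hypothetical shape for such an object, just to make the idea
concrete (all names here are invented, not a concrete proposal):

        struct verbs_locking_domain {
                struct ibv_context *context;
                uint32_t            flags; /* e.g. a no-internal-locking flag */
        };

        struct verbs_locking_domain *
        verbs_alloc_locking_domain(struct ibv_context *ctx, uint32_t flags);
        void verbs_free_locking_domain(struct verbs_locking_domain *ld);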

Jason
Alex Rosenbaum Sept. 26, 2017, 9:09 a.m. UTC | #4
On Thu, Sep 21, 2017 at 4:41 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> But, as you say, you'd have to solve the MR sharing problem at least
> to make PD workable. Perhaps some kind of 'child' PD that was only for
> threading..

OK, this definitely sounds like an interesting approach:
// alloc a child ibv_pd, which inherits the parent properties and adds
// some limitations (e.g. thread domain, memory domain)

Two options I'm thinking in this spirit:

1) Extend the original alloc PD API (stay with a single PD alloc entry
point):

struct ibv_pd_init_attr {
        struct ibv_pd  *parent_pd; /* NULL to alloc a parent PD, or a
                                      parent PD to alloc a child PD */
        int             flags;     /* e.g. IBV_PD_PARENT |
                                      IBV_PD_INIT_ATTR_THREAD_DOMAIN */
};

struct ibv_pd *ibv_alloc_pd_ex(struct ibv_context *ctx,
                               struct ibv_pd_init_attr *attr);

2) New dedicated alloc Child PD API:

/* e.g. flags = IBV_PD_INIT_ATTR_THREAD_DOMAIN */
struct ibv_pd *ibv_alloc_child_pd(struct ibv_pd *pd, int flags);
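
Either way, usage would look roughly like this (assuming option 2 and
the flag semantics above; ibv_alloc_child_pd is of course not existing
API yet):

        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        /* a single MR, shared by all threads */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

        /* per-thread child PD: objects created under it may share a
         * doorbell and skip internal locking */
        struct ibv_pd *tpd = ibv_alloc_child_pd(pd,
                                                IBV_PD_INIT_ATTR_THREAD_DOMAIN);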


> That goes both ways: a single 'max' does not make sense if there are
> multiple different sorts of objects being controlled (eg UAR and
> something else). It only works well in your context of having a single
> constrained UAR.
I agree.


> Probably the better option is to introduce something like a 'struct
> verbs_locking_domain *' input which is a proper object and can have a
> range of operators.

I did not understand how the 'struct verbs_locking_domain' comes into
play here. Regarding the child PD, I'm assuming the 'struct ibv_pd'
will be encapsulated in a 'struct verbs_pd', and/or maybe in a 'struct
provider_pd'. Any linkage and dedicated resources should be stored in
the internal structures.


> I would say each created locking domain object is just strided
> across the various limited resources in the device.

With the above verbs API for a child PD on a dedicated thread, the
user gets a best-effort allocation of thread resources. I agree that
striding across resources sounds good.


> 'dv' accessors, eg mlxdv_set_locking_uar() can be provided to directly
> assign specific limited resources under the control of the
> application. 'dv' query functions can be used to directly learn the
> specific limits for each specific thing.

int mlx5dv_set_pd_uar_index(struct ibv_pd *pd, int uar_index);

So with a DV API, the application can selectively choose a UAR index on
Mellanox HW for a child (or parent) PD.
All objects created with this PD will use the chosen UAR index or
other thread options.
max_uar_index will be exposed by the DV device caps via
'mlx5dv_query_device()'. If the user exceeds max_uar_index, the above
function call will fail.
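
A rough flow sketch (the max_uar_index field and the set function are
the proposed additions here, not existing mlx5dv API):

        struct mlx5dv_context dv_attr = {};

        mlx5dv_query_device(ctx, &dv_attr);

        /* pin all objects created under this PD to UAR index 1 */
        if (dv_attr.max_uar_index >= 1)
                mlx5dv_set_pd_uar_index(pd, 1);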


Alex
Jason Gunthorpe Sept. 30, 2017, 3:59 p.m. UTC | #5
On Tue, Sep 26, 2017 at 12:09:12PM +0300, Alex Rosenbaum wrote:

> 2) New dedicated alloc Child PD API:
>
> /* e.g. flags = IBV_PD_INIT_ATTR_THREAD_DOMAIN */
> struct ibv_pd *ibv_alloc_child_pd(struct ibv_pd *pd, int flags);

I think I prefer this one.

This sounds like a nice idea, but I wonder what a provider
implementation will look like now that there are two types of PD
objects and so forth.

The providers will need helpers: 'get_pd()' to return the top-level PD
struct that has the driver data, and 'get_locking()' to return the
thread properties associated with that PD. All accesses to the PD will
have to go through those accessors.
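
Roughly, in a provider (struct layout invented for illustration;
container_of is the usual provider helper macro):

        struct verbs_pd {
                struct ibv_pd    ibv_pd;
                struct verbs_pd *parent;        /* NULL for a top-level PD */
                int              locking_flags; /* thread properties */
        };

        /* return the top-level PD that carries the driver data */
        static struct verbs_pd *get_pd(struct ibv_pd *pd)
        {
                struct verbs_pd *vpd =
                        container_of(pd, struct verbs_pd, ibv_pd);

                return vpd->parent ? vpd->parent : vpd;
        }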

> > Probably the better option is to introduce something like a 'struct
> > verbs_locking_domain *' input which is a proper object and can have a
> > range of operators.
> 
> I did not understand how the 'struct verbs_locking_domain' comes into
> play here. Regarding the child PD, I'm assuming the 'struct ibv_pd'
> will be encapsulated in a 'struct verbs_pd', and/or maybe in a 'struct
> provider_pd'. Any linkage and dedicated resources should be stored in
> the internal structures.

This was intended as an alternative. If the PD approach is no good,
then use this instead of just an integer.

In the PD case this object exists but is hidden from the uapi;
'get_locking()' returns it. E.g. a child PD would have a struct
verbs_locking_domain, with the driver-specific version stored as the
driver data of the PD object.

> int mlx5dv_set_pd_uar_index(struct ibv_pd *pd, int uar_index);
>
> So with a DV API, the application can selectively choose a UAR index on
> Mellanox HW for a child (or parent) PD.
> All objects created with this PD will use the chosen UAR index or
> other thread options.
> max_uar_index will be exposed by the DV device caps via
> 'mlx5dv_query_device()'. If the user exceeds max_uar_index, the above
> function call will fail.

Yes.

Jason

Patch

diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
index 8cdf8ab..b2b51c1 100644
--- a/libibverbs/verbs.h
+++ b/libibverbs/verbs.h
@@ -272,6 +272,7 @@  struct ibv_device_attr_ex {
 	uint32_t		max_wq_type_rq;
 	struct ibv_packet_pacing_caps packet_pacing_caps;
 	uint32_t		raw_packet_caps; /* Use ibv_raw_packet_caps */
+	uint32_t		max_resource_domains;
 };
 
 enum ibv_mtu {
@@ -785,7 +786,8 @@  enum ibv_qp_init_attr_mask {
 	IBV_QP_INIT_ATTR_MAX_TSO_HEADER = 1 << 3,
 	IBV_QP_INIT_ATTR_IND_TABLE	= 1 << 4,
 	IBV_QP_INIT_ATTR_RX_HASH	= 1 << 5,
-	IBV_QP_INIT_ATTR_RESERVED	= 1 << 6
+	IBV_QP_INIT_ATTR_RES_DOMAIN	= 1 << 6,
+	IBV_QP_INIT_ATTR_RESERVED	= 1 << 7
 };
 
 enum ibv_qp_create_flags {
@@ -821,6 +823,7 @@  struct ibv_qp_init_attr_ex {
 	struct ibv_rwq_ind_table       *rwq_ind_tbl;
 	struct ibv_rx_hash_conf	rx_hash_conf;
 	uint32_t		source_qpn;
+	uint32_t		resource_domain;
 };
 
 enum ibv_qp_open_attr_mask {