
[v2,3/4] rsockets: distribute completion queue vectors among multiple cores

Message ID f028f54b9900ee34fcd1abd1d90d10e6@imap.linux.ibm.com (mailing list archive)
State Not Applicable, archived

Commit Message

Sreedhar Kodali Sept. 5, 2014, 1:18 p.m. UTC
From: Sreedhar Kodali <srkodali@linux.vnet.ibm.com>

     Distribute interrupt vectors among multiple cores while processing
     completion events.  The existing mechanism always defaults to
     core 0 for comp vector processing when a completion queue is
     created.  If the workload is very high, this results in a
     bottleneck at core 0, because the same core is used for both
     event and task processing.

     A '/comp_vector' option is exposed, the value of which is a range
     or comma-separated list of cores over which interrupt vectors are
     distributed.  If it is not set, the existing mechanism prevails
     and comp vector processing is directed to core 0.

     Signed-off-by: Sreedhar Kodali <srkodali@linux.vnet.ibm.com>
     ---
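
For illustration only (not part of the patch), here is a minimal
standalone sketch of how a sample comp_vector value such as "0-3,6"
would expand into the 64-bit eligibility mask that rs_configure()
builds in the patch below; the sample string and the main() wrapper
are assumptions made for this example.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
	char vbuf[] = "0-3,6";	/* assumed example file contents */
	uint64_t mask = 0;
	char *tok, *save, *tmp, *str;

	for (str = vbuf; ; str = NULL) {
		tok = strtok_r(str, ",", &save);
		if (!tok)
			break;

		int lo, hi;
		if (!(tmp = strpbrk(tok, "-"))) {
			lo = hi = atoi(tok);	/* single vector, e.g. "6" */
		} else {
			*tmp = '\0';		/* range, e.g. "0-3" */
			lo = atoi(tok);
			hi = atoi(tmp + 1);
		}
		/* clamp to the 64 bits the mask can hold, as the patch does */
		lo = lo < 0 ? 0 : (lo > 63 ? 63 : lo);
		hi = hi < 0 ? 0 : (hi > 63 ? 63 : hi);
		for (int v = lo; v <= hi; v++)
			mask |= (uint64_t)1 << v;
	}
	/* "0-3,6" -> 0x4f: vectors 0, 1, 2, 3 and 6 are eligible */
	printf("comp_vector_mask = 0x%llx\n", (unsigned long long)mask);
	return 0;
}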




Comments

Bart Van Assche Sept. 5, 2014, 3:47 p.m. UTC | #1
On 09/05/14 15:18, Sreedhar Kodali wrote:
> From: Sreedhar Kodali <srkodali@linux.vnet.ibm.com>
>      Distribute interrupt vectors among multiple cores while processing
>      completion events.  By default the existing mechanism always
>      defaults to core 0 for comp vector processing during the creation
>      of a completion queue.  If the workload is very high, then this
>      results in bottleneck at core 0 because the same core is used for
>      both event and task processing.
>
>      A '/comp_vector' option is exposed, the value of which is a range
>      or comma separated list of cores for distributing interrupt
>      vectors.  If not set, the existing mechanism prevails where in
>      comp vector processing is directed to core 0.

Shouldn't "core" be changed into "completion vector" in this patch 
description ? It is not possible to select a CPU core directly via the 
completion vector argument of ib_create_cq(). Which completion vector 
maps to which CPU core depends on how /proc/irq/<irq>/smp_affinity has 
been configured.
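
For illustration (a hypothetical helper, not something the patch adds),
one way to inspect that mapping from user space is to read the
smp_affinity file of the IRQ that services a given completion vector:

#include <stdio.h>

/* Print the CPU affinity mask of a given IRQ, e.g. the MSI-X interrupt
 * behind a completion vector, by reading /proc/irq/<irq>/smp_affinity.
 * The IRQ number must be looked up separately (e.g. in /proc/interrupts). */
static void print_irq_affinity(int irq)
{
	char path[64], mask[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(mask, sizeof(mask), f))
		printf("IRQ %d -> CPU mask %s", irq, mask);
	fclose(f);
}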

> +    if ((f = fopen(RS_CONF_DIR "/comp_vector", "r"))) {

Is it optimal to have a single global configuration file for the 
completion vector mask for all applications ? Suppose that a server is 
equipped with two CPU sockets, one PCIe bus and one HCA with one port 
and that that HCA has allocated eight completion vectors. If IRQ 
affinity is configured such that the first four completion vectors are 
associated with the first CPU socket and the second four completion 
vectors with the second CPU socket then to achieve optimal performance 
applications that run on the first socket should only use completion 
vectors 0..3 and applications that run on the second socket should only 
use completion vectors 4..7. Should this kind of configuration be 
supported by the rsockets library ?

Bart.

Sreedhar Kodali Sept. 6, 2014, 3:06 a.m. UTC | #2
Hi Bart,

Thanks for your comments.  Please find my responses to your queries 
below:

On 2014-09-05 21:17, Bart Van Assche wrote:
> On 09/05/14 15:18, Sreedhar Kodali wrote:
>> From: Sreedhar Kodali <srkodali@linux.vnet.ibm.com>
>>      Distribute interrupt vectors among multiple cores while 
>> processing
>>      completion events.  By default the existing mechanism always
>>      defaults to core 0 for comp vector processing during the creation
>>      of a completion queue.  If the workload is very high, then this
>>      results in bottleneck at core 0 because the same core is used for
>>      both event and task processing.
>> 
>>      A '/comp_vector' option is exposed, the value of which is a range
>>      or comma separated list of cores for distributing interrupt
>>      vectors.  If not set, the existing mechanism prevails where in
>>      comp vector processing is directed to core 0.
> 
> Shouldn't "core" be changed into "completion vector" in this patch
> description ? It is not possible to select a CPU core directly via the
> completion vector argument of ib_create_cq(). Which completion vector
> maps to which CPU core depends on how /proc/irq/<irq>/smp_affinity has
> been configured.
> 

Sure.  We need to revise the description, given that the actual routing
of interrupt vector processing is set at the system level via
smp_affinity.

>> +    if ((f = fopen(RS_CONF_DIR "/comp_vector", "r"))) {
> 
> Is it optimal to have a single global configuration file for the
> completion vector mask for all applications ? Suppose that a server is
> equipped with two CPU sockets, one PCIe bus and one HCA with one port
> and that that HCA has allocated eight completion vectors. If IRQ
> affinity is configured such that the first four completion vectors are
> associated with the first CPU socket and the second four completion
> vectors with the second CPU socket then to achieve optimal performance
> applications that run on the first socket should only use completion
> vectors 0..3 and applications that run on the second socket should
> only use completion vectors 4..7. Should this kind of configuration be
> supported by the rsockets library ?

The provided patch only covers equal distribution of completion vectors
among the specified cores.  It is not generic enough to cover the
scenario you have suggested.  To support it, we would need to

a) alter the configuration file format to recognize CPU sockets, or
b) introduce separate configuration files for each CPU socket, and
c) alter the distribution logic to recognize socket-based grouping.

This definitely increases the complexity of the code.  I am not sure
whether this is necessary to cover most of the general use cases.
If it is, then we can target it.

> 
> Bart.
> 

Thank You.

- Sreedhar

Bart Van Assche Sept. 8, 2014, 6:47 a.m. UTC | #3
On 09/06/14 05:06, Sreedhar Kodali wrote:
> The provided patch only covers equal distribution of completion vectors
> among the specified cores.  But, it is not generic enough to cover
> the scenario you have suggested.  If it were to be the case, then
> we need to
>
> a) alter the configuration file format to recognize cpu sockets
> b) or introduce separate configuration files for each cpu socket
> c) alter the distribution logic to recognize socket based grouping
>
> This definitely increases the complexity of the code.  Not sure
> whether this is necessary to cover most of the general use cases.
> If so, then we can target it.

Hello Sreedhar,

Sorry if I wasn't clear enough. What I meant is that I think that it 
would be useful to be able to configure a different completion vector 
mask per process instead of one global completion vector mask for all 
processes that use the rsockets library. Being able to configure one 
completion vector mask per process would make it possible to prevent 
one busy process from negatively impacting the performance of another 
process that uses RDMA.

Bart.

Sreedhar Kodali Sept. 8, 2014, 2:27 p.m. UTC | #4
Hi Bart,

Thanks for your insights.  Your suggestion of having a separate
vector per process is really useful when multiple processes are
simultaneously running in the same environment.  At the implementation
level this requires an extended configuration file format.  Could we
possibly take this up in a next iteration?

Thank You.

- Sreedhar

On 2014-09-08 12:17, Bart Van Assche wrote:
> On 09/06/14 05:06, Sreedhar Kodali wrote:
>> The provided patch only covers equal distribution of completion 
>> vectors
>> among the specified cores.  But, it is not generic enough to cover
>> the scenario you have suggested.  If it were to be the case, then
>> we need to
>> 
>> a) alter the configuration file format to recognize cpu sockets
>> b) or introduce separate configuration files for each cpu socket
>> c) alter the distribution logic to recognize socket based grouping
>> 
>> This definitely increases the complexity of the code.  Not sure
>> whether this is necessary to cover most of the general use cases.
>> If so, then we can target it.
> 
> Hello Sreedhar,
> 
> Sorry if I wasn't clear enough. What I meant is that I think that it
> would be useful to be able to configure a different completion vector
> mask per process instead of one global completion vector mask for all
> processes that use the rsockets library. Being able to configure one
> completion vector mask per process would make it possible to avoid
> that one busy process negatively impacts the performance of another
> process that uses RDMA.
> 
> Bart.
> 

Bart Van Assche Sept. 8, 2014, 6:22 p.m. UTC | #5
On 09/08/14 16:27, Sreedhar Kodali wrote:
> Thanks for your insights.  Your suggestion of having a separate
> vector per process is really useful when multiple processes are
> simultaneously running in the same environment.  At the implementation
> level this requires an extended configuration file format. Can't
> we possibly think of this as a next iteration?

This patch introduces a new configuration file 
(/etc/rdma/rsocket/comp_vector). Users expect such configuration files 
to remain supported after a software upgrade. In other words, the 
introduction of a new configuration file deserves careful consideration. 
Do we really need this new configuration file ? Why not e.g. change the 
default behavior of the rsockets library to spread the interrupt 
processing workload over all completion vectors supported by the HCA ? 
Furthermore, the number of completion vectors that is supported by an 
HCA not only depends on the HCA firmware but also on the number of 
completion vectors an HCA could allocate during driver initialization. 
The number of MSI-X vectors available on an x86-64 system is limited 
(224). Dual-socket x86-64 motherboards with up to 11 PCIe slots are 
available commercially. If all PCIe slots are populated, it is almost 
certain that one of the PCIe cards won't be assigned as many MSI-X 
vectors as it could use.

Sorry but I'm not convinced that using the same completion vector mask 
for all HCA's and all processes is a good idea. Or is there perhaps 
something I have overlooked ?

Bart.
Sreedhar Kodali Sept. 9, 2014, 12:28 p.m. UTC | #6
On 2014-09-08 23:52, Bart Van Assche wrote:
> On 09/08/14 16:27, Sreedhar Kodali wrote:
>> Thanks for your insights.  Your suggestion of having a separate
>> vector per process is really useful when multiple processes are
>> simultaneously running in the same environment.  At the implementation
>> level this requires an extended configuration file format. Can't
>> we possibly think of this as a next iteration?
> 
> This patch introduces a new configuration file
> (/etc/rdma/rsocket/comp_vector). User expect that such configuration
> files remain supported after a software upgrade. In other words,
> introduction of a new configuration file deserves careful
> consideration. Do we really need this new configuration file ? Why not
> e.g. to change the default behavior of the rsockets library into
> spreading the interrupt processing workload over all completion
> vectors supported by the HCA ? Furthermore, the number of completion
> vectors that is supported by an HCA not only depends on the HCA
> firmware but also on the number of completion vectors a HCA could
> allocate during driver initialization. The number of MSI-X vectors
> available on an x86-64 system is limited (224). Dual-socket x86-64
> motherboards with up to 11 PCIe slots are available commercially. If
> all PCIe-slots are populated it's almost sure that one of the PCIe
> cards won't be assigned as many MSI-X vectors as it could use.
> 
> Sorry but I'm not convinced that using the same completion vector mask
> for all HCA's and all processes is a good idea. Or is there perhaps
> something I have overlooked ?
> 
> Bart.

Hi Bart,

I agree with your suggestions - working on the revised patch 
accordingly.

Thank You.

- Sreedhar

Sreedhar Kodali Sept. 11, 2014, 12:34 p.m. UTC | #7
On 2014-09-09 17:58, Sreedhar Kodali wrote:
> On 2014-09-08 23:52, Bart Van Assche wrote:
>> On 09/08/14 16:27, Sreedhar Kodali wrote:
>>> Thanks for your insights.  Your suggestion of having a separate
>>> vector per process is really useful when multiple processes are
>>> simultaneously running in the same environment.  At the 
>>> implementation
>>> level this requires an extended configuration file format. Can't
>>> we possibly think of this as a next iteration?
>> 
>> This patch introduces a new configuration file
>> (/etc/rdma/rsocket/comp_vector). User expect that such configuration
>> files remain supported after a software upgrade. In other words,
>> introduction of a new configuration file deserves careful
>> consideration. Do we really need this new configuration file ? Why not
>> e.g. to change the default behavior of the rsockets library into
>> spreading the interrupt processing workload over all completion
>> vectors supported by the HCA ? Furthermore, the number of completion
>> vectors that is supported by an HCA not only depends on the HCA
>> firmware but also on the number of completion vectors a HCA could
>> allocate during driver initialization. The number of MSI-X vectors
>> available on an x86-64 system is limited (224). Dual-socket x86-64
>> motherboards with up to 11 PCIe slots are available commercially. If
>> all PCIe-slots are populated it's almost sure that one of the PCIe
>> cards won't be assigned as many MSI-X vectors as it could use.
>> 
>> Sorry but I'm not convinced that using the same completion vector mask
>> for all HCA's and all processes is a good idea. Or is there perhaps
>> something I have overlooked ?
>> 
>> Bart.
> 
> Hi Bart,
> 
> I agree with your suggestions - working on the revised patch 
> accordingly.
> 
> Thank You.
> 
> - Sreedhar
> 

Hi Bart,

I have sent the revised patch v4 that groups and assigns comp vectors 
per process as you suggested.  Please go through it.

Thank You.

- Sreedhar

Bart Van Assche Sept. 15, 2014, 9:06 a.m. UTC | #8
On 09/11/14 14:34, Sreedhar Kodali wrote:
> I have sent the revised patch v4 that groups and assigns comp vectors
> per process as you suggested.  Please go through it.

Shouldn't there be agreement about the approach before a patch is 
reworked and reposted ? I think the following aspects deserve wider 
discussion, and agreement about these aspects is needed before the 
patch itself is discussed further:
- Do we need to discuss a policy that defines which completion vectors 
are associated with which CPU sockets ? Such a policy is needed to allow 
RDMA software to constrain RDMA completions to a single CPU socket and 
hence to avoid inter-socket cache misses. One possible policy is to 
associate an equal number of completion vectors with each CPU socket. If 
e.g. 8 completion vectors are provided by an HCA and two CPU sockets are 
available, then completion vectors 0..3 could be bound to the CPU socket 
with index 0 and vectors 4..7 could be bound to the CPU socket that has 
been assigned index 1 by the Linux kernel (a sketch of this mapping 
follows the list below).
- Would it be useful to modify the irqbalance software such that it 
becomes aware of HCA's that provide multiple MSI-X vectors and hence 
automatically applies the policy mentioned in the previous bullet ?
- What should the default behavior be of the rsockets library ? Keep the 
current behavior (use completion vector 0), select one of the available 
completion vectors in a round-robin fashion or perhaps yet another policy ?
- The number of completion vectors provided by an HCA can change after a 
PCIe card has been added to or removed from the system. Such changes 
affect the number of bits of the completion mask that are relevant. How 
to handle this ?
- If a configuration option is added in the rsockets library to specify 
which completion vectors a process is allowed to use, should it be 
possible to specify individual completion vectors or is it sufficient if 
CPU socket numbers can be specified ? That last choice has the advantage 
that it is independent of the exact number of completion vectors that 
has been allocated by an HCA.
- How to cope with systems in which multiple RDMA HCA's are present and 
in which each HCA provides a different number of completion vectors ? Is 
a completion vector bitmask a proper means for such systems to specify 
which completion vectors should be used ?
- Do we need to treat virtual machine guests and CPU hot-plugging 
separately or can we rely on the information about CPU sockets that is 
provided by the hypervisor to the guest ?
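
For illustration, a minimal sketch of the equal-split policy from the 
first bullet above; the helper name and its parameters are assumptions 
made for this example, not an existing API:

#include <stdio.h>

/* Hypothetical helper: equal-split mapping of a completion vector to a
 * CPU socket.  Vectors are divided evenly across sockets. */
static int comp_vector_to_socket(int vector, int num_comp_vectors,
				 int num_sockets)
{
	int per_socket = num_comp_vectors / num_sockets;

	if (per_socket == 0)
		return 0;	/* fewer vectors than sockets: fall back to socket 0 */

	int socket = vector / per_socket;
	return socket < num_sockets ? socket : num_sockets - 1;
}

int main(void)
{
	/* 8 completion vectors, 2 sockets: 0..3 -> socket 0, 4..7 -> socket 1 */
	for (int v = 0; v < 8; v++)
		printf("vector %d -> socket %d\n",
		       v, comp_vector_to_socket(v, 8, 2));
	return 0;
}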

Bart.
Sreedhar Kodali Sept. 16, 2014, 4:33 a.m. UTC | #9
Hi Bart,

Thanks for your detailed thoughts and insights into comp vector
assignment.  As you rightly pointed out, let's hear from the wider
community as well before we attempt another iteration on the patch.
I have included below some details about the latest patch to show
where we currently stand.

On 2014-09-15 14:36, Bart Van Assche wrote:
> On 09/11/14 14:34, Sreedhar Kodali wrote:
>> I have sent the revised patch v4 that groups and assigns comp vectors
>> per process as you suggested.  Please go through it.
> 
> Shouldn't there be agreement about the approach before a patch is
> reworked and reposted ? I think the following aspects deserve wider
> discussion and agreement about these aspects is needed before the
> patch itself is discussed further:

Absolutely.

> - Do we need to discuss a policy that defines which completion vectors
> are associated with which CPU sockets ? Such a policy is needed to
> allow RDMA software to constrain RDMA completions to a single CPU
> socket and hence to avoid inter-socket cache misses. One possible
> policy is to associate an equal number of completion vectors with each
> CPU socket. If e.g. 8 completion vectors are provided by an HCA and
> two CPU sockets are available then completion vectors 0..3 could be
> bound to the CPU socket with index 0 and vectors 4..7 could be bound
> to CPU socket that has been assigned index 1 by the Linux kernel.
> - Would it be useful to modify the irqbalance software such that it
> becomes aware of HCA's that provide multiple MSI-X vectors and hence
> automatically applies the policy mentioned in the previous bullet ?

Having a policy-based approach is good.  But we need to explore where
in the OFED stack this policy can be specified and enforced.  I am not
sure rsockets would be the right place to hold policy-based extensions,
as it is simply an abstraction layer on top of the rdmacm library.

> - What should the default behavior be of the rsockets library ? Keep
> the current behavior (use completion vector 0), select one of the
> available completion vectors in a round-robin fashion or perhaps yet
> another policy ?

Keep the current behavior if the user has not specified any option.

> - The number of completion vectors provided by a HCA can change after
> a PCIe card has been added to or removed from the system. Such changes
> affect the number of bits of the completion mask that are relevant.
> How to handle this ?

The completion-mask-based approach has been dropped in favor of storing
the completion vector values directly.

> - If a configuration option is added in the rsockets library to
> specify which completion vectors a process is allowed to use, should
> it be possible to specify individual completion vectors or is it
> sufficient if CPU socket numbers can be specified ? That last choice
> has the advantage that it is independent of the exact number of
> completion vectors that has been allocated by an HCA.

Individual completion vectors are specified through the config option.
This is on the premise that the user is aware of the allocation.

> - How to cope with systems in which multiple RDMA HCA's are present
> and in which each HCA provides a different number of completion
> vectors ? Is a completion vector bitmask a proper means for such
> systems to specify which completion vectors should be used ?

As mentioned above, the bitmask-based approach has been dropped
in favor of absolute values in the latest v4 patch.

> - Do we need to treat virtual machine guests and CPU hot-plugging
> separately or can we rely on the information about CPU sockets that is
> provided by the hypervisor to the guest ?
> 
> Bart.

Thank You.

- Sreedhar

Hefty, Sean Sept. 16, 2014, 4:57 a.m. UTC | #10
> Thanks for your detailed thoughts and insights into comp vector
> assignment.  As you rightly pointed out, let's hear from wider
> community as well before we attempt another iteration on the path.
> I have included below some details about the latest patch to
> understand where we are currently.

Are there any similar mechanisms defined elsewhere that we can use as guidance?

I don't recall: does completion vector 0 refer to a specific vector, or does it mean to use a default assignment?

- Sean
Sreedhar Kodali Sept. 17, 2014, 5:54 a.m. UTC | #11
On 2014-09-16 10:27, Hefty, Sean wrote:
>> Thanks for your detailed thoughts and insights into comp vector
>> assignment.  As you rightly pointed out, let's hear from wider
>> community as well before we attempt another iteration on the path.
>> I have included below some details about the latest patch to
>> understand where we are currently.
> 
> Are there any similar mechanisms defined elsewhere that we can use as 
> guidance?
> 
> I don't recall, does completion vector 0 refer to a specific vector or
> does it mean use a default assignment?
> 
> - Sean

Hi Sean,

I am not sure about similar mechanisms, but some references to IRQ
balancing are available on the net.  I think comp vector 0 is a
specific vector that is used by default.
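
For reference, a minimal sketch (the wrapper function below is
hypothetical) of how the comp_vector argument of ibv_create_cq() is
used: it is an index in the range 0 .. num_comp_vectors - 1, so 0
simply names the first vector rather than requesting an automatic
assignment.

#include <stddef.h>
#include <infiniband/verbs.h>

/* Hypothetical wrapper: create a CQ on a chosen completion vector,
 * falling back to vector 0 if the index is out of range. */
static struct ibv_cq *create_cq_on_vector(struct ibv_context *ctx,
					  struct ibv_comp_channel *chan,
					  int cqe, int vector)
{
	if (vector < 0 || vector >= ctx->num_comp_vectors)
		vector = 0;	/* comp_vector must be < num_comp_vectors */

	return ibv_create_cq(ctx, cqe, NULL /* cq_context */, chan, vector);
}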

Thank You.

- Sreedhar


Patch

diff --git a/src/rsocket.c b/src/rsocket.c
index b70d56a..ffea0ca 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -116,6 +116,8 @@  static uint32_t def_mem = (1 << 17);
  static uint32_t def_wmem = (1 << 17);
  static uint32_t polling_time = 10;
  static uint16_t restart_onintr = 0;
+static uint16_t next_comp_vector = 0;
+static uint64_t comp_vector_mask = 0;

  /*
   * Immediate data format is determined by the upper bits
@@ -548,6 +550,37 @@  void rs_configure(void)
  		(void) fscanf(f, "%hu", &restart_onintr);
  		fclose(f);
  	}
+
+	if ((f = fopen(RS_CONF_DIR "/comp_vector", "r"))) {
+		char vbuf[256];
+		char *vptr;
+		vptr = fgets(vbuf, sizeof(vbuf), f);
+		fclose(f);
+		if (vptr) {
+			char *tok, *save, *tmp, *str, *tok2;
+			int lvect, uvect, vect;
+
+			for (str = vptr; ; str = NULL) {
+				tok = strtok_r(str, ",", &save);
+				if (tok == NULL) {
+					break;
+				}
+				if (!(tmp = strpbrk(tok, "-"))) {
+					lvect = uvect = atoi(tok);
+				} else {
+					tok2 = tmp + 1;
+					*tmp = '\0';
+					lvect = atoi(tok);
+					uvect = atoi(tok2);
+				}
+				lvect = (lvect < 0) ? 0 : ((lvect > 63) ? 63 : lvect);
+				uvect = (uvect < 0) ? 0 : ((uvect > 63) ? 63 : uvect);
+				for (vect = lvect; vect <= uvect; vect++) {
+					comp_vector_mask |= ((uint64_t)1 << vect);
+				}
+			}
+		}
+	}
  	init = 1;
  out:
  	pthread_mutex_unlock(&mut);
@@ -762,12 +795,27 @@  static int ds_init_bufs(struct ds_qp *qp)
   */
  static int rs_create_cq(struct rsocket *rs, struct rdma_cm_id *cm_id)
  {
+	int vector = 0;
+
  	cm_id->recv_cq_channel = ibv_create_comp_channel(cm_id->verbs);
  	if (!cm_id->recv_cq_channel)
  		return -1;

+	if (comp_vector_mask) {
+		int found = 0;
+		while (found == 0) {
+			if (comp_vector_mask & ((uint64_t) 1 << next_comp_vector)) {
+				found = 1;
+				vector = next_comp_vector;
+			}
+			if (++next_comp_vector == 64) {
+				next_comp_vector = 0;
+			}
+		}
+	}
+
  	cm_id->recv_cq = ibv_create_cq(cm_id->verbs, rs->sq_size + rs->rq_size,
-				       cm_id, cm_id->recv_cq_channel, 0);
+				       cm_id, cm_id->recv_cq_channel, vector);
  	if (!cm_id->recv_cq)
  		goto err1;