Message ID | f028f54b9900ee34fcd1abd1d90d10e6@imap.linux.ibm.com (mailing list archive)
---|---
State | Not Applicable, archived
On 09/05/14 15:18, Sreedhar Kodali wrote:
> From: Sreedhar Kodali <srkodali@linux.vnet.ibm.com>
>
> Distribute interrupt vectors among multiple cores while processing
> completion events. By default the existing mechanism always
> defaults to core 0 for comp vector processing during the creation
> of a completion queue. If the workload is very high, then this
> results in bottleneck at core 0 because the same core is used for
> both event and task processing.
>
> A '/comp_vector' option is exposed, the value of which is a range
> or comma separated list of cores for distributing interrupt
> vectors. If not set, the existing mechanism prevails where in
> comp vector processing is directed to core 0.

Shouldn't "core" be changed into "completion vector" in this patch
description? It is not possible to select a CPU core directly via the
completion vector argument of ib_create_cq(). Which completion vector
maps to which CPU core depends on how /proc/irq/<irq>/smp_affinity has
been configured.

> +	if ((f = fopen(RS_CONF_DIR "/comp_vector", "r"))) {

Is it optimal to have a single global configuration file for the
completion vector mask for all applications? Suppose that a server is
equipped with two CPU sockets, one PCIe bus and one HCA with one port,
and that that HCA has allocated eight completion vectors. If IRQ
affinity is configured such that the first four completion vectors are
associated with the first CPU socket and the second four completion
vectors with the second CPU socket, then to achieve optimal performance
applications that run on the first socket should only use completion
vectors 0..3 and applications that run on the second socket should only
use completion vectors 4..7. Should this kind of configuration be
supported by the rsockets library?

Bart.
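For readers less familiar with the verbs interface, the sketch below shows where the completion vector index actually enters the picture. It is only an index into the event queues the driver allocated (bounded by ibv_context::num_comp_vectors); which CPU core ends up servicing that queue is governed by /proc/irq/<irq>/smp_affinity. The device choice, CQ depth and modulo arithmetic here are illustrative only and are not part of the rsockets patch.

#include <infiniband/verbs.h>

/*
 * Create a CQ on a chosen completion vector of the first RDMA device.
 * The vector argument only selects one of the device's event queues;
 * the CPU core that services that queue is decided by the system-level
 * /proc/irq/<irq>/smp_affinity settings, not by this call.
 */
static struct ibv_cq *create_cq_on_vector(int requested_vector)
{
        struct ibv_device **devs;
        struct ibv_context *ctx;
        struct ibv_comp_channel *channel;
        int vector = 0;

        devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0])
                return NULL;
        ctx = ibv_open_device(devs[0]);
        ibv_free_device_list(devs);
        if (!ctx)
                return NULL;

        /* Stay within what the driver allocated at initialization time. */
        if (ctx->num_comp_vectors > 0)
                vector = requested_vector % ctx->num_comp_vectors;

        channel = ibv_create_comp_channel(ctx);
        if (!channel)
                return NULL;

        return ibv_create_cq(ctx, 256, NULL, channel, vector);
}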
Hi Bart,

Thanks for your comments. Please find my responses to your queries below:

On 2014-09-05 21:17, Bart Van Assche wrote:
> On 09/05/14 15:18, Sreedhar Kodali wrote:
>> From: Sreedhar Kodali <srkodali@linux.vnet.ibm.com>
>>
>> Distribute interrupt vectors among multiple cores while processing
>> completion events. By default the existing mechanism always
>> defaults to core 0 for comp vector processing during the creation
>> of a completion queue. If the workload is very high, then this
>> results in bottleneck at core 0 because the same core is used for
>> both event and task processing.
>>
>> A '/comp_vector' option is exposed, the value of which is a range
>> or comma separated list of cores for distributing interrupt
>> vectors. If not set, the existing mechanism prevails where in
>> comp vector processing is directed to core 0.
>
> Shouldn't "core" be changed into "completion vector" in this patch
> description? It is not possible to select a CPU core directly via the
> completion vector argument of ib_create_cq(). Which completion vector
> maps to which CPU core depends on how /proc/irq/<irq>/smp_affinity has
> been configured.

Sure. We need to revise the description, given that the actual routing of
interrupt vector processing is set at the system level via smp_affinity.

>> +	if ((f = fopen(RS_CONF_DIR "/comp_vector", "r"))) {
>
> Is it optimal to have a single global configuration file for the
> completion vector mask for all applications? Suppose that a server is
> equipped with two CPU sockets, one PCIe bus and one HCA with one port,
> and that that HCA has allocated eight completion vectors. If IRQ
> affinity is configured such that the first four completion vectors are
> associated with the first CPU socket and the second four completion
> vectors with the second CPU socket, then to achieve optimal performance
> applications that run on the first socket should only use completion
> vectors 0..3 and applications that run on the second socket should only
> use completion vectors 4..7. Should this kind of configuration be
> supported by the rsockets library?

The provided patch only covers equal distribution of completion vectors
among the specified cores, but it is not generic enough to cover the
scenario you have suggested. If it were to be the case, then we would
need to either

a) alter the configuration file format to recognize CPU sockets,
b) introduce separate configuration files for each CPU socket, or
c) alter the distribution logic to recognize socket-based grouping.

This definitely increases the complexity of the code. Not sure whether
this is necessary to cover most of the general use cases. If so, then we
can target it.

Thank You.

- Sreedhar
On 09/06/14 05:06, Sreedhar Kodali wrote:
> The provided patch only covers equal distribution of completion vectors
> among the specified cores, but it is not generic enough to cover the
> scenario you have suggested. If it were to be the case, then we would
> need to either
>
> a) alter the configuration file format to recognize CPU sockets,
> b) introduce separate configuration files for each CPU socket, or
> c) alter the distribution logic to recognize socket-based grouping.
>
> This definitely increases the complexity of the code. Not sure whether
> this is necessary to cover most of the general use cases. If so, then
> we can target it.

Hello Sreedhar,

Sorry if I wasn't clear enough. What I meant is that I think it would be
useful to be able to configure a different completion vector mask per
process instead of one global completion vector mask for all processes
that use the rsockets library. Being able to configure one completion
vector mask per process would make it possible to prevent one busy
process from negatively impacting the performance of another process
that uses RDMA.

Bart.
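One way the per-process idea could be prototyped is to let an environment variable override the global file, so each process can be started with the vectors that match the socket it is pinned to. This is purely illustrative: rsockets has no such knob, and the RSOCKET_COMP_VECTOR variable name and helper below are invented for this sketch.

#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical per-process override: consult an environment variable
 * before falling back to the global RS_CONF_DIR "/comp_vector" file.
 * Returns 0 when a vector specification was copied into buf.
 */
static int rs_get_comp_vector_spec(char *buf, size_t len, FILE *global)
{
        const char *env = getenv("RSOCKET_COMP_VECTOR");

        if (env && *env) {
                snprintf(buf, len, "%s", env);
                return 0;
        }
        if (global && fgets(buf, (int)len, global))
                return 0;
        return -1;      /* not configured: keep using completion vector 0 */
}

In Bart's two-socket example, a process pinned to the second socket could then be launched with RSOCKET_COMP_VECTOR=4-7 while a process on the first socket uses 0-3.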
Hi Bart,

Thanks for your insights. Your suggestion of having a separate vector
per process is really useful when multiple processes are simultaneously
running in the same environment. At the implementation level this
requires an extended configuration file format. Can't we possibly take
this up as a next iteration?

Thank You.

- Sreedhar

On 2014-09-08 12:17, Bart Van Assche wrote:
> On 09/06/14 05:06, Sreedhar Kodali wrote:
>> The provided patch only covers equal distribution of completion
>> vectors among the specified cores, but it is not generic enough to
>> cover the scenario you have suggested. If it were to be the case,
>> then we would need to either
>>
>> a) alter the configuration file format to recognize CPU sockets,
>> b) introduce separate configuration files for each CPU socket, or
>> c) alter the distribution logic to recognize socket-based grouping.
>>
>> This definitely increases the complexity of the code. Not sure
>> whether this is necessary to cover most of the general use cases.
>> If so, then we can target it.
>
> Hello Sreedhar,
>
> Sorry if I wasn't clear enough. What I meant is that I think it would
> be useful to be able to configure a different completion vector mask
> per process instead of one global completion vector mask for all
> processes that use the rsockets library. Being able to configure one
> completion vector mask per process would make it possible to prevent
> one busy process from negatively impacting the performance of another
> process that uses RDMA.
>
> Bart.
On 09/08/14 16:27, Sreedhar Kodali wrote:
> Thanks for your insights. Your suggestion of having a separate vector
> per process is really useful when multiple processes are
> simultaneously running in the same environment. At the implementation
> level this requires an extended configuration file format. Can't we
> possibly take this up as a next iteration?

This patch introduces a new configuration file
(/etc/rdma/rsocket/comp_vector). Users expect that such configuration
files remain supported after a software upgrade. In other words, the
introduction of a new configuration file deserves careful consideration.
Do we really need this new configuration file? Why not, e.g., change the
default behavior of the rsockets library to spread the interrupt
processing workload over all completion vectors supported by the HCA?
Furthermore, the number of completion vectors that is supported by an
HCA not only depends on the HCA firmware but also on the number of
completion vectors the HCA could allocate during driver initialization.
The number of MSI-X vectors available on an x86-64 system is limited
(224). Dual-socket x86-64 motherboards with up to 11 PCIe slots are
available commercially. If all PCIe slots are populated it's almost
certain that one of the PCIe cards won't be assigned as many MSI-X
vectors as it could use.

Sorry, but I'm not convinced that using the same completion vector mask
for all HCAs and all processes is a good idea. Or is there perhaps
something I have overlooked?

Bart.
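As a rough illustration of the alternative default Bart describes (spreading work over every completion vector the device offers, with no configuration file at all), something along these lines could sit in the CQ-creation path. The helper name is made up and nothing like it exists in rsockets today; it only shows that the vector count can be taken from the device context itself.

#include <infiniband/verbs.h>

/*
 * Hypothetical default policy: round-robin over however many completion
 * vectors the driver actually allocated for this device, instead of
 * always passing 0 to ibv_create_cq().
 */
static int rs_pick_comp_vector(struct ibv_context *ctx)
{
        static unsigned int next_vector;        /* per-process counter */

        if (ctx->num_comp_vectors <= 1)
                return 0;
        return next_vector++ % ctx->num_comp_vectors;
}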
On 2014-09-08 23:52, Bart Van Assche wrote:
> On 09/08/14 16:27, Sreedhar Kodali wrote:
>> Thanks for your insights. Your suggestion of having a separate vector
>> per process is really useful when multiple processes are
>> simultaneously running in the same environment. At the implementation
>> level this requires an extended configuration file format. Can't we
>> possibly take this up as a next iteration?
>
> This patch introduces a new configuration file
> (/etc/rdma/rsocket/comp_vector). Users expect that such configuration
> files remain supported after a software upgrade. In other words, the
> introduction of a new configuration file deserves careful
> consideration. Do we really need this new configuration file? Why not,
> e.g., change the default behavior of the rsockets library to spread
> the interrupt processing workload over all completion vectors
> supported by the HCA? Furthermore, the number of completion vectors
> that is supported by an HCA not only depends on the HCA firmware but
> also on the number of completion vectors the HCA could allocate during
> driver initialization. The number of MSI-X vectors available on an
> x86-64 system is limited (224). Dual-socket x86-64 motherboards with
> up to 11 PCIe slots are available commercially. If all PCIe slots are
> populated it's almost certain that one of the PCIe cards won't be
> assigned as many MSI-X vectors as it could use.
>
> Sorry, but I'm not convinced that using the same completion vector
> mask for all HCAs and all processes is a good idea. Or is there
> perhaps something I have overlooked?
>
> Bart.

Hi Bart,

I agree with your suggestions - working on the revised patch accordingly.

Thank You.

- Sreedhar
On 2014-09-09 17:58, Sreedhar Kodali wrote:
> On 2014-09-08 23:52, Bart Van Assche wrote:
>> On 09/08/14 16:27, Sreedhar Kodali wrote:
>>> Thanks for your insights. Your suggestion of having a separate
>>> vector per process is really useful when multiple processes are
>>> simultaneously running in the same environment. At the
>>> implementation level this requires an extended configuration file
>>> format. Can't we possibly take this up as a next iteration?
>>
>> This patch introduces a new configuration file
>> (/etc/rdma/rsocket/comp_vector). Users expect that such configuration
>> files remain supported after a software upgrade. In other words, the
>> introduction of a new configuration file deserves careful
>> consideration. Do we really need this new configuration file? Why
>> not, e.g., change the default behavior of the rsockets library to
>> spread the interrupt processing workload over all completion vectors
>> supported by the HCA? Furthermore, the number of completion vectors
>> that is supported by an HCA not only depends on the HCA firmware but
>> also on the number of completion vectors the HCA could allocate
>> during driver initialization. The number of MSI-X vectors available
>> on an x86-64 system is limited (224). Dual-socket x86-64 motherboards
>> with up to 11 PCIe slots are available commercially. If all PCIe
>> slots are populated it's almost certain that one of the PCIe cards
>> won't be assigned as many MSI-X vectors as it could use.
>>
>> Sorry, but I'm not convinced that using the same completion vector
>> mask for all HCAs and all processes is a good idea. Or is there
>> perhaps something I have overlooked?
>>
>> Bart.
>
> Hi Bart,
>
> I agree with your suggestions - working on the revised patch
> accordingly.
>
> Thank You.
>
> - Sreedhar

Hi Bart,

I have sent the revised patch v4 that groups and assigns comp vectors
per process as you suggested. Please go through it.

Thank You.

- Sreedhar
On 09/11/14 14:34, Sreedhar Kodali wrote:
> I have sent the revised patch v4 that groups and assigns comp vectors
> per process as you suggested. Please go through it.

Shouldn't there be agreement about the approach before a patch is
reworked and reposted? I think the following aspects deserve wider
discussion, and agreement about these aspects is needed before the patch
itself is discussed further:

- Do we need to discuss a policy that defines which completion vectors
  are associated with which CPU sockets? Such a policy is needed to
  allow RDMA software to constrain RDMA completions to a single CPU
  socket and hence to avoid inter-socket cache misses. One possible
  policy is to associate an equal number of completion vectors with
  each CPU socket. If e.g. 8 completion vectors are provided by an HCA
  and two CPU sockets are available, then completion vectors 0..3 could
  be bound to the CPU socket with index 0 and vectors 4..7 could be
  bound to the CPU socket that has been assigned index 1 by the Linux
  kernel.
- Would it be useful to modify the irqbalance software such that it
  becomes aware of HCAs that provide multiple MSI-X vectors and hence
  automatically applies the policy mentioned in the previous bullet?
- What should the default behavior of the rsockets library be? Keep the
  current behavior (use completion vector 0), select one of the
  available completion vectors in a round-robin fashion, or perhaps yet
  another policy?
- The number of completion vectors provided by an HCA can change after
  a PCIe card has been added to or removed from the system. Such
  changes affect the number of bits of the completion mask that are
  relevant. How to handle this?
- If a configuration option is added in the rsockets library to specify
  which completion vectors a process is allowed to use, should it be
  possible to specify individual completion vectors or is it sufficient
  if CPU socket numbers can be specified? That last choice has the
  advantage that it is independent of the exact number of completion
  vectors that has been allocated by an HCA.
- How to cope with systems in which multiple RDMA HCAs are present and
  in which each HCA provides a different number of completion vectors?
  Is a completion vector bitmask a proper means for such systems to
  specify which completion vectors should be used?
- Do we need to treat virtual machine guests and CPU hot-plugging
  separately, or can we rely on the information about CPU sockets that
  is provided by the hypervisor to the guest?

Bart.
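For the first bullet in the message above, the arithmetic of such a per-socket policy is simple enough to sketch. Everything below is illustrative only (how the socket count is discovered, e.g. by inspecting /sys/devices/system/node, is not shown), but it reproduces Bart's 8-vector, 2-socket example.

/*
 * Illustrative per-socket split: divide an HCA's completion vectors
 * evenly across CPU sockets, giving any remainder to the last socket.
 * With 8 vectors and 2 sockets this yields 0..3 for socket 0 and
 * 4..7 for socket 1.
 */
struct vector_range {
        int first;
        int last;       /* inclusive */
};

static struct vector_range vectors_for_socket(int num_comp_vectors,
                                              int num_sockets, int socket)
{
        struct vector_range r;
        int per_socket = num_comp_vectors / num_sockets;

        r.first = socket * per_socket;
        r.last = (socket == num_sockets - 1) ?
                 num_comp_vectors - 1 : r.first + per_socket - 1;
        return r;
}

A process that knows which socket it is pinned to would then restrict its CQ creation to the vectors in its own range.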
Hi Bart,

Thanks for your detailed thoughts and insights into comp vector
assignment. As you rightly pointed out, let's hear from the wider
community as well before we attempt another iteration on the patch.
I have included below some details about the latest patch to understand
where we are currently.

On 2014-09-15 14:36, Bart Van Assche wrote:
> On 09/11/14 14:34, Sreedhar Kodali wrote:
>> I have sent the revised patch v4 that groups and assigns comp vectors
>> per process as you suggested. Please go through it.
>
> Shouldn't there be agreement about the approach before a patch is
> reworked and reposted? I think the following aspects deserve wider
> discussion, and agreement about these aspects is needed before the
> patch itself is discussed further:

Absolutely.

> - Do we need to discuss a policy that defines which completion vectors
>   are associated with which CPU sockets? Such a policy is needed to
>   allow RDMA software to constrain RDMA completions to a single CPU
>   socket and hence to avoid inter-socket cache misses. One possible
>   policy is to associate an equal number of completion vectors with
>   each CPU socket. If e.g. 8 completion vectors are provided by an HCA
>   and two CPU sockets are available, then completion vectors 0..3
>   could be bound to the CPU socket with index 0 and vectors 4..7 could
>   be bound to the CPU socket that has been assigned index 1 by the
>   Linux kernel.
> - Would it be useful to modify the irqbalance software such that it
>   becomes aware of HCAs that provide multiple MSI-X vectors and hence
>   automatically applies the policy mentioned in the previous bullet?

Having a policy-based approach is good. But we need to explore where in
the OFED stack this policy can be specified and enforced. Not sure
rsockets would be the right place to hold policy-based extensions, as it
is simply an abstraction layer on top of the rdmacm library.

> - What should the default behavior of the rsockets library be? Keep
>   the current behavior (use completion vector 0), select one of the
>   available completion vectors in a round-robin fashion, or perhaps
>   yet another policy?

Keep the current behavior if the user has not specified any option.

> - The number of completion vectors provided by an HCA can change after
>   a PCIe card has been added to or removed from the system. Such
>   changes affect the number of bits of the completion mask that are
>   relevant. How to handle this?

The completion mask based approach has been dropped in favor of storing
the values of the completion vectors.

> - If a configuration option is added in the rsockets library to
>   specify which completion vectors a process is allowed to use, should
>   it be possible to specify individual completion vectors or is it
>   sufficient if CPU socket numbers can be specified? That last choice
>   has the advantage that it is independent of the exact number of
>   completion vectors that has been allocated by an HCA.

Specify individual completion vectors through the config option. This is
on the premise that the user is aware of the allocation.

> - How to cope with systems in which multiple RDMA HCAs are present and
>   in which each HCA provides a different number of completion vectors?
>   Is a completion vector bitmask a proper means for such systems to
>   specify which completion vectors should be used?

As mentioned above, the bitmask based approach has been done away with
in favor of absolute values in the latest v4 patch.

> - Do we need to treat virtual machine guests and CPU hot-plugging
>   separately, or can we rely on the information about CPU sockets that
>   is provided by the hypervisor to the guest?
>
> Bart.

Thank You.

- Sreedhar
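The v4 patch itself is not included in this thread, so the exact shape of "storing the values of completion vectors" is a guess. One plausible form is an array of explicit vector indices, which also removes the 64-vector ceiling that a uint64_t mask imposes; all names and sizes below are invented for illustration.

#include <stdlib.h>
#include <string.h>

#define RS_MAX_COMP_VECTORS 256 /* arbitrary illustrative limit */

static int comp_vectors[RS_MAX_COMP_VECTORS];
static int num_comp_vectors_cfg;

/*
 * Parse a comma/whitespace separated list such as "0,1,2,3,8" into an
 * array of vector indices instead of setting bits in a 64-bit mask.
 */
static void rs_parse_comp_vectors(char *line)
{
        char *tok, *save;

        for (tok = strtok_r(line, ", \n", &save); tok;
             tok = strtok_r(NULL, ", \n", &save)) {
                if (num_comp_vectors_cfg == RS_MAX_COMP_VECTORS)
                        break;
                comp_vectors[num_comp_vectors_cfg++] = atoi(tok);
        }
}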
> Thanks for your detailed thoughts and insights into comp vector
> assignment. As you rightly pointed out, let's hear from the wider
> community as well before we attempt another iteration on the patch.
> I have included below some details about the latest patch to
> understand where we are currently.

Are there any similar mechanisms defined elsewhere that we can use as
guidance?

I don't recall, does completion vector 0 refer to a specific vector or
does it mean use a default assignment?

- Sean
On 2014-09-16 10:27, Hefty, Sean wrote:
>> Thanks for your detailed thoughts and insights into comp vector
>> assignment. As you rightly pointed out, let's hear from the wider
>> community as well before we attempt another iteration on the patch.
>> I have included below some details about the latest patch to
>> understand where we are currently.
>
> Are there any similar mechanisms defined elsewhere that we can use as
> guidance?
>
> I don't recall, does completion vector 0 refer to a specific vector or
> does it mean use a default assignment?
>
> - Sean

Hi Sean,

Not sure about similar mechanisms, but some references to irq balancing
are available on the net. I think comp vector 0 is a specific vector
that is used by default.

Thank You.

- Sreedhar
diff --git a/src/rsocket.c b/src/rsocket.c
index b70d56a..ffea0ca 100644
--- a/src/rsocket.c
+++ b/src/rsocket.c
@@ -116,6 +116,8 @@ static uint32_t def_mem = (1 << 17);
 static uint32_t def_wmem = (1 << 17);
 static uint32_t polling_time = 10;
 static uint16_t restart_onintr = 0;
+static uint16_t next_comp_vector = 0;
+static uint64_t comp_vector_mask = 0;
 
 /*
  * Immediate data format is determined by the upper bits
@@ -548,6 +550,37 @@ void rs_configure(void)
                 (void) fscanf(f, "%hu", &restart_onintr);
                 fclose(f);
         }
+
+        if ((f = fopen(RS_CONF_DIR "/comp_vector", "r"))) {
+                char vbuf[256];
+                char *vptr;
+                vptr = fgets(vbuf, sizeof(vbuf), f);
+                fclose(f);
+                if (vptr) {
+                        char *tok, *save, *tmp, *str, *tok2;
+                        int lvect, uvect, vect;
+
+                        for (str = vptr; ; str = NULL) {
+                                tok = strtok_r(str, ",", &save);
+                                if (tok == NULL) {
+                                        break;
+                                }
+                                if (!(tmp = strpbrk(tok, "-"))) {
+                                        lvect = uvect = atoi(tok);
+                                } else {
+                                        tok2 = tmp + 1;
+                                        *tmp = '\0';
+                                        lvect = atoi(tok);
+                                        uvect = atoi(tok2);
+                                }
+                                lvect = (lvect < 0) ? 0 : ((lvect > 63) ? 63 : lvect);
+                                uvect = (uvect < 0) ? 0 : ((uvect > 63) ? 63 : uvect);
+                                for (vect = lvect; vect <= uvect; vect++) {
+                                        comp_vector_mask |= ((uint64_t)1 << vect);
+                                }
+                        }
+                }
+        }
         init = 1;
 out:
         pthread_mutex_unlock(&mut);
@@ -762,12 +795,27 @@ static int ds_init_bufs(struct ds_qp *qp)
  */
 static int rs_create_cq(struct rsocket *rs, struct rdma_cm_id *cm_id)
 {
+        int vector = 0;
+
         cm_id->recv_cq_channel = ibv_create_comp_channel(cm_id->verbs);
         if (!cm_id->recv_cq_channel)
                 return -1;
 
+        if (comp_vector_mask) {
+                int found = 0;
+                while (found == 0) {
+                        if (comp_vector_mask & ((uint64_t) 1 << next_comp_vector)) {
+                                found = 1;
+                                vector = next_comp_vector;
+                        }
+                        if (++next_comp_vector == 64) {
+                                next_comp_vector = 0;
+                        }
+                }
+        }
+
         cm_id->recv_cq = ibv_create_cq(cm_id->verbs, rs->sq_size + rs->rq_size,
-                                       cm_id, cm_id->recv_cq_channel, 0);
+                                       cm_id, cm_id->recv_cq_channel, vector);
         if (!cm_id->recv_cq)
                 goto err1;
 
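As a usage illustration only (this harness is not part of the patch), the standalone program below mimics the two pieces above: it builds comp_vector_mask from a sample "0-3,8" configuration line, then hands out vectors to successive CQ creations the way rs_create_cq() would, printing 0, 1, 2, 3, 8 and then wrapping back to 0.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        char line[] = "0-3,8";         /* sample /comp_vector content */
        uint64_t mask = 0;
        unsigned int next = 0;
        char *tok, *save, *dash;
        int lo, hi, v, cq;

        /* Build the mask the same way rs_configure() does. */
        for (tok = strtok_r(line, ",", &save); tok;
             tok = strtok_r(NULL, ",", &save)) {
                if ((dash = strchr(tok, '-'))) {
                        *dash = '\0';
                        lo = atoi(tok);
                        hi = atoi(dash + 1);
                } else {
                        lo = hi = atoi(tok);
                }
                for (v = lo; v <= hi && v < 64; v++)
                        mask |= (uint64_t)1 << v;
        }

        /* Six CQ creations pick vectors 0 1 2 3 8 0, round-robin. */
        for (cq = 0; cq < 6; cq++) {
                while (!(mask & ((uint64_t)1 << next)))
                        next = (next + 1) % 64;
                printf("CQ %d -> completion vector %u\n", cq, next);
                next = (next + 1) % 64;
        }
        return 0;
}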