Message ID | 1525360767-12985-1-git-send-email-rzambre@uci.edu (mailing list archive)
---|---
State | RFC
On Thu, May 03, 2018 at 03:19:27PM +0000, Rohit Zambre wrote: > An independent communication path is one that shares no hardware resources > with other communication paths. From a Verbs perspective, an independent > path is the one obtained by the first QP in a context. The next QPs of the > context may or may not share hardware resources amongst themselves; the > mapping of the resources to the QPs is provider-specific. Sharing resources > can hurt throughput in certain cases. When only one thread uses the > independent path, we term it an uncontended independent path. > > Today, the user has no way to request for an independent path for an > arbitrary QP within a context. To create multiple independent paths, the > Verbs user must create mulitple contexts with 1 QP per context. However, > this translates to significant hardware-resource wastage: 89% in the case > of the ConnectX-4 mlx5 device. > > This RFC patch allows the user to request for uncontended independent > communication paths in Verbs through an "independent" flag during Thread > Domain (TD) creation. The patch also provides a first-draft implementation > of uncontended independent paths in the mlx5 provider. > > In mlx5, every even-odd pair of TDs share the same UAR page, which is not > case when the user creates multiple contexts with one TD per context. When > the user requests for an independent TD, the driver will dynamically > allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 of > the UAR belonging to an independent TD is never used and is essentially > wasted. Hence, there must be a maximum number of independent paths allowed > within a context since the hardware resources are limited. This would be > half of the maximum number of dynamic UARs allowed per context. > > Signed-off-by: Rohit Zambre <rzambre@uci.edu> > libibverbs/verbs.h | 1 + > providers/mlx5/mlx5.c | 3 +++ > providers/mlx5/mlx5.h | 2 ++ > providers/mlx5/verbs.c | 51 +++++++++++++++++++++++++++++++++++--------------- > 4 files changed, 42 insertions(+), 15 deletions(-) > > diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h > index eb57824..b5fa56f 100644 > +++ b/libibverbs/verbs.h > @@ -561,6 +561,7 @@ struct ibv_pd { > }; > > struct ibv_td_init_attr { > + int independent; > uint32_t comp_mask; > }; This isn't OK, it breaks the ABI. Needs to be after comp_mask and you need to introduce a comp mask flag that says the data is present. Also needs a man page update to describe the new functionality. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
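For readers unfamiliar with the convention Jason is pointing to, here is a minimal sketch of the shape he is asking for: the new member goes after comp_mask, and a comp_mask bit announces that it is present. The *_sketch names and the flag value are illustrative assumptions, not an accepted libibverbs interface.

#include <stdint.h>

/* Sketch of the comp_mask extension pattern; names are illustrative only. */
enum ibv_td_init_attr_mask_sketch {
	IBV_TD_INIT_ATTR_INDEPENDENT = 1 << 0,	/* hypothetical flag */
};

struct ibv_td_init_attr_sketch {
	uint32_t comp_mask;	/* existing member keeps its offset */
	uint32_t independent;	/* new member appended after it */
};

/* The caller advertises the new member through the mask bit, so a
 * provider that does not understand the bit can detect and reject it
 * instead of silently reading garbage. */
static void request_independent_td(struct ibv_td_init_attr_sketch *attr)
{
	attr->comp_mask  |= IBV_TD_INIT_ATTR_INDEPENDENT;
	attr->independent = 1;
}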
On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: > An independent communication path is one that shares no hardware resources > with other communication paths. From a Verbs perspective, an independent > path is the one obtained by the first QP in a context. The next QPs of the > context may or may not share hardware resources amongst themselves; the > mapping of the resources to the QPs is provider-specific. Sharing resources > can hurt throughput in certain cases. When only one thread uses the > independent path, we term it an uncontended independent path. > > Today, the user has no way to request for an independent path for an > arbitrary QP within a context. To create multiple independent paths, the > Verbs user must create mulitple contexts with 1 QP per context. However, > this translates to significant hardware-resource wastage: 89% in the case > of the ConnectX-4 mlx5 device. > > This RFC patch allows the user to request for uncontended independent > communication paths in Verbs through an "independent" flag during Thread > Domain (TD) creation. The patch also provides a first-draft implementation > of uncontended independent paths in the mlx5 provider. > > In mlx5, every even-odd pair of TDs share the same UAR page, which is not > case when the user creates multiple contexts with one TD per context. When > the user requests for an independent TD, the driver will dynamically > allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 of > the UAR belonging to an independent TD is never used and is essentially > wasted. Hence, there must be a maximum number of independent paths allowed > within a context since the hardware resources are limited. This would be > half of the maximum number of dynamic UARs allowed per context.

I'm not sure I follow what you're trying to achieve here on the mlx5 HW level. Are you assuming that two threads with separate 'indep-comm-paths' using separate bfregs on the same UAR page cause some contention and performance hit in the mlx5 HW? We should first prove that's true, and then design a solution to solve it. Do you have benchmark results of any kind?

When you create two separate ibv_contexts you will separate a lot more than just the UAR pages on which the bfregs are mapped. The entire software locking scheme is separated.

The ibv_td object allows the user to separate resources so that locks could be managed in a smarter way in the provider lib data fast path. For that we allocate a bfreg for each ibv_td obj. Using a dedicated bfreg allows lower-latency sends, as the doorbell does not need a lock to write the even/odd entries. At the time we did not extend the work to cover additional locks in mlx5, but it seems your series is targeting something else.

Alex
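For context, the per-TD bfreg Alex describes is reached today by creating the QP on a parent domain that wraps the thread domain. A minimal sketch, with arbitrary QP capacities and most error handling omitted:

#include <string.h>
#include <infiniband/verbs.h>

/* Allocate a TD, wrap it together with a PD in a parent domain, and
 * create the QP on that parent domain so the QP uses the TD's bfreg. */
static struct ibv_qp *create_qp_on_td(struct ibv_context *ctx,
				      struct ibv_pd *pd, struct ibv_cq *cq)
{
	struct ibv_td_init_attr td_attr = { .comp_mask = 0 };
	struct ibv_td *td = ibv_alloc_td(ctx, &td_attr);
	struct ibv_parent_domain_init_attr pd_attr;
	struct ibv_pd *parent_pd;
	struct ibv_qp_init_attr qp_attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap	 = { .max_send_wr = 64, .max_recv_wr = 64,
			     .max_send_sge = 1, .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RC,
	};

	if (!td)
		return NULL;

	memset(&pd_attr, 0, sizeof(pd_attr));
	pd_attr.pd = pd;
	pd_attr.td = td;
	parent_pd = ibv_alloc_parent_domain(ctx, &pd_attr);
	if (!parent_pd)
		return NULL;

	/* The QP inherits the TD's dedicated bfreg, which is what lets the
	 * provider drop the doorbell lock on this QP's send path. */
	return ibv_create_qp(parent_pd, &qp_attr);
}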
On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@gmail.com> wrote: > On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: >> An independent communication path is one that shares no hardware resources >> with other communication paths. From a Verbs perspective, an independent >> path is the one obtained by the first QP in a context. The next QPs of the >> context may or may not share hardware resources amongst themselves; the >> mapping of the resources to the QPs is provider-specific. Sharing resources >> can hurt throughput in certain cases. When only one thread uses the >> independent path, we term it an uncontended independent path. >> >> Today, the user has no way to request for an independent path for an >> arbitrary QP within a context. To create multiple independent paths, the >> Verbs user must create mulitple contexts with 1 QP per context. However, >> this translates to significant hardware-resource wastage: 89% in the case >> of the ConnectX-4 mlx5 device. >> >> This RFC patch allows the user to request for uncontended independent >> communication paths in Verbs through an "independent" flag during Thread >> Domain (TD) creation. The patch also provides a first-draft implementation >> of uncontended independent paths in the mlx5 provider. >> >> In mlx5, every even-odd pair of TDs share the same UAR page, which is not >> case when the user creates multiple contexts with one TD per context. When >> the user requests for an independent TD, the driver will dynamically >> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 of >> the UAR belonging to an independent TD is never used and is essentially >> wasted. Hence, there must be a maximum number of independent paths allowed >> within a context since the hardware resources are limited. This would be >> half of the maximum number of dynamic UARs allowed per context. > > I'm not sure I follow what you're trying to achieve here on the mlx5 HW level. > Are you assuming that two threads with seperate 'indep-comm-paths' > using separate bfreg on the same UAR page causes some contention and > performance hit in the mlx5 HW? > We should first prove that's true, and then design a solution to solve it. > Do you have benchmark results of any kind? Yes, there is a ~20% drop in message rates when there are concurrent BlueFlame writes to separate bfregs on the same UAR page. The graph attached reports message rates using rdma-core for 2-byte RDMA-writes using 16 threads. Each thread is driving its own QP. Each thread has its own CQ. Thread Domains are not used in this benchmark. The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame" and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2" means the size of the linked-list of WQEs is 2. "woPostlist" means each thread is posting only 1 WQE per ibv_post_send. These numbers are on a ConnectX-4 mlx5 device (on the Gomez machine of JLSE). The numbers are the same on the ConnectX-4 device on the Thor cluster of the HPC Advisory Council. The behavior with MOFED is the same with slight differences in absolute numbers; the drop is ~15%. The first drop in the green line is due to concurrent BlueFlame writes on the same UAR page. The second drop is due to bfreg lock contention between the 5th and the 16th QP. With a postlist size greater than 1, rdma-core does only 64-bit DoorBells. Concurrent Doorbells don't hurt. Concurrent BlueFlame writes do. 
Exactly what is causing this, I am not sure. But from some more experimenting, I think the answer lies in how the NIC finds out whether to fetch the WQE from the BlueFlame buffer or DMA-read it from memory. I wasn't able to find a "bit" that was set during WQE preparation that tells the NIC where to read from. But it could be something else entirely. We are addressing the green line with this patch.

> When you create two seperate ibv_context you will separate a lot more > then just the UAR pages on which the bfreg are mapped. Ehe entier > software locking scheme is separated.

Right. In the description, I wanted to emphasize the independent-path aspect of different contexts since that is most important to the MPI library. The locking can be controlled through Thread Domains.

> The ibv_td object allows the user to separate resources so that locks > could be managed in a smarter way in the provider lib data fast path. > For that we allocate a bfreg for each ibv_td obj. Using a dedicated > bfreg allows lower latency sends, as the doorbell does not need a lock > to write the even/odd entries. > At the time we did not extend the work to cover additional locks in > mlx5. but it seems your series is targeting something else.

If you are referring to [1], then that patch targets just disabling the QP lock when a Thread Domain is specified. To create an independent software path, the MPI library will use the Thread Domain.

[1] https://patchwork.kernel.org/patch/10367419/
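To make the measurement concrete, the following is a hedged sketch of the kind of per-thread critical path being described (2-byte RDMA writes, one WQE or a postlist of two per ibv_post_send). It is not the attached benchmark; buf, mr, remote_addr, and rkey are placeholders supplied by the caller, and postlist is assumed to be 1 or 2.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int post_tiny_writes(struct ibv_qp *qp, char *buf, struct ibv_mr *mr,
			    uint64_t remote_addr, uint32_t rkey, int postlist)
{
	struct ibv_sge sge = {
		.addr	= (uintptr_t)buf,
		.length	= 2,		/* 2-byte RDMA write */
		.lkey	= mr->lkey,
	};
	struct ibv_send_wr wr[2], *bad_wr;
	int i;

	memset(wr, 0, sizeof(wr));
	for (i = 0; i < postlist; i++) {
		wr[i].opcode		  = IBV_WR_RDMA_WRITE;
		wr[i].sg_list		  = &sge;
		wr[i].num_sge		  = 1;
		wr[i].send_flags	  = (i == postlist - 1) ? IBV_SEND_SIGNALED : 0;
		wr[i].wr.rdma.remote_addr = remote_addr;
		wr[i].wr.rdma.rkey	  = rkey;
		wr[i].next		  = (i == postlist - 1) ? NULL : &wr[i + 1];
	}

	/* With one WQE per call the mlx5 provider can ring the doorbell with
	 * a BlueFlame write; with a chained postlist it issues only a 64-bit
	 * DoorBell, which is the distinction drawn in the discussion above. */
	return ibv_post_send(qp, wr, &bad_wr);
}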
On 5/4/2018 12:46 AM, Rohit Zambre wrote: > On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@gmail.com> wrote: >> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: >>> An independent communication path is one that shares no hardware resources >>> with other communication paths. From a Verbs perspective, an independent >>> path is the one obtained by the first QP in a context. The next QPs of the >>> context may or may not share hardware resources amongst themselves; the >>> mapping of the resources to the QPs is provider-specific. Sharing resources >>> can hurt throughput in certain cases. When only one thread uses the >>> independent path, we term it an uncontended independent path. >>> >>> Today, the user has no way to request for an independent path for an >>> arbitrary QP within a context. To create multiple independent paths, the >>> Verbs user must create mulitple contexts with 1 QP per context. However, >>> this translates to significant hardware-resource wastage: 89% in the case >>> of the ConnectX-4 mlx5 device. >>> >>> This RFC patch allows the user to request for uncontended independent >>> communication paths in Verbs through an "independent" flag during Thread >>> Domain (TD) creation. The patch also provides a first-draft implementation >>> of uncontended independent paths in the mlx5 provider. >>> >>> In mlx5, every even-odd pair of TDs share the same UAR page, which is not >>> case when the user creates multiple contexts with one TD per context. When >>> the user requests for an independent TD, the driver will dynamically >>> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 of >>> the UAR belonging to an independent TD is never used and is essentially >>> wasted. Hence, there must be a maximum number of independent paths allowed >>> within a context since the hardware resources are limited. This would be >>> half of the maximum number of dynamic UARs allowed per context. >> >> I'm not sure I follow what you're trying to achieve here on the mlx5 HW level. >> Are you assuming that two threads with seperate 'indep-comm-paths' >> using separate bfreg on the same UAR page causes some contention and >> performance hit in the mlx5 HW? >> We should first prove that's true, and then design a solution to solve it. >> Do you have benchmark results of any kind? > > Yes, there is a ~20% drop in message rates when there are concurrent > BlueFlame writes to separate bfregs on the same UAR page. Can you please share your test code to help us make sure that you are really referring to the above case with the below analysis ? > The graph attached reports message rates using rdma-core for 2-byte > RDMA-writes using 16 threads. Each thread is driving its own QP. Each > thread has its own CQ. Thread Domains are not used in this benchmark. Can you try to use in your test TDs and see if you get the same results before your initial patch ? this mode cleanly guarantees the 1<->1 UAR bfreg to a QP. > The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing > means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame" > and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2" > means the size of the linked-list of WQEs is 2. Looking at your graph, the best results are wPostlist2-wBF, correct ? but in that case we don't expect BF at all but DB as you wrote below. Can you please clarify the test and the results that are represented here ? > "woPostlist" means > each thread is posting only 1 WQE per ibv_post_send. 
These numbers are > on a ConnectX-4 mlx5 device (on the Gomez machine of JLSE). The > numbers are the same on the ConnectX-4 device on the Thor cluster of > the HPC Advisory Council. The behavior with MOFED is the same with > slight differences in absolute numbers; the drop is ~15%. > > The first drop in the green line is due to concurrent BlueFlame writes > on the same UAR page. The second drop is due to bfreg lock contention > between the 5th and the 16th QP. With a postlist size greater than 1, > rdma-core does only 64-bit DoorBells. Concurrent Doorbells don't hurt. > Concurrent BlueFlame writes do. What is exactly causing this, I am not > sure. But from some more experimenting, I think the answer lies in how > the NIC finds out whether to fetch the WQE from the BlueFlame buffer > or DMA-read it from memory. I wasn't able to find a "bit" that was set > during WQE preparation that tells the NIC where to read from. But it > could be something else entirely.. > > We are addressing the green line with this patch. > >> When you create two seperate ibv_context you will separate a lot more >> then just the UAR pages on which the bfreg are mapped. Ehe entier >> software locking scheme is separated. > > Right. In the description, I wanted to emphasize the independent path > aspect of different contexts since that is most important to the MPI > library. The locking can be controlled through Thread Domains. > >> The ibv_td object allows the user to separate resources so that locks >> could be managed in a smarter way in the provider lib data fast path. >> For that we allocate a bfreg for each ibv_td obj. Using a dedicated >> bfreg allows lower latency sends, as the doorbell does not need a lock >> to write the even/odd entries. >> At the time we did not extend the work to cover additional locks in >> mlx5. but it seems your series is targeting something else. > > If you are referring to [1], then that patch is targeting just to > disable QP-lock if a Thread Domain is specified. To create an > independent software path, the MPI library will use the Thread Domain. > > [1] https://patchwork.kernel.org/patch/10367419/ > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> wrote: > On 5/4/2018 12:46 AM, Rohit Zambre wrote: >> >> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@gmail.com> >> wrote: >>> >>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: >>>> >>>> An independent communication path is one that shares no hardware >>>> resources >>>> with other communication paths. From a Verbs perspective, an independent >>>> path is the one obtained by the first QP in a context. The next QPs of >>>> the >>>> context may or may not share hardware resources amongst themselves; the >>>> mapping of the resources to the QPs is provider-specific. Sharing >>>> resources >>>> can hurt throughput in certain cases. When only one thread uses the >>>> independent path, we term it an uncontended independent path. >>>> >>>> Today, the user has no way to request for an independent path for an >>>> arbitrary QP within a context. To create multiple independent paths, the >>>> Verbs user must create mulitple contexts with 1 QP per context. However, >>>> this translates to significant hardware-resource wastage: 89% in the >>>> case >>>> of the ConnectX-4 mlx5 device. >>>> >>>> This RFC patch allows the user to request for uncontended independent >>>> communication paths in Verbs through an "independent" flag during Thread >>>> Domain (TD) creation. The patch also provides a first-draft >>>> implementation >>>> of uncontended independent paths in the mlx5 provider. >>>> >>>> In mlx5, every even-odd pair of TDs share the same UAR page, which is >>>> not >>>> case when the user creates multiple contexts with one TD per context. >>>> When >>>> the user requests for an independent TD, the driver will dynamically >>>> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 >>>> of >>>> the UAR belonging to an independent TD is never used and is essentially >>>> wasted. Hence, there must be a maximum number of independent paths >>>> allowed >>>> within a context since the hardware resources are limited. This would be >>>> half of the maximum number of dynamic UARs allowed per context. >>> >>> >>> I'm not sure I follow what you're trying to achieve here on the mlx5 HW >>> level. >>> Are you assuming that two threads with seperate 'indep-comm-paths' >>> using separate bfreg on the same UAR page causes some contention and >>> performance hit in the mlx5 HW? >>> We should first prove that's true, and then design a solution to solve >>> it. >>> Do you have benchmark results of any kind? >> >> >> Yes, there is a ~20% drop in message rates when there are concurrent >> BlueFlame writes to separate bfregs on the same UAR page. > > > Can you please share your test code to help us make sure that you are really > referring to the above case with the below analysis ? I have attached my benchmark code. The critical path of interest is lines 554-598. In the README, I have included an example of how to run the benchmark. Let me know if you have any questions/concerns regarding the benchmark code. >> The graph attached reports message rates using rdma-core for 2-byte >> RDMA-writes using 16 threads. Each thread is driving its own QP. Each >> thread has its own CQ. Thread Domains are not used in this benchmark. > > > Can you try to use in your test TDs and see if you get the same results > before your initial patch ? this mode cleanly guarantees the 1<->1 UAR bfreg > to a QP. I will most likely have the numbers with TDs in the second half of the week. I will report them here then. 
>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing >> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame" >> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2" >> means the size of the linked-list of WQEs is 2. > > > Looking at your graph, the best results are wPostlist2-wBF, correct ? but in > that case we don't expect BF at all but DB as you wrote below. Can you > please clarify the test and the results that are represented here ? Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I included the various lines to show differences in behavior. The semantics of Verbs-users may or may not allow the use of features such as postlist. > >> "woPostlist" means >> each thread is posting only 1 WQE per ibv_post_send. These numbers are >> on a ConnectX-4 mlx5 device (on the Gomez machine of JLSE). The >> numbers are the same on the ConnectX-4 device on the Thor cluster of >> the HPC Advisory Council. The behavior with MOFED is the same with >> slight differences in absolute numbers; the drop is ~15%. >> >> The first drop in the green line is due to concurrent BlueFlame writes >> on the same UAR page. The second drop is due to bfreg lock contention >> between the 5th and the 16th QP. With a postlist size greater than 1, >> rdma-core does only 64-bit DoorBells. Concurrent Doorbells don't hurt. >> Concurrent BlueFlame writes do. What is exactly causing this, I am not >> sure. But from some more experimenting, I think the answer lies in how >> the NIC finds out whether to fetch the WQE from the BlueFlame buffer >> or DMA-read it from memory. I wasn't able to find a "bit" that was set >> during WQE preparation that tells the NIC where to read from. But it >> could be something else entirely.. >> >> We are addressing the green line with this patch. >> >>> When you create two seperate ibv_context you will separate a lot more >>> then just the UAR pages on which the bfreg are mapped. Ehe entier >>> software locking scheme is separated. >> >> >> Right. In the description, I wanted to emphasize the independent path >> aspect of different contexts since that is most important to the MPI >> library. The locking can be controlled through Thread Domains. >> >>> The ibv_td object allows the user to separate resources so that locks >>> could be managed in a smarter way in the provider lib data fast path. >>> For that we allocate a bfreg for each ibv_td obj. Using a dedicated >>> bfreg allows lower latency sends, as the doorbell does not need a lock >>> to write the even/odd entries. >>> At the time we did not extend the work to cover additional locks in >>> mlx5. but it seems your series is targeting something else. >> >> >> If you are referring to [1], then that patch is targeting just to >> disable QP-lock if a Thread Domain is specified. To create an >> independent software path, the MPI library will use the Thread Domain. >> >> [1] https://patchwork.kernel.org/patch/10367419/ >> >
On 5/7/2018 7:26 AM, Rohit Zambre wrote: > On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> wrote: >> On 5/4/2018 12:46 AM, Rohit Zambre wrote: >>> >>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@gmail.com> >>> wrote: >>>> >>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: >>>>> >>>>> An independent communication path is one that shares no hardware >>>>> resources >>>>> with other communication paths. From a Verbs perspective, an independent >>>>> path is the one obtained by the first QP in a context. The next QPs of >>>>> the >>>>> context may or may not share hardware resources amongst themselves; the >>>>> mapping of the resources to the QPs is provider-specific. Sharing >>>>> resources >>>>> can hurt throughput in certain cases. When only one thread uses the >>>>> independent path, we term it an uncontended independent path. >>>>> >>>>> Today, the user has no way to request for an independent path for an >>>>> arbitrary QP within a context. To create multiple independent paths, the >>>>> Verbs user must create mulitple contexts with 1 QP per context. However, >>>>> this translates to significant hardware-resource wastage: 89% in the >>>>> case >>>>> of the ConnectX-4 mlx5 device. >>>>> >>>>> This RFC patch allows the user to request for uncontended independent >>>>> communication paths in Verbs through an "independent" flag during Thread >>>>> Domain (TD) creation. The patch also provides a first-draft >>>>> implementation >>>>> of uncontended independent paths in the mlx5 provider. >>>>> >>>>> In mlx5, every even-odd pair of TDs share the same UAR page, which is >>>>> not >>>>> case when the user creates multiple contexts with one TD per context. >>>>> When >>>>> the user requests for an independent TD, the driver will dynamically >>>>> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 >>>>> of >>>>> the UAR belonging to an independent TD is never used and is essentially >>>>> wasted. Hence, there must be a maximum number of independent paths >>>>> allowed >>>>> within a context since the hardware resources are limited. This would be >>>>> half of the maximum number of dynamic UARs allowed per context. >>>> >>>> >>>> I'm not sure I follow what you're trying to achieve here on the mlx5 HW >>>> level. >>>> Are you assuming that two threads with seperate 'indep-comm-paths' >>>> using separate bfreg on the same UAR page causes some contention and >>>> performance hit in the mlx5 HW? >>>> We should first prove that's true, and then design a solution to solve >>>> it. >>>> Do you have benchmark results of any kind? >>> >>> >>> Yes, there is a ~20% drop in message rates when there are concurrent >>> BlueFlame writes to separate bfregs on the same UAR page. >> >> >> Can you please share your test code to help us make sure that you are really >> referring to the above case with the below analysis ? > > I have attached my benchmark code. The critical path of interest is > lines 554-598. In the README, I have included an example of how to run > the benchmark. Let me know if you have any questions/concerns > regarding the benchmark code. > >>> The graph attached reports message rates using rdma-core for 2-byte >>> RDMA-writes using 16 threads. Each thread is driving its own QP. Each >>> thread has its own CQ. Thread Domains are not used in this benchmark. >> >> >> Can you try to use in your test TDs and see if you get the same results >> before your initial patch ? 
this mode cleanly guarantees the 1<->1 UAR bfreg >> to a QP. > > I will most likely have the numbers with TDs in the second half of the > week. I will report them here then. >

Yes, please share your results with TDs, with and without your patch; this may help clarify the issue.

>>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing >>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame" >>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2" >>> means the size of the linked-list of WQEs is 2. >> >> >> Looking at your graph, the best results are wPostlist2-wBF, correct ? but in >> that case we don't expect BF at all but DB as you wrote below. Can you >> please clarify the test and the results that are represented here ? > > Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I > included the various lines to show differences in behavior. The > semantics of Verbs-users may or may not allow the use of features such > as postlist. >

So what in the graph reflects the improvements from your initial patch? The green line was a DB test, not a BF result.
On Mon, May 7, 2018 at 10:24 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> wrote: > On 5/7/2018 7:26 AM, Rohit Zambre wrote: >> >> On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> >> wrote: >>> >>> On 5/4/2018 12:46 AM, Rohit Zambre wrote: >>>> >>>> >>>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@gmail.com> >>>> wrote: >>>>> >>>>> >>>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: >>>>>> >>>>>> >>>>>> An independent communication path is one that shares no hardware >>>>>> resources >>>>>> with other communication paths. From a Verbs perspective, an >>>>>> independent >>>>>> path is the one obtained by the first QP in a context. The next QPs of >>>>>> the >>>>>> context may or may not share hardware resources amongst themselves; >>>>>> the >>>>>> mapping of the resources to the QPs is provider-specific. Sharing >>>>>> resources >>>>>> can hurt throughput in certain cases. When only one thread uses the >>>>>> independent path, we term it an uncontended independent path. >>>>>> >>>>>> Today, the user has no way to request for an independent path for an >>>>>> arbitrary QP within a context. To create multiple independent paths, >>>>>> the >>>>>> Verbs user must create mulitple contexts with 1 QP per context. >>>>>> However, >>>>>> this translates to significant hardware-resource wastage: 89% in the >>>>>> case >>>>>> of the ConnectX-4 mlx5 device. >>>>>> >>>>>> This RFC patch allows the user to request for uncontended independent >>>>>> communication paths in Verbs through an "independent" flag during >>>>>> Thread >>>>>> Domain (TD) creation. The patch also provides a first-draft >>>>>> implementation >>>>>> of uncontended independent paths in the mlx5 provider. >>>>>> >>>>>> In mlx5, every even-odd pair of TDs share the same UAR page, which is >>>>>> not >>>>>> case when the user creates multiple contexts with one TD per context. >>>>>> When >>>>>> the user requests for an independent TD, the driver will dynamically >>>>>> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 >>>>>> of >>>>>> the UAR belonging to an independent TD is never used and is >>>>>> essentially >>>>>> wasted. Hence, there must be a maximum number of independent paths >>>>>> allowed >>>>>> within a context since the hardware resources are limited. This would >>>>>> be >>>>>> half of the maximum number of dynamic UARs allowed per context. >>>>> >>>>> >>>>> >>>>> I'm not sure I follow what you're trying to achieve here on the mlx5 HW >>>>> level. >>>>> Are you assuming that two threads with seperate 'indep-comm-paths' >>>>> using separate bfreg on the same UAR page causes some contention and >>>>> performance hit in the mlx5 HW? >>>>> We should first prove that's true, and then design a solution to solve >>>>> it. >>>>> Do you have benchmark results of any kind? >>>> >>>> >>>> >>>> Yes, there is a ~20% drop in message rates when there are concurrent >>>> BlueFlame writes to separate bfregs on the same UAR page. >>> >>> >>> >>> Can you please share your test code to help us make sure that you are >>> really >>> referring to the above case with the below analysis ? >> >> >> I have attached my benchmark code. The critical path of interest is >> lines 554-598. In the README, I have included an example of how to run >> the benchmark. Let me know if you have any questions/concerns >> regarding the benchmark code. >> >>>> The graph attached reports message rates using rdma-core for 2-byte >>>> RDMA-writes using 16 threads. 
Each thread is driving its own QP. Each >>>> thread has its own CQ. Thread Domains are not used in this benchmark. >>> >>> >>> >>> Can you try to use in your test TDs and see if you get the same results >>> before your initial patch ? this mode cleanly guarantees the 1<->1 UAR >>> bfreg >>> to a QP. >> >> >> I will most likely have the numbers with TDs in the second half of the >> week. I will report them here then. >> > > Yes, please share your results with the TDs usage with and without your > patch, this may help clarification the issue. > >>>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing >>>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame" >>>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2" >>>> means the size of the linked-list of WQEs is 2. >>> >>> >>> >>> Looking at your graph, the best results are wPostlist2-wBF, correct ? but >>> in >>> that case we don't expect BF at all but DB as you wrote below. Can you >>> please clarify the test and the results that are represented here ? >> >> >> Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I >> included the various lines to show differences in behavior. The >> semantics of Verbs-users may or may not allow the use of features such >> as postlist. >> > > So what in the graph referred to your initial patch improvements ? the green > line was a DB test and not a BF results. The green line is without Postlist. So, the number of WQEs per ibv_post_send is 1. In this case, rdma-core uses BF, not DB. The graph doesn't show improvements from my patches; I'm just showing the current behavior under different scenarios: "wPostlist2-wBF" means DB is used on WC pages; "woPostlistwBF" means BF is used on WC pages; "woPostlistwoBF" means sending 1 WQE per ibv_post_send on UC pages. Hope this is clearer. Please let me know if you have more clarification questions. -Rohit -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
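As a rough illustration of the BF-versus-DB distinction Rohit describes, and only as an illustration, the sketch below contrasts the two doorbell flavours. The struct and field names are invented for the sketch, and memory barriers and locking are deliberately left out; this is not the mlx5 provider code.

#include <stdint.h>
#include <string.h>

/* Invented bookkeeping for illustration only -- not the provider's struct. */
struct bf_sketch {
	uint8_t	    *reg;	/* mapped UAR page (write-combining) */
	unsigned int offset;	/* which BlueFlame half to use next */
	unsigned int buf_size;	/* size of one BlueFlame buffer */
};

static void ring_doorbell_sketch(struct bf_sketch *bf, void *ctrl,
				 unsigned int size, int single_small_wqe)
{
	if (single_small_wqe && size <= bf->buf_size) {
		/* "wBF": the whole WQE is pushed through the WC page, so the
		 * NIC does not need to DMA-read it from host memory. */
		memcpy(bf->reg + bf->offset, ctrl, size);
	} else {
		/* "wPostlist"/DB: only 8 bytes are written; the NIC fetches
		 * the WQE(s) from host memory itself. */
		*(volatile uint64_t *)(bf->reg + bf->offset) = *(uint64_t *)ctrl;
	}
	bf->offset ^= bf->buf_size;	/* alternate the even/odd halves */
}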
On Mon, May 7, 2018 at 11:52 AM, Rohit Zambre <rzambre@uci.edu> wrote: > On Mon, May 7, 2018 at 10:24 AM, Yishai Hadas > <yishaih@dev.mellanox.co.il> wrote: >> On 5/7/2018 7:26 AM, Rohit Zambre wrote: >>> >>> On Sun, May 6, 2018 at 7:47 AM, Yishai Hadas <yishaih@dev.mellanox.co.il> >>> wrote: >>>> >>>> On 5/4/2018 12:46 AM, Rohit Zambre wrote: >>>>> >>>>> >>>>> On Thu, May 3, 2018 at 3:15 PM, Alex Rosenbaum <rosenbaumalex@gmail.com> >>>>> wrote: >>>>>> >>>>>> >>>>>> On Thu, May 3, 2018 at 6:19 PM, Rohit Zambre <rzambre@uci.edu> wrote: >>>>>>> >>>>>>> >>>>>>> An independent communication path is one that shares no hardware >>>>>>> resources >>>>>>> with other communication paths. From a Verbs perspective, an >>>>>>> independent >>>>>>> path is the one obtained by the first QP in a context. The next QPs of >>>>>>> the >>>>>>> context may or may not share hardware resources amongst themselves; >>>>>>> the >>>>>>> mapping of the resources to the QPs is provider-specific. Sharing >>>>>>> resources >>>>>>> can hurt throughput in certain cases. When only one thread uses the >>>>>>> independent path, we term it an uncontended independent path. >>>>>>> >>>>>>> Today, the user has no way to request for an independent path for an >>>>>>> arbitrary QP within a context. To create multiple independent paths, >>>>>>> the >>>>>>> Verbs user must create mulitple contexts with 1 QP per context. >>>>>>> However, >>>>>>> this translates to significant hardware-resource wastage: 89% in the >>>>>>> case >>>>>>> of the ConnectX-4 mlx5 device. >>>>>>> >>>>>>> This RFC patch allows the user to request for uncontended independent >>>>>>> communication paths in Verbs through an "independent" flag during >>>>>>> Thread >>>>>>> Domain (TD) creation. The patch also provides a first-draft >>>>>>> implementation >>>>>>> of uncontended independent paths in the mlx5 provider. >>>>>>> >>>>>>> In mlx5, every even-odd pair of TDs share the same UAR page, which is >>>>>>> not >>>>>>> case when the user creates multiple contexts with one TD per context. >>>>>>> When >>>>>>> the user requests for an independent TD, the driver will dynamically >>>>>>> allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 >>>>>>> of >>>>>>> the UAR belonging to an independent TD is never used and is >>>>>>> essentially >>>>>>> wasted. Hence, there must be a maximum number of independent paths >>>>>>> allowed >>>>>>> within a context since the hardware resources are limited. This would >>>>>>> be >>>>>>> half of the maximum number of dynamic UARs allowed per context. >>>>>> >>>>>> >>>>>> >>>>>> I'm not sure I follow what you're trying to achieve here on the mlx5 HW >>>>>> level. >>>>>> Are you assuming that two threads with seperate 'indep-comm-paths' >>>>>> using separate bfreg on the same UAR page causes some contention and >>>>>> performance hit in the mlx5 HW? >>>>>> We should first prove that's true, and then design a solution to solve >>>>>> it. >>>>>> Do you have benchmark results of any kind? >>>>> >>>>> >>>>> >>>>> Yes, there is a ~20% drop in message rates when there are concurrent >>>>> BlueFlame writes to separate bfregs on the same UAR page. >>>> >>>> >>>> >>>> Can you please share your test code to help us make sure that you are >>>> really >>>> referring to the above case with the below analysis ? >>> >>> >>> I have attached my benchmark code. The critical path of interest is >>> lines 554-598. In the README, I have included an example of how to run >>> the benchmark. 
Let me know if you have any questions/concerns >>> regarding the benchmark code. >>> >>>>> The graph attached reports message rates using rdma-core for 2-byte >>>>> RDMA-writes using 16 threads. Each thread is driving its own QP. Each >>>>> thread has its own CQ. Thread Domains are not used in this benchmark. >>>> >>>> >>>> >>>> Can you try to use in your test TDs and see if you get the same results >>>> before your initial patch ? this mode cleanly guarantees the 1<->1 UAR >>>> bfreg >>>> to a QP. >>> >>> >>> I will most likely have the numbers with TDs in the second half of the >>> week. I will report them here then. >>> >> >> Yes, please share your results with the TDs usage with and without your >> patch, this may help clarification the issue. I have attached the graph which shows sender-receiver, 2-byte RDMA_WRITE message rates with Thread Domains and the effects of my patches. All the lines in the graph are with 16 threads using BlueFlame writes i.e. no postlist. The lines represent the following: woTD: base without using Thread Domains i.e. using the statically allocated UARs wTD: base using Thread Domains; 1 TD for each QP wTD-nolocks: base using TD + [1] wTD-independent: base using TD + this RFC patch wTD-nolocks-independent: base using TD + [1] + this RFC patch where, base = rdma-core@1eee3c837e0290f1ac7e5ac453ed69e8fd927aab [1] = https://patchwork.kernel.org/patch/10374625/ This is from the Gomez machines of the JLSE cluster which host a ConnectX-4 card. The dedicated nodes are running the 4.16 elrepo kernel (http://elrepo.org/linux/kernel/el7/SPECS/kernel-ml-4.16.spec) I had initially attributed the 8-way to 16-way drop to the contention between the 5th and the 16th QP in the original case of using just the statically allocated UARs. However, it seems like something more is happening since the drop exists even in the "wTD-nolocks-independent" line where we have no locks, no UAR-sharing. From perf, I see there is 36% increase in L1-dcache-load-misses going from 8-way to 16-way in "wTD-nolocks-independent". I'm still investigating but let me know if you have any ideas/thoughts. >>>>> The x-axis is the ratio of #QPs:#CTXs. For example, 2-way CTX-sharing >>>>> means there are 8 CTXs with 2 QPs each. "wBF" means "with BlueFlame" >>>>> and "woBF" means without (by setting MLX5_SHUT_UP_BF=1). "wPostlist2" >>>>> means the size of the linked-list of WQEs is 2. >>>> >>>> >>>> >>>> Looking at your graph, the best results are wPostlist2-wBF, correct ? but >>>> in >>>> that case we don't expect BF at all but DB as you wrote below. Can you >>>> please clarify the test and the results that are represented here ? >>> >>> >>> Yes, correct. With postlist, rdma-core doesn't use BF, just DB. I >>> included the various lines to show differences in behavior. The >>> semantics of Verbs-users may or may not allow the use of features such >>> as postlist. >>> >> >> So what in the graph referred to your initial patch improvements ? the green >> line was a DB test and not a BF results. > > The green line is without Postlist. So, the number of WQEs per > ibv_post_send is 1. In this case, rdma-core uses BF, not DB. The graph > doesn't show improvements from my patches; I'm just showing the > current behavior under different scenarios: "wPostlist2-wBF" means DB > is used on WC pages; "woPostlistwBF" means BF is used on WC pages; > "woPostlistwoBF" means sending 1 WQE per ibv_post_send on UC pages. > Hope this is clearer. Please let me know if you have more > clarification questions. > > -Rohit
diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
index eb57824..b5fa56f 100644
--- a/libibverbs/verbs.h
+++ b/libibverbs/verbs.h
@@ -561,6 +561,7 @@ struct ibv_pd {
 };
 
 struct ibv_td_init_attr {
+	int		independent;
 	uint32_t	comp_mask;
 };
 
diff --git a/providers/mlx5/mlx5.c b/providers/mlx5/mlx5.c
index 3a3fc47..b8fa5ce 100644
--- a/providers/mlx5/mlx5.c
+++ b/providers/mlx5/mlx5.c
@@ -1056,6 +1056,9 @@ static struct verbs_context *mlx5_alloc_context(struct ibv_device *ibdev,
 	context->max_srq_recv_wr = resp.max_srq_recv_wr;
 	context->num_dyn_bfregs = resp.num_dyn_bfregs;
 
+	context->max_ind_dyn_paths = context->num_dyn_bfregs / MLX5_NUM_NON_FP_BFREGS_PER_UAR / 2;
+	context->count_ind_dyn_paths = 0;
+
 	if (context->num_dyn_bfregs) {
 		context->count_dyn_bfregs = calloc(context->num_dyn_bfregs,
						   sizeof(*context->count_dyn_bfregs));
diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index f0f376c..74bf10d 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -295,6 +295,8 @@ struct mlx5_context {
 	uint16_t	flow_action_flags;
 	uint64_t	max_dm_size;
 	uint32_t	eth_min_inline_size;
+	uint32_t	max_ind_dyn_paths;
+	uint32_t	count_ind_dyn_paths;
 };
 
 struct mlx5_bitmap {
diff --git a/providers/mlx5/verbs.c b/providers/mlx5/verbs.c
index 71728c8..b28ed9e 100644
--- a/providers/mlx5/verbs.c
+++ b/providers/mlx5/verbs.c
@@ -164,19 +164,32 @@ static void mlx5_put_bfreg_index(struct mlx5_context *ctx, uint32_t bfreg_dyn_in
 {
 	pthread_mutex_lock(&ctx->dyn_bfregs_mutex);
 	ctx->count_dyn_bfregs[bfreg_dyn_index]--;
+	if (bfreg_dyn_index < ctx->max_ind_dyn_paths * MLX5_NUM_NON_FP_BFREGS_PER_UAR)
+		ctx->count_ind_dyn_paths--;
 	pthread_mutex_unlock(&ctx->dyn_bfregs_mutex);
 }
 
-static int mlx5_get_bfreg_index(struct mlx5_context *ctx)
+static int mlx5_get_bfreg_index(struct mlx5_context *ctx, int independent)
 {
 	int i;
 
 	pthread_mutex_lock(&ctx->dyn_bfregs_mutex);
-	for (i = 0; i < ctx->num_dyn_bfregs; i++) {
-		if (!ctx->count_dyn_bfregs[i]) {
-			ctx->count_dyn_bfregs[i]++;
-			pthread_mutex_unlock(&ctx->dyn_bfregs_mutex);
-			return i;
+	if (independent) {
+		for (i = 0; i < ctx->max_ind_dyn_paths * MLX5_NUM_NON_FP_BFREGS_PER_UAR; i+=MLX5_NUM_NON_FP_BFREGS_PER_UAR) {
+			if (!ctx->count_dyn_bfregs[i]) {
+				ctx->count_dyn_bfregs[i]++;
+				ctx->count_ind_dyn_paths++;
+				pthread_mutex_unlock(&ctx->dyn_bfregs_mutex);
+				return i;
+			}
+		}
+	} else {
+		for (i = ctx->max_ind_dyn_paths * MLX5_NUM_NON_FP_BFREGS_PER_UAR; i < ctx->num_dyn_bfregs; i++) {
+			if (!ctx->count_dyn_bfregs[i]) {
+				ctx->count_dyn_bfregs[i]++;
+				pthread_mutex_unlock(&ctx->dyn_bfregs_mutex);
+				return i;
+			}
 		}
 	}
 
@@ -186,7 +199,7 @@ static int mlx5_get_bfreg_index(struct mlx5_context *ctx)
 }
 
 /* Returns a dedicated BF to be used by a thread domain */
-static struct mlx5_bf *mlx5_attach_dedicated_bf(struct ibv_context *context)
+static struct mlx5_bf *mlx5_attach_dedicated_bf(struct ibv_context *context, int independent)
 {
 	struct mlx5_uar_info uar;
 	struct mlx5_context *ctx = to_mctx(context);
@@ -198,7 +211,7 @@ static struct mlx5_bf *mlx5_attach_dedicated_bf(struct ibv_context *context)
 	int mmap_bf_index;
 	int num_bfregs_per_page;
 
-	bfreg_dyn_index = mlx5_get_bfreg_index(ctx);
+	bfreg_dyn_index = mlx5_get_bfreg_index(ctx, independent);
 	if (bfreg_dyn_index < 0) {
 		errno = ENOENT;
 		return NULL;
@@ -212,13 +225,15 @@ static struct mlx5_bf *mlx5_attach_dedicated_bf(struct ibv_context *context)
 	num_bfregs_per_page = ctx->num_uars_per_page * MLX5_NUM_NON_FP_BFREGS_PER_UAR;
 	uar_page_index = bfreg_dyn_index / num_bfregs_per_page;
 
-	/* The first bf index of each page will hold the mapped area address of the UAR */
-	mmap_bf_index = ctx->start_dyn_bfregs_index + (uar_page_index * num_bfregs_per_page);
+	if (!independent) {
+		/* The first bf index of each page will hold the mapped area address of the UAR */
+		mmap_bf_index = ctx->start_dyn_bfregs_index + (uar_page_index * num_bfregs_per_page);
 
-	pthread_mutex_lock(&ctx->dyn_bfregs_mutex);
-	if (ctx->bfs[mmap_bf_index].uar) {
-		/* UAR was already mapped, set its matching bfreg */
-		goto set_reg;
+		pthread_mutex_lock(&ctx->dyn_bfregs_mutex);
+		if (ctx->bfs[mmap_bf_index].uar) {
+			/* UAR was already mapped, set its matching bfreg */
+			goto set_reg;
+		}
 	}
 
 	ctx->bfs[mmap_bf_index].uar = mlx5_mmap(&uar, uar_page_index, context->cmd_fd, dev->page_size,
@@ -261,19 +276,25 @@ static void mlx5_detach_dedicated_bf(struct ibv_context *context, struct mlx5_bf
 struct ibv_td *mlx5_alloc_td(struct ibv_context *context, struct ibv_td_init_attr *init_attr)
 {
 	struct mlx5_td	*td;
+	struct mlx5_context *mctx = to_mctx(context);
 
 	if (init_attr->comp_mask) {
 		errno = EINVAL;
 		return NULL;
 	}
 
+	if (init_attr->independent && (mctx->count_ind_dyn_paths >= mctx->max_ind_dyn_paths)) {
+		errno = EINVAL;
+		return NULL;
+	}
+
 	td = calloc(1, sizeof(*td));
 	if (!td) {
 		errno = ENOMEM;
 		return NULL;
 	}
 
-	td->bf = mlx5_attach_dedicated_bf(context);
+	td->bf = mlx5_attach_dedicated_bf(context, init_attr->independent);
 	if (!td->bf) {
 		free(td);
 		return NULL;
An independent communication path is one that shares no hardware resources with other communication paths. From a Verbs perspective, an independent path is the one obtained by the first QP in a context. The next QPs of the context may or may not share hardware resources amongst themselves; the mapping of the resources to the QPs is provider-specific. Sharing resources can hurt throughput in certain cases. When only one thread uses the independent path, we term it an uncontended independent path.

Today, the user has no way to request an independent path for an arbitrary QP within a context. To create multiple independent paths, the Verbs user must create multiple contexts with 1 QP per context. However, this translates to significant hardware-resource wastage: 89% in the case of the ConnectX-4 mlx5 device.

This RFC patch allows the user to request uncontended independent communication paths in Verbs through an "independent" flag during Thread Domain (TD) creation. The patch also provides a first-draft implementation of uncontended independent paths in the mlx5 provider.

In mlx5, every even-odd pair of TDs shares the same UAR page, which is not the case when the user creates multiple contexts with one TD per context. When the user requests an independent TD, the driver will dynamically allocate a new UAR page and map bfreg_0 of that UAR to the TD. bfreg_1 of the UAR belonging to an independent TD is never used and is essentially wasted. Hence, there must be a maximum number of independent paths allowed within a context, since the hardware resources are limited. This would be half of the maximum number of dynamic UARs allowed per context.

Signed-off-by: Rohit Zambre <rzambre@uci.edu>
---
 libibverbs/verbs.h     |  1 +
 providers/mlx5/mlx5.c  |  3 +++
 providers/mlx5/mlx5.h  |  2 ++
 providers/mlx5/verbs.c | 51 +++++++++++++++++++++++++++++++++++---------------
 4 files changed, 42 insertions(+), 15 deletions(-)
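For completeness, this is how a Verbs user would request an uncontended independent path with the interface as proposed in this RFC (prior to the comp_mask rework requested in review). With the mlx5 implementation above, allocation fails with EINVAL once the per-context limit of independent paths (half the dynamic UARs) is reached.

#include <errno.h>
#include <stdio.h>
#include <infiniband/verbs.h>

static struct ibv_td *alloc_independent_td(struct ibv_context *ctx)
{
	struct ibv_td_init_attr attr = {
		.independent = 1,	/* field added by this patch */
		.comp_mask   = 0,
	};
	struct ibv_td *td = ibv_alloc_td(ctx, &attr);

	if (!td && errno == EINVAL)
		fprintf(stderr, "no independent paths left in this context\n");
	return td;
}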