| Message ID | 20190503111841.4391-6-trond.myklebust@hammerspace.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | bh-safe lock removal for SUNRPC |
> On May 3, 2019, at 7:18 AM, Trond Myklebust <trondmy@gmail.com> wrote:
>
> Allow more time for softirqd

Have you thought about performance tests for this one?

> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c7e81336620c..6b37c9a4b48f 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1253,7 +1253,7 @@ static int rpciod_start(void)
>  		goto out_failed;
>  	rpciod_workqueue = wq;
>  	/* Note: highpri because network receive is latency sensitive */

The above comment should be deleted as well.

> -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
> +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  	if (!wq)
>  		goto free_rpciod;
>  	xprtiod_workqueue = wq;

--
Chuck Lever
Following up on this. Now with even more data!

> On May 6, 2019, at 4:41 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On May 3, 2019, at 7:18 AM, Trond Myklebust <trondmy@gmail.com> wrote:
>>
>> Allow more time for softirqd
>
> Have you thought about performance tests for this one?

I tested this series on my 12-core two-socket client using a variety
of tests including iozone, fio, and fstests. The network under test
is 56Gb InfiniBand (TCP uses IPoIB). I tested both TCP and RDMA.

With lock debugging and memory leak testing enabled, I did not see
any functional regressions or new leaks or crashes. Thus IMO this
series is "safe to apply."

With TCP, I saw no change in performance between a "stock" kernel
and one with all five patches in this series applied, as, IIRC,
you predicted.

The following discussion is based on testing with NFS/RDMA.

With RDMA, I saw an improvement of 5-10% in IOPS rate between the
"stock" kernel and a kernel with the first four patches applied. When
the fifth patch is applied, I saw IOPS throughput significantly worse
than "stock" -- like 20% worse.

I also studied average RPC execution time (the "execute" metric) with
the "stock" kernel, the one with four patches applied, and the one
where all five are applied. The workload is 100% 4KB READs with an
iodepth of 1024 in order to saturate the transmit queue.

With four patches, the execute time is about 2.5 msec faster (average
execution time is around 75 msec due to the large backlog this test
generates). With five patches, it's slower than "stock" by 12 msec.

I also saw a 30 usec improvement in the average latency of
xprt_complete_rqst with the four-patch series.

As far as I can tell, the benefit of this series comes mostly from
the third patch, which changes spin_lock_bh(&xprt->transport_lock) to
spin_lock(&xprt->transport_lock). When the xprtiod work queue is
lowered in priority in 5/5, that benefit vanishes.

I am still confused about why 5/5 is needed. I did not see any soft
lockups without this patch applied when using RDMA. Is the issue
with xprtsock's use of xprtiod for handling incoming TCP receives?

I still have some things I'd like to look at. One thing I haven't
yet tried is looking at lock_stat, which would confirm or refute
my theory that this is all about the transport_lock, for instance.

>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>> ---
>> net/sunrpc/sched.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
[...]

--
Chuck Lever
On Tue, 2019-05-28 at 15:03 -0400, Chuck Lever wrote:
> Following up on this. Now with even more data!
>
[...]
>
> As far as I can tell, the benefit of this series comes mostly from
> the third patch, which changes spin_lock_bh(&xprt->transport_lock) to
> spin_lock(&xprt->transport_lock). When the xprtiod work queue is
> lowered in priority in 5/5, that benefit vanishes.
>
> I am still confused about why 5/5 is needed. I did not see any soft
> lockups without this patch applied when using RDMA. Is the issue
> with xprtsock's use of xprtiod for handling incoming TCP receives?
>
> I still have some things I'd like to look at. One thing I haven't
> yet tried is looking at lock_stat, which would confirm or refute
> my theory that this is all about the transport_lock, for instance.

OK. I can drop 5/5.

The issue there was not about soft lockups. However, since we were
previously running most soft irqs as part of spin_unlock_bh(), the
question was whether or not we would see more of them needing to move
to softirqd. As far as I can see, your answer to that question is "no"
(at least for your system).

Cheers,
Trond
> On May 28, 2019, at 3:33 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Tue, 2019-05-28 at 15:03 -0400, Chuck Lever wrote:
[...]
>
> OK. I can drop 5/5.
>
> The issue there was not about soft lockups. However since we were
> previously running most soft irqs as part of spin_unlock_bh(), the
> question was whether or not we would see more of them needing to move
> to softirqd. As far as I can see, your answer to that question is 'no'
> (at least for your system).

The top contended lock now is the work queue lock. I believe that's a
full irqsave lock. Someone should try testing on a single-core system.

I also plan to try this series on my mlx5_en system. The mlx5 Ethernet
driver does a lot more work in soft IRQ than mlx4/IB does.

--
Chuck Lever
On Fri, May 3, 2019 at 7:24 AM Trond Myklebust <trondmy@gmail.com> wrote:
>
> Allow more time for softirqd
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c7e81336620c..6b37c9a4b48f 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1253,7 +1253,7 @@ static int rpciod_start(void)
>  		goto out_failed;
>  	rpciod_workqueue = wq;
>  	/* Note: highpri because network receive is latency sensitive */
> -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);

I thought we needed UNBOUND; otherwise there was performance
degradation for read I/O.

> +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  	if (!wq)
>  		goto free_rpciod;
>  	xprtiod_workqueue = wq;
> --
> 2.21.0
> On May 28, 2019, at 3:52 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On May 28, 2019, at 3:33 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>
[...]
>>
>> OK. I can drop 5/5.
>>
>> The issue there was not about soft lockups. However since we were
>> previously running most soft irqs as part of spin_unlock_bh(), the
>> question was whether or not we would see more of them needing to move
>> to softirqd. As far as I can see, your answer to that question is 'no'
>> (at least for your system).
>
> The top contended lock now is the work queue lock. I believe that's a
> full irqsave lock. Someone should try testing on a single core system.
>
> I also plan to try this series on my mlx5_en system. The mlx5 Ethernet
> driver does a lot more work in soft IRQ than mlx4/IB does.

I tested with CX-5 RoCE on 100GbE. I don't see any obvious signs of
soft IRQ starvation. With 8 threads on a 4-core client, I was able to
push the 4KB random read fio workload past 300 KIOPS.

--
Chuck Lever
On Tue, May 28, 2019 at 4:10 PM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Fri, May 3, 2019 at 7:24 AM Trond Myklebust <trondmy@gmail.com> wrote:
> >
[...]
> > -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
>
> I thought we needed UNBOUND otherwise there was performance
> degradation for read IO.

I withdraw my objection, as this is for the xprtiod queue and not the
rpciod queue. The latter is the one where removing WQ_UNBOUND would
leave only a single rpciod thread doing all the crypto and thus hurt
performance.

> > +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
> >  	if (!wq)
> >  		goto free_rpciod;
> >  	xprtiod_workqueue = wq;
> > --
> > 2.21.0
On Wed, 2019-05-29 at 14:38 -0400, Olga Kornievskaia wrote:
> On Tue, May 28, 2019 at 4:10 PM Olga Kornievskaia <aglo@umich.edu> wrote:
> > On Fri, May 3, 2019 at 7:24 AM Trond Myklebust <trondmy@gmail.com> wrote:
[...]
> > I thought we needed UNBOUND otherwise there was performance
> > degradation for read IO.
>
> I remove my objection as this is for the xprtiod queue and not the
> rpciod queue. The latter is the one when removing WQ_UNBOUND would
> only use a single rpciod thread for doing all the crypto and thus
> impact performance.

The patch was only removing the WQ_HIGHPRI flag, not WQ_UNBOUND.
However, the point is moot, as I'm dropping the patch.

Cheers,
Trond
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c7e81336620c..6b37c9a4b48f 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1253,7 +1253,7 @@ static int rpciod_start(void)
 		goto out_failed;
 	rpciod_workqueue = wq;
 	/* Note: highpri because network receive is latency sensitive */
-	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
+	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!wq)
 		goto free_rpciod;
 	xprtiod_workqueue = wq;
Allow more time for softirqd

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 net/sunrpc/sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)