| Message ID | 20190503111841.4391-6-trond.myklebust@hammerspace.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | bh-safe lock removal for SUNRPC |
> On May 3, 2019, at 7:18 AM, Trond Myklebust <trondmy@gmail.com> wrote:
>
> Allow more time for softirqd

Have you thought about performance tests for this one?

> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c7e81336620c..6b37c9a4b48f 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1253,7 +1253,7 @@ static int rpciod_start(void)
>  		goto out_failed;
>  	rpciod_workqueue = wq;
>  	/* Note: highpri because network receive is latency sensitive */

The above comment should be deleted as well.

> -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
> +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  	if (!wq)
>  		goto free_rpciod;
>  	xprtiod_workqueue = wq;

--
Chuck Lever
Following up on this. Now with even more data!

> On May 6, 2019, at 4:41 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On May 3, 2019, at 7:18 AM, Trond Myklebust <trondmy@gmail.com> wrote:
>>
>> Allow more time for softirqd
>
> Have you thought about performance tests for this one?

I tested this series on my 12-core two-socket client using a variety
of tests including iozone, fio, and fstests. The network under test
is 56Gb InfiniBand (TCP uses IPoIB). I tested both TCP and RDMA.

With lock debugging and memory leak testing enabled, I did not see
any functional regressions or new leaks or crashes. Thus IMO this
series is "safe to apply."

With TCP, I saw no change in performance between a "stock" kernel
and one with all five patches in this series applied, as, IIRC,
you predicted.

The following discussion is based on testing with NFS/RDMA.

With RDMA, I saw an improvement of 5-10% in IOPS rate between the
"stock" kernel and a kernel with the first four patches applied. When
the fifth patch is applied, I saw IOPS throughput significantly worse
than "stock" -- like 20% worse.

I also studied average RPC execution time (the "execute" metric) with
the "stock" kernel, the one with four patches applied, and the one
where all five are applied. The workload is 100% 4KB READs with an
iodepth of 1024 in order to saturate the transmit queue.

With four patches, the execute time is about 2.5 msec faster (average
execution time is around 75 msec due to the large backlog this test
generates). With five patches, it's slower than "stock" by 12 msec.

I also saw a 30 usec improvement in the average latency of
xprt_complete_rqst with the four-patch series.

As far as I can tell, the benefit of this series comes mostly from
the third patch, which changes spin_lock_bh(&xprt->transport_lock) to
spin_lock(&xprt->transport_lock). When the xprtiod work queue is
lowered in priority in 5/5, that benefit vanishes.

I am still confused about why 5/5 is needed. I did not see any soft
lockups without this patch applied when using RDMA. Is the issue
with xprtsock's use of xprtiod for handling incoming TCP receives?

I still have some things I'd like to look at. One thing I haven't
yet tried is looking at lock_stat, which would confirm or refute
my theory that this is all about the transport_lock, for instance.

>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>> ---
>> net/sunrpc/sched.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
[...]

--
Chuck Lever
On Tue, 2019-05-28 at 15:03 -0400, Chuck Lever wrote:
> Following up on this. Now with even more data!
>
[...]
>
> As far as I can tell, the benefit of this series comes mostly from
> the third patch, which changes spin_lock_bh(&xprt->transport_lock) to
> spin_lock(&xprt->transport_lock). When the xprtiod work queue is
> lowered in priority in 5/5, that benefit vanishes.
>
> I am still confused about why 5/5 is needed. I did not see any soft
> lockups without this patch applied when using RDMA. Is the issue
> with xprtsock's use of xprtiod for handling incoming TCP receives?
>
> I still have some things I'd like to look at. One thing I haven't
> yet tried is looking at lock_stat, which would confirm or refute
> my theory that this is all about the transport_lock, for instance.

OK. I can drop 5/5.

The issue there was not about soft lockups. However, since we were
previously running most soft irqs as part of spin_unlock_bh(), the
question was whether or not we would see more of them needing to move
to softirqd. As far as I can see, your answer to that question is "no"
(at least for your system).

Cheers,
Trond
> On May 28, 2019, at 3:33 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Tue, 2019-05-28 at 15:03 -0400, Chuck Lever wrote:
[...]
>
> OK. I can drop 5/5.
>
> The issue there was not about soft lockups. However since we were
> previously running most soft irqs as part of spin_unlock_bh(), the
> question was whether or not we would see more of them needing to move
> to softirqd. As far as I can see, your answer to that question is 'no'
> (at least for your system).

The top contended lock now is the work queue lock. I believe that's a
full irqsave lock. Someone should try testing on a single-core system.

I also plan to try this series on my mlx5_en system. The mlx5 Ethernet
driver does a lot more work in soft IRQ than mlx4/IB does.

--
Chuck Lever
On Fri, May 3, 2019 at 7:24 AM Trond Myklebust <trondmy@gmail.com> wrote:
>
> Allow more time for softirqd
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c7e81336620c..6b37c9a4b48f 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1253,7 +1253,7 @@ static int rpciod_start(void)
>  		goto out_failed;
>  	rpciod_workqueue = wq;
>  	/* Note: highpri because network receive is latency sensitive */
> -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);

I thought we needed UNBOUND; otherwise there was performance
degradation for read I/O.

> +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  	if (!wq)
>  		goto free_rpciod;
>  	xprtiod_workqueue = wq;
> --
> 2.21.0
> On May 28, 2019, at 3:52 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On May 28, 2019, at 3:33 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>
[...]
>>
>> OK. I can drop 5/5.
>>
>> The issue there was not about soft lockups. However since we were
>> previously running most soft irqs as part of spin_unlock_bh(), the
>> question was whether or not we would see more of them needing to move
>> to softirqd. As far as I can see, your answer to that question is 'no'
>> (at least for your system).
>
> The top contended lock now is the work queue lock. I believe that's a
> full irqsave lock. Someone should try testing on a single core system.
>
> I also plan to try this series on my mlx5_en system. The mlx5 Ethernet
> driver does a lot more work in soft IRQ than mlx4/IB does.

I tested with CX-5 RoCE on 100GbE. I don't see any obvious signs of
soft IRQ starvation. With 8 threads on a 4-core client, I was able to
push the 4KB random read fio workload past 300 KIOPS.

--
Chuck Lever
On Tue, May 28, 2019 at 4:10 PM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Fri, May 3, 2019 at 7:24 AM Trond Myklebust <trondmy@gmail.com> wrote:
> >
[...]
> > -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
>
> I thought we needed UNBOUND otherwise there was performance
> degradation for read IO.

I withdraw my objection, as this is for the xprtiod queue and not the
rpciod queue. The latter is the one where removing WQ_UNBOUND would
leave only a single rpciod thread doing all the crypto and thus hurt
performance.

> > +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
> >  	if (!wq)
> >  		goto free_rpciod;
> >  	xprtiod_workqueue = wq;
> > --
> > 2.21.0
On Wed, 2019-05-29 at 14:38 -0400, Olga Kornievskaia wrote:
> On Tue, May 28, 2019 at 4:10 PM Olga Kornievskaia <aglo@umich.edu> wrote:
> > On Fri, May 3, 2019 at 7:24 AM Trond Myklebust <trondmy@gmail.com> wrote:
[...]
> > I thought we needed UNBOUND otherwise there was performance
> > degradation for read IO.
>
> I remove my objection as this is for the xprtiod queue and not the
> rpciod queue. The latter is the one when removing WQ_UNBOUND would
> only use a single rpciod thread for doing all the crypto and thus
> impact performance.

The patch was only removing the WQ_HIGHPRI flag, not WQ_UNBOUND.
However, the point is moot, as I'm dropping the patch.

Cheers,
Trond
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c7e81336620c..6b37c9a4b48f 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1253,7 +1253,7 @@ static int rpciod_start(void)
 		goto out_failed;
 	rpciod_workqueue = wq;
 	/* Note: highpri because network receive is latency sensitive */
-	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
+	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!wq)
 		goto free_rpciod;
 	xprtiod_workqueue = wq;
Allow more time for softirqd

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 net/sunrpc/sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)