[v3,0/2] uprobes: Improve scalability by reducing the contention on siglock

Message ID	20240815014629.2685155-1-liaochang1@huawei.com (mailing list archive)
Headers	show Received: from szxga08-in.huawei.com (szxga08-in.huawei.com [45.249.212.255]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EB90917C9; Thu, 15 Aug 2024 01:55:57 +0000 (UTC) From: Liao Chang <liaochang1@huawei.com> To: <mhiramat@kernel.org>, <oleg@redhat.com>, <peterz@infradead.org>, <mingo@redhat.com>, <acme@kernel.org>, <namhyung@kernel.org>, <mark.rutland@arm.com>, <alexander.shishkin@linux.intel.com>, <jolsa@kernel.org>, <irogers@google.com>, <adrian.hunter@intel.com>, <kan.liang@linux.intel.com> CC: <linux-kernel@vger.kernel.org>, <linux-trace-kernel@vger.kernel.org>, <linux-perf-users@vger.kernel.org>, <bpf@vger.kernel.org> Subject: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Date: Thu, 15 Aug 2024 01:46:27 +0000 Message-ID: <20240815014629.2685155-1-liaochang1@huawei.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit
Series	uprobes: Improve scalability by reducing the contention on siglock \| expand [v3,0/2] uprobes: Improve scalability by reducing the contention on siglock [v3,1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() [v3,2/2] uprobes: Remove the spinlock within handle_singlestep()

Liao, Chang Aug. 15, 2024, 1:46 a.m. UTC

The profiling result of BPF selftest on ARM64 platform reveals the
significant contention on the current->sighand->siglock is the
scalability bottleneck. The reason is also very straightforward that all
producer threads of benchmark have to contend the spinlock mentioned to
resume the TIF_SIGPENDING bit in thread_info that might be removed in
uprobe_deny_signal().

The contention on current->sighand->siglock is unnecessary, this series
remove them thoroughly. I've use the script developed by Andrii in [1]
to run benchmark. The CPU used was Kunpeng916 (Hi1616), 4 NUMA nodes,
64 cores@2.4GHz running the kernel on next tree + the optimization in
[2] for get_xol_insn_slot().

before-opt
----------
uprobe-nop      ( 1 cpus):    0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop      ( 2 cpus):    1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop      ( 4 cpus):    3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop      ( 8 cpus):    4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop      (16 cpus):    3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop      (32 cpus):    2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop      (64 cpus):    1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push     ( 1 cpus):    0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push     ( 2 cpus):    1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push     ( 4 cpus):    3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push     ( 8 cpus):    3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push     (16 cpus):    2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push     (32 cpus):    2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push     (64 cpus):    1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret      ( 1 cpus):    1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret      ( 2 cpus):    3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret      ( 4 cpus):    6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret      ( 8 cpus):    6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret      (16 cpus):    5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret      (32 cpus):    5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret      (64 cpus):    4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
---------
uprobe-nop      ( 1 cpus):    0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop      ( 2 cpus):    1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop      ( 4 cpus):    3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop      ( 8 cpus):    5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop      (16 cpus):    6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop      (32 cpus):    4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop      (64 cpus):    2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push     ( 1 cpus):    0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push     ( 2 cpus):    1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push     ( 4 cpus):    3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push     ( 8 cpus):    4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push     (16 cpus):    5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push     (32 cpus):    5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push     (64 cpus):    2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret      ( 1 cpus):    1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret      ( 2 cpus):    3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret      ( 4 cpus):    6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret      ( 8 cpus):    6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret      (16 cpus):    6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret      (32 cpus):    5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret      (64 cpus):    4.743 ± 0.016M/s  (  0.074M/s/cpu)

Above benchmark results demonstrates a obivious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, the peak throughput
of which are from 4.5M/s to 6.4M/s and 3.3M/s to 5.1M/s individually.

v3->v2:
Renaming the flag in [2/2], s/deny_signal/signal_denied/g.

v2->v1:
Oleg pointed out the _DENY_SIGNAL will be replaced by _ACK upon the
completion of singlestep which leads to handle_singlestep() has no
chance to restore the removed TIF_SIGPENDING [3] and some case in
question. So this revision proposes to use a flag in uprobe_task to
track the denied TIF_SIGPENDING instead of new UPROBE_SSTEP state.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
[3] https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com

Liao Chang (2):
  uprobes: Remove redundant spinlock in uprobe_deny_signal()
  uprobes: Remove the spinlock within handle_singlestep()

 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)

Liao, Chang Sept. 14, 2024, 2:53 a.m. UTC | #1

Hi, Oleg

Kindly ping.

This series have been pending for a month. Is thre any issue I overlook?

Thanks.

在 2024/8/15 9:46, Liao Chang 写道:
> The profiling result of BPF selftest on ARM64 platform reveals the
> significant contention on the current->sighand->siglock is the
> scalability bottleneck. The reason is also very straightforward that all
> producer threads of benchmark have to contend the spinlock mentioned to
> resume the TIF_SIGPENDING bit in thread_info that might be removed in
> uprobe_deny_signal().
> 
> The contention on current->sighand->siglock is unnecessary, this series
> remove them thoroughly. I've use the script developed by Andrii in [1]
> to run benchmark. The CPU used was Kunpeng916 (Hi1616), 4 NUMA nodes,
> 64 cores@2.4GHz running the kernel on next tree + the optimization in
> [2] for get_xol_insn_slot().
> 
> before-opt
> ----------
> uprobe-nop      ( 1 cpus):    0.907 ± 0.003M/s  (  0.907M/s/cpu)
> uprobe-nop      ( 2 cpus):    1.676 ± 0.008M/s  (  0.838M/s/cpu)
> uprobe-nop      ( 4 cpus):    3.210 ± 0.003M/s  (  0.802M/s/cpu)
> uprobe-nop      ( 8 cpus):    4.457 ± 0.003M/s  (  0.557M/s/cpu)
> uprobe-nop      (16 cpus):    3.724 ± 0.011M/s  (  0.233M/s/cpu)
> uprobe-nop      (32 cpus):    2.761 ± 0.003M/s  (  0.086M/s/cpu)
> uprobe-nop      (64 cpus):    1.293 ± 0.015M/s  (  0.020M/s/cpu)
> 
> uprobe-push     ( 1 cpus):    0.883 ± 0.001M/s  (  0.883M/s/cpu)
> uprobe-push     ( 2 cpus):    1.642 ± 0.005M/s  (  0.821M/s/cpu)
> uprobe-push     ( 4 cpus):    3.086 ± 0.002M/s  (  0.771M/s/cpu)
> uprobe-push     ( 8 cpus):    3.390 ± 0.003M/s  (  0.424M/s/cpu)
> uprobe-push     (16 cpus):    2.652 ± 0.005M/s  (  0.166M/s/cpu)
> uprobe-push     (32 cpus):    2.713 ± 0.005M/s  (  0.085M/s/cpu)
> uprobe-push     (64 cpus):    1.313 ± 0.009M/s  (  0.021M/s/cpu)
> 
> uprobe-ret      ( 1 cpus):    1.774 ± 0.000M/s  (  1.774M/s/cpu)
> uprobe-ret      ( 2 cpus):    3.350 ± 0.001M/s  (  1.675M/s/cpu)
> uprobe-ret      ( 4 cpus):    6.604 ± 0.000M/s  (  1.651M/s/cpu)
> uprobe-ret      ( 8 cpus):    6.706 ± 0.005M/s  (  0.838M/s/cpu)
> uprobe-ret      (16 cpus):    5.231 ± 0.001M/s  (  0.327M/s/cpu)
> uprobe-ret      (32 cpus):    5.743 ± 0.003M/s  (  0.179M/s/cpu)
> uprobe-ret      (64 cpus):    4.726 ± 0.016M/s  (  0.074M/s/cpu)
> 
> after-opt
> ---------
> uprobe-nop      ( 1 cpus):    0.985 ± 0.002M/s  (  0.985M/s/cpu)
> uprobe-nop      ( 2 cpus):    1.773 ± 0.005M/s  (  0.887M/s/cpu)
> uprobe-nop      ( 4 cpus):    3.304 ± 0.001M/s  (  0.826M/s/cpu)
> uprobe-nop      ( 8 cpus):    5.328 ± 0.002M/s  (  0.666M/s/cpu)
> uprobe-nop      (16 cpus):    6.475 ± 0.002M/s  (  0.405M/s/cpu)
> uprobe-nop      (32 cpus):    4.831 ± 0.082M/s  (  0.151M/s/cpu)
> uprobe-nop      (64 cpus):    2.564 ± 0.053M/s  (  0.040M/s/cpu)
> 
> uprobe-push     ( 1 cpus):    0.964 ± 0.001M/s  (  0.964M/s/cpu)
> uprobe-push     ( 2 cpus):    1.766 ± 0.002M/s  (  0.883M/s/cpu)
> uprobe-push     ( 4 cpus):    3.290 ± 0.009M/s  (  0.823M/s/cpu)
> uprobe-push     ( 8 cpus):    4.670 ± 0.002M/s  (  0.584M/s/cpu)
> uprobe-push     (16 cpus):    5.197 ± 0.004M/s  (  0.325M/s/cpu)
> uprobe-push     (32 cpus):    5.068 ± 0.161M/s  (  0.158M/s/cpu)
> uprobe-push     (64 cpus):    2.605 ± 0.026M/s  (  0.041M/s/cpu)
> 
> uprobe-ret      ( 1 cpus):    1.833 ± 0.001M/s  (  1.833M/s/cpu)
> uprobe-ret      ( 2 cpus):    3.384 ± 0.003M/s  (  1.692M/s/cpu)
> uprobe-ret      ( 4 cpus):    6.677 ± 0.004M/s  (  1.669M/s/cpu)
> uprobe-ret      ( 8 cpus):    6.854 ± 0.005M/s  (  0.857M/s/cpu)
> uprobe-ret      (16 cpus):    6.508 ± 0.006M/s  (  0.407M/s/cpu)
> uprobe-ret      (32 cpus):    5.793 ± 0.009M/s  (  0.181M/s/cpu)
> uprobe-ret      (64 cpus):    4.743 ± 0.016M/s  (  0.074M/s/cpu)
> 
> Above benchmark results demonstrates a obivious improvement in the
> scalability of trig-uprobe-nop and trig-uprobe-push, the peak throughput
> of which are from 4.5M/s to 6.4M/s and 3.3M/s to 5.1M/s individually.
> 
> v3->v2:
> Renaming the flag in [2/2], s/deny_signal/signal_denied/g.
> 
> v2->v1:
> Oleg pointed out the _DENY_SIGNAL will be replaced by _ACK upon the
> completion of singlestep which leads to handle_singlestep() has no
> chance to restore the removed TIF_SIGPENDING [3] and some case in
> question. So this revision proposes to use a flag in uprobe_task to
> track the denied TIF_SIGPENDING instead of new UPROBE_SSTEP state.
> 
> [1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
> [2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
> [3] https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com
> 
> Liao Chang (2):
>   uprobes: Remove redundant spinlock in uprobe_deny_signal()
>   uprobes: Remove the spinlock within handle_singlestep()
> 
>  include/linux/uprobes.h |  1 +
>  kernel/events/uprobes.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
>

Oleg Nesterov Sept. 15, 2024, 3:18 p.m. UTC | #2

Hi Liao,

On 09/14, Liao, Chang wrote:
>
> Hi, Oleg
>
> Kindly ping.
>
> This series have been pending for a month. Is thre any issue I overlook?

Well, I have already acked both patches.

Please resend them to Peter/Masami, with my acks included.

Oleg.

Liao, Chang Sept. 18, 2024, 2:05 a.m. UTC | #3

Hi, Peter and Masami

I look forward to your inputs on these series. Andrii has proven they are
hepful for uprobe scalability.

Thanks.

在 2024/9/15 23:18, Oleg Nesterov 写道:
> Hi Liao,
> 
> On 09/14, Liao, Chang wrote:
>>
>> Hi, Oleg
>>
>> Kindly ping.
>>
>> This series have been pending for a month. Is thre any issue I overlook?
> 
> Well, I have already acked both patches.
> 
> Please resend them to Peter/Masami, with my acks included.
> 
> Oleg.
> 
>

Andrii Nakryiko Oct. 11, 2024, 7:34 p.m. UTC | #4

On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
> Hi, Peter and Masami
>
> I look forward to your inputs on these series. Andrii has proven they are
> hepful for uprobe scalability.
>
> Thanks.
>
> 在 2024/9/15 23:18, Oleg Nesterov 写道:
> > Hi Liao,
> >
> > On 09/14, Liao, Chang wrote:
> >>
> >> Hi, Oleg
> >>
> >> Kindly ping.
> >>
> >> This series have been pending for a month. Is thre any issue I overlook?
> >
> > Well, I have already acked both patches.
> >
> > Please resend them to Peter/Masami, with my acks included.
> >

Hey Liao,

I didn't see v4 from you for this patch set with Oleg's acks. Did you
get a chance to rebase, add acks, and send the latest version?

> > Oleg.
> >
> >
>
> --
> BR
> Liao, Chang

Liao, Chang Oct. 21, 2024, 10:43 a.m. UTC | #5

在 2024/10/12 3:34, Andrii Nakryiko 写道:
> On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>> Hi, Peter and Masami
>>
>> I look forward to your inputs on these series. Andrii has proven they are
>> hepful for uprobe scalability.
>>
>> Thanks.
>>
>> 在 2024/9/15 23:18, Oleg Nesterov 写道:
>>> Hi Liao,
>>>
>>> On 09/14, Liao, Chang wrote:
>>>>
>>>> Hi, Oleg
>>>>
>>>> Kindly ping.
>>>>
>>>> This series have been pending for a month. Is thre any issue I overlook?
>>>
>>> Well, I have already acked both patches.
>>>
>>> Please resend them to Peter/Masami, with my acks included.
>>>
> 
> Hey Liao,
> 
> I didn't see v4 from you for this patch set with Oleg's acks. Did you
> get a chance to rebase, add acks, and send the latest version?

Andrii,

I am ready to send v4 based on the latest kernel from next tree. Otherwise,
I haven't heard back from any of maintainers except Oleg, so I'm a bit unsure
if I should make further changes to this series.

> 
>>> Oleg.
>>>
>>>
>>
>> --
>> BR
>> Liao, Chang

Andrii Nakryiko Oct. 21, 2024, 5:18 p.m. UTC | #6

On Mon, Oct 21, 2024 at 3:43 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/10/12 3:34, Andrii Nakryiko 写道:
> > On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Hi, Peter and Masami
> >>
> >> I look forward to your inputs on these series. Andrii has proven they are
> >> hepful for uprobe scalability.
> >>
> >> Thanks.
> >>
> >> 在 2024/9/15 23:18, Oleg Nesterov 写道:
> >>> Hi Liao,
> >>>
> >>> On 09/14, Liao, Chang wrote:
> >>>>
> >>>> Hi, Oleg
> >>>>
> >>>> Kindly ping.
> >>>>
> >>>> This series have been pending for a month. Is thre any issue I overlook?
> >>>
> >>> Well, I have already acked both patches.
> >>>
> >>> Please resend them to Peter/Masami, with my acks included.
> >>>
> >
> > Hey Liao,
> >
> > I didn't see v4 from you for this patch set with Oleg's acks. Did you
> > get a chance to rebase, add acks, and send the latest version?
>
> Andrii,
>
> I am ready to send v4 based on the latest kernel from next tree. Otherwise,
> I haven't heard back from any of maintainers except Oleg, so I'm a bit unsure
> if I should make further changes to this series.
>

Let's just rebase to the latest tip/perf/core and resend with Oleg's
ack. Hopefully this should be enough.

> >
> >>> Oleg.
> >>>
> >>>
> >>
> >> --
> >> BR
> >> Liao, Chang
>
> --
> BR
> Liao, Chang
>

Liao, Chang Oct. 22, 2024, 6:18 a.m. UTC | #7

在 2024/10/22 1:18, Andrii Nakryiko 写道:
> On Mon, Oct 21, 2024 at 3:43 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/10/12 3:34, Andrii Nakryiko 写道:
>>> On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>> Hi, Peter and Masami
>>>>
>>>> I look forward to your inputs on these series. Andrii has proven they are
>>>> hepful for uprobe scalability.
>>>>
>>>> Thanks.
>>>>
>>>> 在 2024/9/15 23:18, Oleg Nesterov 写道:
>>>>> Hi Liao,
>>>>>
>>>>> On 09/14, Liao, Chang wrote:
>>>>>>
>>>>>> Hi, Oleg
>>>>>>
>>>>>> Kindly ping.
>>>>>>
>>>>>> This series have been pending for a month. Is thre any issue I overlook?
>>>>>
>>>>> Well, I have already acked both patches.
>>>>>
>>>>> Please resend them to Peter/Masami, with my acks included.
>>>>>
>>>
>>> Hey Liao,
>>>
>>> I didn't see v4 from you for this patch set with Oleg's acks. Did you
>>> get a chance to rebase, add acks, and send the latest version?
>>
>> Andrii,
>>
>> I am ready to send v4 based on the latest kernel from next tree. Otherwise,
>> I haven't heard back from any of maintainers except Oleg, so I'm a bit unsure
>> if I should make further changes to this series.
>>
> 
> Let's just rebase to the latest tip/perf/core and resend with Oleg's
> ack. Hopefully this should be enough.

OK, the v4 is on the way with Masami's Acked-by.

> 
>>>
>>>>> Oleg.
>>>>>
>>>>>
>>>>
>>>> --
>>>> BR
>>>> Liao, Chang
>>
>> --
>> BR
>> Liao, Chang
>>

[v3,0/2] uprobes: Improve scalability by reducing the contention on siglock

Message

Comments