[0/9] perf: Avoid explicit cpumask var allocation from stack

Message ID: 20240402105610.1695644-1-dawei.li@shingroup.cn

Message

Dawei Li April 2, 2024, 10:56 a.m. UTC
Hi,

This series tries to eliminate direct cpumask var allocation from the
stack in the perf subsystem.

Direct/explicit allocation of a cpumask on the stack can be dangerous
since it can lead to stack overflow on systems with a large NR_CPUS or
with CONFIG_CPUMASK_OFFSTACK=y.

For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
allocate cpumasks and increase supported CPUs to 512").

It's a recurring pattern that almost every cpumask var in the perf
subsystem occurs in a cpuhp teardown callback. In that case, if the
dynamic allocation fails (which is unlikely), we choose to return 0
rather than -ENOMEM to the caller, because @teardown is not supposed to
fail, and if it does, the system crashes:

static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
                            struct hlist_node *node)
{
        struct cpuhp_step *sp = cpuhp_get_step(state);
        int ret;

        /*
         * If there's nothing to do, we done.
         * Relies on the union for multi_instance.
         */
        if (cpuhp_step_empty(bringup, sp))
                return 0;
        /*
         * The non AP bound callbacks can fail on bringup. On teardown
         * e.g. module removal we crash for now.
         */
	#ifdef CONFIG_SMP
        if (cpuhp_is_ap_state(state))
                ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
        else
                ret = cpuhp_invoke_callback(cpu, state, bringup, node,
		NULL);
	#else
        ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
	#endif
        BUG_ON(ret && !bringup);
        return ret;
}
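
For illustration, each patch follows roughly this shape (a minimal,
untested sketch against a hypothetical foo_pmu driver; the names are
made up and not taken verbatim from any patch in the series):

static int foo_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
{
        struct foo_pmu *foo = hlist_entry_safe(node, struct foo_pmu, cpuhp_node);
        cpumask_var_t mask;     /* was: cpumask_t mask; i.e. on the stack */
        unsigned int target;

        if (cpu != foo->cpu)
                return 0;

        /* @teardown must not fail, so report success even if allocation fails. */
        if (!alloc_cpumask_var(&mask, GFP_KERNEL))
                return 0;

        if (cpumask_and(mask, cpumask_of_node(dev_to_node(foo->dev)), cpu_online_mask) &&
            cpumask_andnot(mask, mask, cpumask_of(cpu)))
                target = cpumask_any(mask);
        else
                target = cpumask_any_but(cpu_online_mask, cpu);

        free_cpumask_var(mask);

        if (target < nr_cpu_ids)
                foo_pmu_migrate(foo, target);

        return 0;
}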

Dawei Li (9):
  perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
    stack
  perf/arm-cmn: Avoid explicit cpumask var allocation from stack
  perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
  perf/arm_dsu: Avoid explicit cpumask var allocation from stack
  perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
  perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
  perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
  perf/qcom_l2: Avoid explicit cpumask var allocation from stack
  perf/thunder_x2: Avoid explicit cpumask var allocation from stack

 drivers/perf/alibaba_uncore_drw_pmu.c    | 13 +++++++++----
 drivers/perf/arm-cmn.c                   | 13 +++++++++----
 drivers/perf/arm_cspmu/arm_cspmu.c       | 13 +++++++++----
 drivers/perf/arm_dsu_pmu.c               | 18 +++++++++++++-----
 drivers/perf/dwc_pcie_pmu.c              | 17 +++++++++++------
 drivers/perf/hisilicon/hisi_pcie_pmu.c   | 15 ++++++++++-----
 drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
 drivers/perf/qcom_l2_pmu.c               | 15 ++++++++++-----
 drivers/perf/thunderx2_pmu.c             | 20 ++++++++++++--------
 9 files changed, 92 insertions(+), 45 deletions(-)


Thanks,

    Dawei

Comments

Mark Rutland April 2, 2024, 11:12 a.m. UTC | #1
On Tue, Apr 02, 2024 at 06:56:01PM +0800, Dawei Li wrote:
> Hi,
> 
> This series try to eliminate direct cpumask var allocation from stack
> for perf subsystem.
> 
> Direct/explicit allocation of cpumask on stack could be dangerous since
> it can lead to stack overflow for systems with big NR_CPUS or
> CONFIG_CPUMASK_OFFSTACK=y.
> 
> For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
> allocate cpumasks and increase supported CPUs to 512").
> 
> It's sort of a pattern that almost every cpumask var in perf subystem
> occurs in teardown callback of cpuhp. In which case, if dynamic
> allocation failed(which is unlikely), we choose return 0 rather than
> -ENOMEM to caller cuz:
> @teardown is not supposed to fail and if it does, system crashes:

... but we've left the system in an incorrect state, so that makes no sense.

As I commented on the first patch, NAK to dynamically allocating cpumasks in
the CPUHP callbacks. Please allocate the necessary cpumasks up-front when we
probe the PMU. At that time we can handle an allocation failure by cleaning up
and failing to probe the PMU, and then the CPUHP callbacks don't need to
allocate memory to offline a CPU...
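
For illustration, a minimal sketch of that approach (hypothetical foo_pmu
driver, illustrative names only; the real drivers differ in the details):

struct foo_pmu {
        struct pmu pmu;
        struct device *dev;
        unsigned int cpu;
        cpumask_var_t scratch_mask;     /* allocated once at probe time */
        struct hlist_node cpuhp_node;
};

static int foo_pmu_probe(struct platform_device *pdev)
{
        struct foo_pmu *foo = devm_kzalloc(&pdev->dev, sizeof(*foo), GFP_KERNEL);

        if (!foo)
                return -ENOMEM;

        /* Handle allocation failure here, where failing the probe is fine. */
        if (!zalloc_cpumask_var(&foo->scratch_mask, GFP_KERNEL))
                return -ENOMEM;

        foo->dev = &pdev->dev;
        /* ... register the PMU and the cpuhp instance; free the mask on error/remove ... */
        return 0;
}

static int foo_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
{
        struct foo_pmu *foo = hlist_entry_safe(node, struct foo_pmu, cpuhp_node);
        unsigned int target;

        if (cpu != foo->cpu)
                return 0;

        /* No allocation here: reuse the cpumask allocated at probe time. */
        if (cpumask_and(foo->scratch_mask, cpumask_of_node(dev_to_node(foo->dev)),
                        cpu_online_mask) &&
            cpumask_andnot(foo->scratch_mask, foo->scratch_mask, cpumask_of(cpu)))
                target = cpumask_any(foo->scratch_mask);
        else
                target = cpumask_any_but(cpu_online_mask, cpu);

        if (target < nr_cpu_ids)
                foo_pmu_migrate(foo, target);

        return 0;
}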

Also, for the titles it'd be better to say something like "avoid placing
cpumasks on the stack", because "explicit cpumask var allocation" sounds like
the use of alloc_cpumask_var().

Mark.

> 
> static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
>                             struct hlist_node *node)
> {
>         struct cpuhp_step *sp = cpuhp_get_step(state);
>         int ret;
> 
>         /*
>          * If there's nothing to do, we done.
>          * Relies on the union for multi_instance.
>          */
>         if (cpuhp_step_empty(bringup, sp))
>                 return 0;
>         /*
>          * The non AP bound callbacks can fail on bringup. On teardown
>          * e.g. module removal we crash for now.
>          */
> 	#ifdef CONFIG_SMP
>         if (cpuhp_is_ap_state(state))
>                 ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
>         else
>                 ret = cpuhp_invoke_callback(cpu, state, bringup, node,
> 		NULL);
> 	#else
>         ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> 	#endif
>         BUG_ON(ret && !bringup);
>         return ret;
> }
> 
> Dawei Li (9):
>   perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
>     stack
>   perf/arm-cmn: Avoid explicit cpumask var allocation from stack
>   perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
>   perf/arm_dsu: Avoid explicit cpumask var allocation from stack
>   perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
>   perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
>   perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
>   perf/qcom_l2: Avoid explicit cpumask var allocation from stack
>   perf/thunder_x2: Avoid explicit cpumask var allocation from stack
> 
>  drivers/perf/alibaba_uncore_drw_pmu.c    | 13 +++++++++----
>  drivers/perf/arm-cmn.c                   | 13 +++++++++----
>  drivers/perf/arm_cspmu/arm_cspmu.c       | 13 +++++++++----
>  drivers/perf/arm_dsu_pmu.c               | 18 +++++++++++++-----
>  drivers/perf/dwc_pcie_pmu.c              | 17 +++++++++++------
>  drivers/perf/hisilicon/hisi_pcie_pmu.c   | 15 ++++++++++-----
>  drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
>  drivers/perf/qcom_l2_pmu.c               | 15 ++++++++++-----
>  drivers/perf/thunderx2_pmu.c             | 20 ++++++++++++--------
>  9 files changed, 92 insertions(+), 45 deletions(-)
> 
> 
> Thanks,
> 
>     Dawei
> 
> -- 
> 2.27.0
>
Dawei Li April 2, 2024, 1:40 p.m. UTC | #2
Hi Mark,

Thanks for the quick review.

On Tue, Apr 02, 2024 at 12:12:50PM +0100, Mark Rutland wrote:
> On Tue, Apr 02, 2024 at 06:56:01PM +0800, Dawei Li wrote:
> > Hi,
> > 
> > This series try to eliminate direct cpumask var allocation from stack
> > for perf subsystem.
> > 
> > Direct/explicit allocation of cpumask on stack could be dangerous since
> > it can lead to stack overflow for systems with big NR_CPUS or
> > CONFIG_CPUMASK_OFFSTACK=y.
> > 
> > For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
> > allocate cpumasks and increase supported CPUs to 512").
> > 
> > It's sort of a pattern that almost every cpumask var in perf subystem
> > occurs in teardown callback of cpuhp. In which case, if dynamic
> > allocation failed(which is unlikely), we choose return 0 rather than
> > -ENOMEM to caller cuz:
> > @teardown is not supposed to fail and if it does, system crashes:
> 
> .. but we've left the system in an incorrect state, so that makes no sense.
> 
> As I commented on the first patch, NAK to dynamically allocating cpumasks in
> the CPUHP callbacks. Please allocate the necessry cpumasks up-front when we
> probe the PMU. At that time we can handle an allocation failure by cleaning up
> and failing to probe the PMU, and then the CPUHP callbacks don't need to
> allocate memory to offline a CPU...

Agreed, dynamic allocation in the callbacks leaves the system in an
inconsistent state.

My (original) alternative plan is simple but ugly: just make the cpumask
var _static_ and add an extra static lock to protect it.

The only difference between the solution above and your proposal is
static vs. dynamic allocation. CPUHP's teardown callback is supposed to
run in the targeted cpuhp thread in most cases, and it's racy. Even if
the cpumask var is wrapped in a dynamically allocated struct xxx_pmu,
it's still shared between different threads/contexts and needs proper
protection.

Simple as this (_untested_):

diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
index 7ef9c7e4836b..fa89c3db4d7d 100644
--- a/drivers/perf/arm-cmn.c
+++ b/drivers/perf/arm-cmn.c
@@ -1950,18 +1950,24 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
        struct arm_cmn *cmn;
        unsigned int target;
        int node;
-       cpumask_t mask;
+       static cpumask_t mask;
+       static DEFINE_SPINLOCK(cpumask_lock);

        cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
        if (cpu != cmn->cpu)
                return 0;

+       spin_lock(&cpumask_lock);
+
        node = dev_to_node(cmn->dev);
        if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
            cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
                target = cpumask_any(&mask);
        else
                target = cpumask_any_but(cpu_online_mask, cpu);
+
+       spin_unlock(&cpumask_lock);
+
        if (target < nr_cpu_ids)
                arm_cmn_migrate(cmn, target);
        return 0;

And yes, static allocation is evil :) 


Thanks,

    Dawei

> 
> Also, for the titles it'd be better to say something like "avoid placing
> cpumasks on the stack", because "explicit cpumask var allocation" sounds like
> the use of alloc_cpumask_var().

Sound great! I will update it.

> 
> Mark.
> 
> > 
> > static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
> >                             struct hlist_node *node)
> > {
> >         struct cpuhp_step *sp = cpuhp_get_step(state);
> >         int ret;
> > 
> >         /*
> >          * If there's nothing to do, we done.
> >          * Relies on the union for multi_instance.
> >          */
> >         if (cpuhp_step_empty(bringup, sp))
> >                 return 0;
> >         /*
> >          * The non AP bound callbacks can fail on bringup. On teardown
> >          * e.g. module removal we crash for now.
> >          */
> > 	#ifdef CONFIG_SMP
> >         if (cpuhp_is_ap_state(state))
> >                 ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
> >         else
> >                 ret = cpuhp_invoke_callback(cpu, state, bringup, node,
> > 		NULL);
> > 	#else
> >         ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> > 	#endif
> >         BUG_ON(ret && !bringup);
> >         return ret;
> > }
> > 
> > Dawei Li (9):
> >   perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
> >     stack
> >   perf/arm-cmn: Avoid explicit cpumask var allocation from stack
> >   perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
> >   perf/arm_dsu: Avoid explicit cpumask var allocation from stack
> >   perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
> >   perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
> >   perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
> >   perf/qcom_l2: Avoid explicit cpumask var allocation from stack
> >   perf/thunder_x2: Avoid explicit cpumask var allocation from stack
> > 
> >  drivers/perf/alibaba_uncore_drw_pmu.c    | 13 +++++++++----
> >  drivers/perf/arm-cmn.c                   | 13 +++++++++----
> >  drivers/perf/arm_cspmu/arm_cspmu.c       | 13 +++++++++----
> >  drivers/perf/arm_dsu_pmu.c               | 18 +++++++++++++-----
> >  drivers/perf/dwc_pcie_pmu.c              | 17 +++++++++++------
> >  drivers/perf/hisilicon/hisi_pcie_pmu.c   | 15 ++++++++++-----
> >  drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
> >  drivers/perf/qcom_l2_pmu.c               | 15 ++++++++++-----
> >  drivers/perf/thunderx2_pmu.c             | 20 ++++++++++++--------
> >  9 files changed, 92 insertions(+), 45 deletions(-)
> > 
> > 
> > Thanks,
> > 
> >     Dawei
> > 
> > -- 
> > 2.27.0
> > 
>
Mark Rutland April 2, 2024, 2:41 p.m. UTC | #3
On Tue, Apr 02, 2024 at 09:40:08PM +0800, Dawei Li wrote:
> Hi Mark,
> 
> Thanks for the quick review.
> 
> On Tue, Apr 02, 2024 at 12:12:50PM +0100, Mark Rutland wrote:
> > On Tue, Apr 02, 2024 at 06:56:01PM +0800, Dawei Li wrote:
> > > Hi,
> > > 
> > > This series try to eliminate direct cpumask var allocation from stack
> > > for perf subsystem.
> > > 
> > > Direct/explicit allocation of cpumask on stack could be dangerous since
> > > it can lead to stack overflow for systems with big NR_CPUS or
> > > CONFIG_CPUMASK_OFFSTACK=y.
> > > 
> > > For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
> > > allocate cpumasks and increase supported CPUs to 512").
> > > 
> > > It's sort of a pattern that almost every cpumask var in perf subystem
> > > occurs in teardown callback of cpuhp. In which case, if dynamic
> > > allocation failed(which is unlikely), we choose return 0 rather than
> > > -ENOMEM to caller cuz:
> > > @teardown is not supposed to fail and if it does, system crashes:
> > 
> > .. but we've left the system in an incorrect state, so that makes no sense.
> > 
> > As I commented on the first patch, NAK to dynamically allocating cpumasks in
> > the CPUHP callbacks. Please allocate the necessry cpumasks up-front when we
> > probe the PMU. At that time we can handle an allocation failure by cleaning up
> > and failing to probe the PMU, and then the CPUHP callbacks don't need to
> > allocate memory to offline a CPU...
> 
> Agreed that dynamically allocation in callbacks lead to inconsistency
> to system.
> 
> My (original)alternative plan is simple but ugly, just make cpumask var
> _static_ and add extra static lock to protect it.
> 
> The only difference between solution above and your proposal is static/
> dynamic alloction. CPUHP's teardown cb is supposed to run in targetted
> cpuhp thread for most cases, and it's racy. Even the cpumask var is
> wrapped in dynamically allocated struct xxx_pmu, it's still shareable
> between different threads/contexts and needs proper protection.

I was under the impression that the cpuhp callbacks were *strictly* serialised.
If that's not the case, the existing offlining callbacks are horrendously
broken.

Are you *certain* these can race?

Regardless, adding additional locking here is not ok.

> Simple as this(_untested_):
> 
> diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
> index 7ef9c7e4836b..fa89c3db4d7d 100644
> --- a/drivers/perf/arm-cmn.c
> +++ b/drivers/perf/arm-cmn.c
> @@ -1950,18 +1950,24 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
>         struct arm_cmn *cmn;
>         unsigned int target;
>         int node;
> -       cpumask_t mask;
> +       static cpumask_t mask;
> +       static DEFINE_SPINLOCK(cpumask_lock);
> 
>         cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
>         if (cpu != cmn->cpu)
>                 return 0;
> 
> +       spin_lock(&cpumask_lock);
> +
>         node = dev_to_node(cmn->dev);
>         if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>             cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>                 target = cpumask_any(&mask);
>         else
>                 target = cpumask_any_but(cpu_online_mask, cpu);
> +
> +       spin_unlock(&cpumask_lock);
> +
>         if (target < nr_cpu_ids)
>                 arm_cmn_migrate(cmn, target);
>         return 0;

Looking at this case, the only reason we need the mask is because it made the
logic a little easier to write. All we really want is to choose some CPU in the
intersection of two masks ignoring a specific CPU, and there was no helper
function to do that.

We can add a new helper to do that for us, which would avoid redundant work to
manipulate the entire mask, and it would make the existing code simpler.  I had
a series a few years back to add cpumask_any_and_but():

  https://lore.kernel.org/lkml/1486381132-5610-1-git-send-email-mark.rutland@arm.com/

... and that's easy to use here, e.g.

| diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
| index 7ef9c7e4836b7..c6bbd387ccf8b 100644
| --- a/drivers/perf/arm-cmn.c
| +++ b/drivers/perf/arm-cmn.c
| @@ -1950,17 +1950,15 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
|         struct arm_cmn *cmn;
|         unsigned int target;
|         int node;
| -       cpumask_t mask;
|  
|         cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
|         if (cpu != cmn->cpu)
|                 return 0;
|  
|         node = dev_to_node(cmn->dev);
| -       if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
| -           cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
| -               target = cpumask_any(&mask);
| -       else
| +       target = cpumask_any_and_but(cpu_online_mask, cpumask_of_node(node),
| +                                    cpu);
| +       if (target >= nr_cpu_ids)
|                 target = cpumask_any_but(cpu_online_mask, cpu);
|         if (target < nr_cpu_ids)
|                 arm_cmn_migrate(cmn, target);

It doesn't trivially rebase since the cpumask code has changed a fair amount,
but I've managed to do that locally, and I can send that out as a
seven-years-late v2 if it's useful.

From a quick scan, it looks like that'd handle all cases in this series. Are
there any patterns in this series for which that would not be sufficient?

Mark.

> 
> And yes, static allocation is evil :) 
> 
> 
> Thanks,
> 
>     Dawei
> 
> > 
> > Also, for the titles it'd be better to say something like "avoid placing
> > cpumasks on the stack", because "explicit cpumask var allocation" sounds like
> > the use of alloc_cpumask_var().
> 
> Sound great! I will update it.
> 
> > 
> > Mark.
> > 
> > > 
> > > static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
> > >                             struct hlist_node *node)
> > > {
> > >         struct cpuhp_step *sp = cpuhp_get_step(state);
> > >         int ret;
> > > 
> > >         /*
> > >          * If there's nothing to do, we done.
> > >          * Relies on the union for multi_instance.
> > >          */
> > >         if (cpuhp_step_empty(bringup, sp))
> > >                 return 0;
> > >         /*
> > >          * The non AP bound callbacks can fail on bringup. On teardown
> > >          * e.g. module removal we crash for now.
> > >          */
> > > 	#ifdef CONFIG_SMP
> > >         if (cpuhp_is_ap_state(state))
> > >                 ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
> > >         else
> > >                 ret = cpuhp_invoke_callback(cpu, state, bringup, node,
> > > 		NULL);
> > > 	#else
> > >         ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> > > 	#endif
> > >         BUG_ON(ret && !bringup);
> > >         return ret;
> > > }
> > > 
> > > Dawei Li (9):
> > >   perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
> > >     stack
> > >   perf/arm-cmn: Avoid explicit cpumask var allocation from stack
> > >   perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
> > >   perf/arm_dsu: Avoid explicit cpumask var allocation from stack
> > >   perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
> > >   perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
> > >   perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
> > >   perf/qcom_l2: Avoid explicit cpumask var allocation from stack
> > >   perf/thunder_x2: Avoid explicit cpumask var allocation from stack
> > > 
> > >  drivers/perf/alibaba_uncore_drw_pmu.c    | 13 +++++++++----
> > >  drivers/perf/arm-cmn.c                   | 13 +++++++++----
> > >  drivers/perf/arm_cspmu/arm_cspmu.c       | 13 +++++++++----
> > >  drivers/perf/arm_dsu_pmu.c               | 18 +++++++++++++-----
> > >  drivers/perf/dwc_pcie_pmu.c              | 17 +++++++++++------
> > >  drivers/perf/hisilicon/hisi_pcie_pmu.c   | 15 ++++++++++-----
> > >  drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
> > >  drivers/perf/qcom_l2_pmu.c               | 15 ++++++++++-----
> > >  drivers/perf/thunderx2_pmu.c             | 20 ++++++++++++--------
> > >  9 files changed, 92 insertions(+), 45 deletions(-)
> > > 
> > > 
> > > Thanks,
> > > 
> > >     Dawei
> > > 
> > > -- 
> > > 2.27.0
> > > 
> >
Dawei Li April 3, 2024, 10:41 a.m. UTC | #4
On Tue, Apr 02, 2024 at 03:41:51PM +0100, Mark Rutland wrote:
> On Tue, Apr 02, 2024 at 09:40:08PM +0800, Dawei Li wrote:
> > Hi Mark,
> > 
> > Thanks for the quick review.
> > 
> > On Tue, Apr 02, 2024 at 12:12:50PM +0100, Mark Rutland wrote:
> > > On Tue, Apr 02, 2024 at 06:56:01PM +0800, Dawei Li wrote:
> > > > Hi,
> > > > 
> > > > This series try to eliminate direct cpumask var allocation from stack
> > > > for perf subsystem.
> > > > 
> > > > Direct/explicit allocation of cpumask on stack could be dangerous since
> > > > it can lead to stack overflow for systems with big NR_CPUS or
> > > > CONFIG_CPUMASK_OFFSTACK=y.
> > > > 
> > > > For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
> > > > allocate cpumasks and increase supported CPUs to 512").
> > > > 
> > > > It's sort of a pattern that almost every cpumask var in perf subystem
> > > > occurs in teardown callback of cpuhp. In which case, if dynamic
> > > > allocation failed(which is unlikely), we choose return 0 rather than
> > > > -ENOMEM to caller cuz:
> > > > @teardown is not supposed to fail and if it does, system crashes:
> > > 
> > > .. but we've left the system in an incorrect state, so that makes no sense.
> > > 
> > > As I commented on the first patch, NAK to dynamically allocating cpumasks in
> > > the CPUHP callbacks. Please allocate the necessry cpumasks up-front when we
> > > probe the PMU. At that time we can handle an allocation failure by cleaning up
> > > and failing to probe the PMU, and then the CPUHP callbacks don't need to
> > > allocate memory to offline a CPU...
> > 
> > Agreed that dynamically allocation in callbacks lead to inconsistency
> > to system.
> > 
> > My (original)alternative plan is simple but ugly, just make cpumask var
> > _static_ and add extra static lock to protect it.
> > 
> > The only difference between solution above and your proposal is static/
> > dynamic alloction. CPUHP's teardown cb is supposed to run in targetted
> > cpuhp thread for most cases, and it's racy. Even the cpumask var is
> > wrapped in dynamically allocated struct xxx_pmu, it's still shareable
> > between different threads/contexts and needs proper protection.
> 
> I was under the impression that the cpuhp callbacks were *strictly* serialised.
> If that's not the case, the existing offlining callbacks are horrendously
> broken.
> 
> Are you *certain* these can race?
> 
> Regardless, adding additional locking here is not ok.
> 
> > Simple as this(_untested_):
> > 
> > diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
> > index 7ef9c7e4836b..fa89c3db4d7d 100644
> > --- a/drivers/perf/arm-cmn.c
> > +++ b/drivers/perf/arm-cmn.c
> > @@ -1950,18 +1950,24 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
> >         struct arm_cmn *cmn;
> >         unsigned int target;
> >         int node;
> > -       cpumask_t mask;
> > +       static cpumask_t mask;
> > +       static DEFINE_SPINLOCK(cpumask_lock);
> > 
> >         cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
> >         if (cpu != cmn->cpu)
> >                 return 0;
> > 
> > +       spin_lock(&cpumask_lock);
> > +
> >         node = dev_to_node(cmn->dev);
> >         if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
> >             cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
> >                 target = cpumask_any(&mask);
> >         else
> >                 target = cpumask_any_but(cpu_online_mask, cpu);
> > +
> > +       spin_unlock(&cpumask_lock);
> > +
> >         if (target < nr_cpu_ids)
> >                 arm_cmn_migrate(cmn, target);
> >         return 0;
> 
> Looking at this case, the only reason we need the mask is because it made the
> logic a little easier to write. All we really want is to choose some CPU in the
> intersection of two masks ignoring a specific CPU, and there was no helper
> function to do that.
> 
> We can add a new helper to do that for us, which would avoid redundant work to
> manipulate the entire mask, and it would make the existing code simpler.  I had
> a series a few years back to add cpumask_any_and_but():
> 
>   https://lore.kernel.org/lkml/1486381132-5610-1-git-send-email-mark.rutland@arm.com/

Sounds like a perfect idea!

Actually, I am re-implementing the series on top of your
seven-years-late-yet-still-helpful patch, with a minor update to it:

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 1c29947db848..121f3ac757ff 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -388,6 +388,29 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
        return i;
 }

+/**
+ * cpumask_any_and_but - pick a "random" cpu from *mask1 & *mask2, but not this one.
+ * @mask1: the first input cpumask
+ * @mask2: the second input cpumask
+ * @cpu: the cpu to ignore
+ *
+ * Returns >= nr_cpu_ids if no cpus set.
+ */
+static inline
+unsigned int cpumask_any_and_but(const struct cpumask *mask1,
+                                const struct cpumask *mask2,
+                                unsigned int cpu)
+{
+       unsigned int i;
+
+       cpumask_check(cpu);
+       i = cpumask_first_and(mask1, mask2);
+       if (i != cpu)
+               return i;
+
+       return cpumask_next_and(cpu, mask1, mask2);
+}
+
 /**
  * cpumask_nth - get the Nth cpu in a cpumask
  * @srcp: the cpumask pointer

Changes from your original version:
1 Moved it to cpumask.h, just like the other helpers.
2 Converted the return value to unsigned int.
3 Removed EXPORT_SYMBOL, for obvious reasons.

I will respin V2 as a whole as soon as possible.

Thanks,

    Dawei

> 
> .. and that's easy to use here, e.g.
> 
> | diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
> | index 7ef9c7e4836b7..c6bbd387ccf8b 100644
> | --- a/drivers/perf/arm-cmn.c
> | +++ b/drivers/perf/arm-cmn.c
> | @@ -1950,17 +1950,15 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
> |         struct arm_cmn *cmn;
> |         unsigned int target;
> |         int node;
> | -       cpumask_t mask;
> |  
> |         cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
> |         if (cpu != cmn->cpu)
> |                 return 0;
> |  
> |         node = dev_to_node(cmn->dev);
> | -       if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
> | -           cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
> | -               target = cpumask_any(&mask);
> | -       else
> | +       target = cpumask_any_and_but(cpu_online_mask, cpumask_of_node(node),
> | +                                    cpu);
> | +       if (target >= nr_cpu_ids)
> |                 target = cpumask_any_but(cpu_online_mask, cpu);
> |         if (target < nr_cpu_ids)
> |                 arm_cmn_migrate(cmn, target);
> 
> It doesn't trivially rebase since the cpumask code has changed a fair amount,
> but I've managed to do that locally, and I can send that out as a
> seven-years-late v2 if it's useful.
> 
> From a quick scan, it looks like that'd handle all cases in this series. Are
> there any patterns in this series for which that would not be sufficient?
> 
> Mark.
> 
> > 
> > And yes, static allocation is evil :) 
> > 
> > 
> > Thanks,
> > 
> >     Dawei
> > 
> > > 
> > > Also, for the titles it'd be better to say something like "avoid placing
> > > cpumasks on the stack", because "explicit cpumask var allocation" sounds like
> > > the use of alloc_cpumask_var().
> > 
> > Sound great! I will update it.
> > 
> > > 
> > > Mark.
> > > 
> > > > 
> > > > static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
> > > >                             struct hlist_node *node)
> > > > {
> > > >         struct cpuhp_step *sp = cpuhp_get_step(state);
> > > >         int ret;
> > > > 
> > > >         /*
> > > >          * If there's nothing to do, we done.
> > > >          * Relies on the union for multi_instance.
> > > >          */
> > > >         if (cpuhp_step_empty(bringup, sp))
> > > >                 return 0;
> > > >         /*
> > > >          * The non AP bound callbacks can fail on bringup. On teardown
> > > >          * e.g. module removal we crash for now.
> > > >          */
> > > > 	#ifdef CONFIG_SMP
> > > >         if (cpuhp_is_ap_state(state))
> > > >                 ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
> > > >         else
> > > >                 ret = cpuhp_invoke_callback(cpu, state, bringup, node,
> > > > 		NULL);
> > > > 	#else
> > > >         ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> > > > 	#endif
> > > >         BUG_ON(ret && !bringup);
> > > >         return ret;
> > > > }
> > > > 
> > > > Dawei Li (9):
> > > >   perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
> > > >     stack
> > > >   perf/arm-cmn: Avoid explicit cpumask var allocation from stack
> > > >   perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
> > > >   perf/arm_dsu: Avoid explicit cpumask var allocation from stack
> > > >   perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
> > > >   perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
> > > >   perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
> > > >   perf/qcom_l2: Avoid explicit cpumask var allocation from stack
> > > >   perf/thunder_x2: Avoid explicit cpumask var allocation from stack
> > > > 
> > > >  drivers/perf/alibaba_uncore_drw_pmu.c    | 13 +++++++++----
> > > >  drivers/perf/arm-cmn.c                   | 13 +++++++++----
> > > >  drivers/perf/arm_cspmu/arm_cspmu.c       | 13 +++++++++----
> > > >  drivers/perf/arm_dsu_pmu.c               | 18 +++++++++++++-----
> > > >  drivers/perf/dwc_pcie_pmu.c              | 17 +++++++++++------
> > > >  drivers/perf/hisilicon/hisi_pcie_pmu.c   | 15 ++++++++++-----
> > > >  drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
> > > >  drivers/perf/qcom_l2_pmu.c               | 15 ++++++++++-----
> > > >  drivers/perf/thunderx2_pmu.c             | 20 ++++++++++++--------
> > > >  9 files changed, 92 insertions(+), 45 deletions(-)
> > > > 
> > > > 
> > > > Thanks,
> > > > 
> > > >     Dawei
> > > > 
> > > > -- 
> > > > 2.27.0
> > > > 
> > > 
>
Mark Rutland April 3, 2024, 11:10 a.m. UTC | #5
On Wed, Apr 03, 2024 at 06:41:23PM +0800, Dawei Li wrote:
> On Tue, Apr 02, 2024 at 03:41:51PM +0100, Mark Rutland wrote:
> > Looking at this case, the only reason we need the mask is because it made the
> > logic a little easier to write. All we really want is to choose some CPU in the
> > intersection of two masks ignoring a specific CPU, and there was no helper
> > function to do that.
> > 
> > We can add a new helper to do that for us, which would avoid redundant work to
> > manipulate the entire mask, and it would make the existing code simpler.  I had
> > a series a few years back to add cpumask_any_and_but():
> > 
> >   https://lore.kernel.org/lkml/1486381132-5610-1-git-send-email-mark.rutland@arm.com/
> 
> Sounds a perfect idea!
> 
> Actually I am re-implementing new series on top of your seven-years-late-yet-still-helpful
> patch, with minor update on it:
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 1c29947db848..121f3ac757ff 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -388,6 +388,29 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
>         return i;
>  }
> 
> +/**
> + * cpumask_any_and_but - pick a "random" cpu from *mask1 & *mask2, but not this one.
> + * @mask1: the first input cpumask
> + * @mask2: the second input cpumask
> + * @cpu: the cpu to ignore
> + *
> + * Returns >= nr_cpu_ids if no cpus set.
> + */
> +static inline
> +unsigned int cpumask_any_and_but(const struct cpumask *mask1,
> +                                const struct cpumask *mask2,
> +                                unsigned int cpu)
> +{
> +       unsigned int i;
> +
> +       cpumask_check(cpu);
> +       i = cpumask_first_and(mask1, mask2);
> +       if (i != cpu)
> +               return i;
> +
> +       return cpumask_next_and(cpu, mask1, mask2);
> +}
> +
>  /**
>   * cpumask_nth - get the Nth cpu in a cpumask
>   * @srcp: the cpumask pointer
> 
> Change from your original version:
> 1 Moved to cpumask.h, just like other helpers.
> 2 Return value converted to unsigned int.
> 3 Remove EXPORT_SYMBOL, for obvious reason.

That's exactly how I rebased it locally, so that looks good to me!

> I will respin V2 as a whole as soon as possible.

Great!

Thanks,
Mark.