[RFC,0/6] Sparse HART id support


From: Atish Patra <atishp@atishpatra.org>
To: linux-kernel@vger.kernel.org
Cc: Atish Patra <atishp@atishpatra.org>, Alexandre Ghiti <alex@ghiti.fr>,
 Anup Patel <anup.patel@wdc.com>, Greentime Hu <greentime.hu@sifive.com>,
 Guo Ren <guoren@linux.alibaba.com>,
 Heinrich Schuchardt <xypron.glpk@gmx.de>, Ingo Molnar <mingo@kernel.org>,
 Jisheng Zhang <jszhang@kernel.org>, kvm-riscv@lists.infradead.org,
 kvm@vger.kernel.org, linux-riscv@lists.infradead.org,
 Marc Zyngier <maz@kernel.org>, Nanyong Sun <sunnanyong@huawei.com>,
 Nick Kossifidis <mick@ics.forth.gr>, Palmer Dabbelt <palmer@dabbelt.com>,
 Paul Walmsley <paul.walmsley@sifive.com>,
 Pekka Enberg <penberg@kernel.org>, Vincent Chen <vincent.chen@sifive.com>,
 Vitaly Wool <vitaly.wool@konsulko.com>
Subject: [RFC 0/6] Sparse HART id support 
Date: Fri,  3 Dec 2021 16:20:32 -0800
Message-Id: <20211204002038.113653-1-atishp@atishpatra.org>
MIME-Version: 1.0
Precedence: list
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org>
Errors-To: 
 linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org

Message

Atish Patra Dec. 4, 2021, 12:20 a.m. UTC

Currently, sparse hartids are not supported on Linux RISC-V for the following
reasons:
1. Both the spinwait and ordered booting methods use
   __cpu_up_stack/task_pointer, which are arrays of size NR_CPUS.
2. During early boot, any hart with a hartid greater than NR_CPUS is not
   booted at all.
3. riscv_cpuid_to_hartid_mask uses struct cpumask to generate the hartid
   bitmap.
4. The SBI v0.2 implementation uses NR_CPUS as the maximum hartid number while
   generating the hartmask.

To support sparse hartids, the hartid and NR_CPUS need to be decoupled; tying
them together was logically incorrect anyway. NR_CPUS represents the maximum
number of logical CPUs configured in the kernel, while the hartid is the
physical hart id stored in the mhartid CSR defined by the privileged
specification. Thus, a hartid can be much greater than any logical cpuid.
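
The distinction above can be sketched in a few lines of user-space C. This is
only an illustration: the table contents, INVALID_HARTID, and the helper names
hartid_of() and cpuid_of() are assumptions made for the example, not kernel
code. The point is that only logical CPU ids are ever used as array indexes,
so a hartid such as 0x200 is harmless even with NR_CPUS=4.

```c
#include <stddef.h>

#define NR_CPUS 4                  /* illustrative build-time limit */
#define INVALID_HARTID (~0UL)      /* hypothetical "no such hart" marker */

/* Logical CPU id -> physical hartid. Example values only; note the gaps. */
static unsigned long cpuid_to_hartid[NR_CPUS] = { 0x10, 0x11, 0x200, 0x201 };

/* Safe: the index is a logical CPU id, which is bounded by NR_CPUS. */
static unsigned long hartid_of(unsigned int cpu)
{
	return cpu < NR_CPUS ? cpuid_to_hartid[cpu] : INVALID_HARTID;
}

/* Reverse lookup by search, so a hartid is never used as an array index. */
static int cpuid_of(unsigned long hartid)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpuid_to_hartid[cpu] == hartid)
			return (int)cpu;
	return -1; /* hart is not known to this kernel configuration */
}
```

With this table, hartid_of(2) yields 0x200, while an array of NR_CPUS entries
indexed directly by that hartid would be accessed far out of bounds.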

Currently, we have two booting methods. In ordered booting, the boot hart
brings up each non-booting hart one by one using the SBI HSM extension.
The spinwait booting method relies on harts jumping into the Linux kernel in
arbitrary order, with the boot hart selected by a lottery. All other harts
keep spinning on __cpu_up_stack/task_pointer until the boot hart initializes
the data. Both methods rely on __cpu_up_stack/task_pointer to set up the
stack/task pointer. The spinwait method is mostly used to support older
firmware without the SBI HSM extension and M-mode Linux. The ordered booting
method is the preferred method for booting general Linux because it can
support cpu hotplug and kexec.
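
As a rough illustration of why ordered booting does not need a per-hart array,
the sketch below models the SBI HSM hart start call, which takes a hartid, a
start address, and an opaque value that the started hart receives in a1. The
struct layout, mock_hsm_hart_start(), secondary_entry(), and boot_secondary()
are hypothetical names for this example; the mock simply calls the entry point
directly instead of starting a real hart.

```c
#include <stdint.h>

/* Hypothetical per-hart boot context; not the kernel's actual layout. */
struct boot_data {
	void *stack_ptr;
	void *task_ptr;
};

/* The started hart sees a0 = hartid and a1 = opaque (an XLEN-wide value). */
typedef void (*hart_entry_t)(unsigned long hartid, uintptr_t opaque);

/* Mock of the firmware side of HSM hart start: runs the entry in place. */
static int mock_hsm_hart_start(unsigned long hartid, hart_entry_t entry,
			       uintptr_t opaque)
{
	entry(hartid, opaque);
	return 0;
}

/* What the secondary hart observed, recorded for inspection. */
static unsigned long seen_hartid;
static void *seen_stack, *seen_task;

/* Secondary entry: everything it needs arrives through opaque. */
static void secondary_entry(unsigned long hartid, uintptr_t opaque)
{
	const struct boot_data *bd = (const struct boot_data *)opaque;

	seen_hartid = hartid;
	seen_stack = bd->stack_ptr;
	seen_task = bd->task_ptr;
}

/* Boot hart side: one context suffices because harts start one by one. */
static int boot_secondary(unsigned long hartid, void *stack, void *task)
{
	static struct boot_data bd;

	bd.stack_ptr = stack;
	bd.task_ptr = task;
	return mock_hsm_hart_start(hartid, secondary_entry, (uintptr_t)&bd);
}
```

Nothing here is indexed by hartid, so the value of the hartid places no
constraint on any array size.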

The first patch modifies the ordered booting method to use an opaque parameter
already available in the HSM start API to set up the stack/task pointer. The
third patch resolves issue #1 by limiting the use of
__cpu_up_stack/task_pointer to the spinwait-specific booting method. The
fourth and fifth patches move the entire hart lottery selection and the
spinwait method to a separate config that can be disabled if required. This
solves issue #2. The sixth patch solves issues #3 and #4 by removing
riscv_cpuid_to_hartid_mask completely. All the SBI APIs are now passed a
pointer to struct cpumask directly, and the SBI implementation takes care of
generating the hart bitmap from the cpumask.
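
One way to generate hart bitmaps without assuming hartid < NR_CPUS is to batch
hartids into (hart_mask, hart_mask_base) pairs, which is the argument format
SBI v0.2 defines. The sketch below is a user-space approximation under stated
assumptions, not the code from patch 6: mock_sbi_call(), send_to_harts(), and
the calls[] log are hypothetical, and the input is a plain array of hartids
rather than a struct cpumask.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)
#define MAX_CALLS 8

struct hart_mask_call {
	unsigned long hmask; /* bit i set => hart (hbase + i) is targeted */
	unsigned long hbase; /* hartid represented by bit 0 of hmask */
};

static struct hart_mask_call calls[MAX_CALLS];
static size_t ncalls;

/* Stand-in for one SBI call taking (hart_mask, hart_mask_base). */
static void mock_sbi_call(unsigned long hmask, unsigned long hbase)
{
	if (ncalls < MAX_CALLS)
		calls[ncalls++] = (struct hart_mask_call){ hmask, hbase };
}

/* Issue as many calls as needed to cover every hartid in the list. */
static void send_to_harts(const unsigned long *hartids, size_t n)
{
	unsigned long hmask = 0, hbase = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long hartid = hartids[i];

		/* Flush when the hartid falls outside the current window. */
		if (hmask && (hartid < hbase ||
			      hartid - hbase >= BITS_PER_ULONG)) {
			mock_sbi_call(hmask, hbase);
			hmask = 0;
		}
		if (!hmask)
			hbase = hartid; /* start a new window here */
		hmask |= 1UL << (hartid - hbase);
	}
	if (hmask)
		mock_sbi_call(hmask, hbase);
}
```

For the hartids 0x10, 0x11, 0x200, 0x201 this issues two calls, one with base
0x10 and one with base 0x200, each with mask 0x3. The cost is an extra call
per window, which only matters when hartids are spread widely.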

It is not trivial to support sparse hartids with the spinwait booting method,
and there are no use cases that need sparse hartids with spinwait either.
Any platform with sparse hartids will probably require more advanced features
such as cpu hotplug and kexec. Thus, the series supports sparse hartids via
the ordered booting method only. To maintain backward compatibility, the
spinwait booting method is currently enabled in defconfig so that M-mode Linux
will continue to work. Any platform that requires sparse hartids must disable
the spinwait method.

This series also fixes the out-of-bounds access error[1] reported by Geert.
The issue can be reproduced with SMP booting with NR_CPUS=4 on platforms with
discontiguous hart numbering (HiFive Unleashed/Unmatched and PolarFire).
The spinwait method should also be disabled for configurations where the
NR_CPUS value is less than the maximum hartid on the platform.

[1] https://lore.kernel.org/lkml/CAMuHMdUPWOjJfJohxLJefHOrJBtXZ0xfHQt4=hXpUXnasiN+AQ@mail.gmail.com/#t

The series is based on the queue branch of kvm-riscv as it has KVM-related
changes as well. I have tested it on HiFive Unmatched and QEMU.

Atish Patra (6):
RISC-V: Avoid using per cpu array for ordered booting
RISC-V: Do not print the SBI version during HSM extension boot print
RISC-V: Use __cpu_up_stack/task_pointer only for spinwait method
RISC-V: Move the entire hart selection via lottery to SMP
RISC-V: Move spinwait booting method to its own config
RISC-V: Do not use cpumask data structure for hartid bitmap

--
2.33.1

Comments

Rob Herring (Arm) Dec. 6, 2021, 3:28 p.m. UTC | #1

On Fri, Dec 03, 2021 at 04:20:32PM -0800, Atish Patra wrote:
> Currently, sparse hartid is not supported for Linux RISC-V for the following
> reasons.
> 1. Both spinwait and ordered booting method uses __cpu_up_stack/task_pointer
>    which is an array size of NR_CPUs.
> 2. During early booting, any hartid greater than NR_CPUs are not booted at all.
> 3. riscv_cpuid_to_hartid_mask uses struct cpumask for generating hartid bitmap.
> 4. SBI v0.2 implementation uses NR_CPUs as the maximum hartid number while
>    generating hartmask.
> 
> In order to support sparse hartid, the hartid & NR_CPUS needs to be disassociated
> which was logically incorrect anyways. NR_CPUs represent the maximum logical
> CPU id configured in the kernel while the hartid represent the physical hartid
> stored in mhartid CSR defined by the privilege specification. Thus, hartid
> can have much greater value than logical cpuid.

We already have a couple of architectures with logical to physical CPU 
id maps. See cpu_logical_map. Can we make that common and use it here? 
That would also possibly allow for common populating the map from DT.

Rob

Atish Patra Dec. 13, 2021, 9:27 p.m. UTC | #2

On Mon, Dec 6, 2021 at 7:28 AM Rob Herring <robh@kernel.org> wrote:
> We already have a couple of architectures with logical to physical CPU
> id maps. See cpu_logical_map. Can we make that common and use it here?

Yes. We can move cpu_logical_map (which is a macro) and
__cpu_logical_map (the actual array of size NR_CPUS)
to common code so that all architectures can use them instead of
defining them separately.

> That would also possibly allow for common populating the map from DT.
>

I didn't understand this part. The mapping is populated at run time
[1] because the boot cpu can be any hart in RISC-V.
The booting hart is mapped to cpu 0. All others are mapped
based on how the cpu nodes are laid out in the DT.
Do you mean we can move the second part to common code as well?

[1] RISC-V: https://elixir.bootlin.com/linux/v5.16-rc5/source/arch/riscv/kernel/smpboot.c#L102
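
The run-time mapping described above can be approximated as follows. This is a
minimal user-space sketch, not the code at [1]: populate_logical_map(), the
logical_map[] array, and INVALID_HARTID are hypothetical, and the DT is
modelled as a plain array of 'reg' values in node order.

```c
#include <stddef.h>

#define NR_CPUS 4                  /* illustrative build-time limit */
#define INVALID_HARTID (~0UL)      /* hypothetical "unassigned" marker */

static unsigned long logical_map[NR_CPUS];

/*
 * dt_hartids holds the 'reg' value of each cpu node in DT order.
 * Returns the number of logical CPUs that received a hartid.
 */
static unsigned int populate_logical_map(const unsigned long *dt_hartids,
					 size_t n, unsigned long boot_hartid)
{
	unsigned int cpuid;

	for (cpuid = 0; cpuid < NR_CPUS; cpuid++)
		logical_map[cpuid] = INVALID_HARTID;

	/* The boot hart is always logical CPU 0, wherever it sits in DT. */
	logical_map[0] = boot_hartid;
	cpuid = 1;

	for (size_t i = 0; i < n; i++) {
		if (dt_hartids[i] == boot_hartid)
			continue; /* already mapped to CPU 0 */
		if (cpuid >= NR_CPUS)
			continue; /* more harts than logical CPUs: skip */
		logical_map[cpuid++] = dt_hartids[i];
	}
	return cpuid;
}
```

With DT order 0x10, 0x11, 0x200, 0x201, 0x300 and boot hart 0x200, the map
becomes { 0x200, 0x10, 0x11, 0x201 } and hart 0x300 is left out because
NR_CPUS is 4.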

> Rob

Rob Herring (Arm) Dec. 13, 2021, 11:11 p.m. UTC | #3

On Mon, Dec 13, 2021 at 3:27 PM Atish Patra <atishp@atishpatra.org> wrote:
>
> On Mon, Dec 6, 2021 at 7:28 AM Rob Herring <robh@kernel.org> wrote:
> >
> > We already have a couple of architectures with logical to physical CPU
> > id maps. See cpu_logical_map. Can we make that common and use it here?
>
> Yes. We can move the cpu_logical_map(which is a macro) &
> __cpu_logical_map(actual array with NR_CPUS size)
> to common code so that all the architecture can use it instead of
> defining it separately.

IIRC, the macro is what varies by arch and I would move to static
inlines rather than supporting:

cpu_logical_map(cpu) = 0xdeadbeef;
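
To make the difference concrete, here is a small sketch of the two styles. It
is an illustration only: map_storage[], cpu_logical_map_lvalue(), and the
accessor names are assumptions for this example and do not describe any
particular architecture's definitions.

```c
#define NR_CPUS 4 /* illustrative build-time limit */

static unsigned long map_storage[NR_CPUS];

/* Macro style: expands to an lvalue, so any caller can assign through it. */
#define cpu_logical_map_lvalue(cpu) (map_storage[(cpu)])

/* Inline style: reads and writes go through separate, typed functions. */
static inline unsigned long cpu_logical_map(unsigned int cpu)
{
	return map_storage[cpu];
}

static inline void set_cpu_logical_map(unsigned int cpu, unsigned long hwid)
{
	map_storage[cpu] = hwid;
}
```

With the inline form, an assignment such as cpu_logical_map(cpu) = 0xdeadbeef
no longer compiles, and every writer is easy to find by searching for the
setter.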

>
> > That would also possibly allow for common populating the map from DT.
> >
>
> I didn't understand this part. The mapping is populated at run time
> [1] as the boot cpu can be any hart in RISC-V.
> That booting hart will be mapped to cpu 0. All others will be mapped
> based on how the cpu node is laid out in the DT.
> Do you mean we can move the 2nd part to common code as well ?

Yes, as the DT platforms just loop thru the cpu nodes and fill the
logical map based on 'reg', I don't think that needs to be per arch
once we have a common map. But not asking for that now.

Rob

Atish Patra Dec. 14, 2021, 12:58 a.m. UTC | #4

On Mon, Dec 13, 2021 at 3:11 PM Rob Herring <robh@kernel.org> wrote:
>
> On Mon, Dec 13, 2021 at 3:27 PM Atish Patra <atishp@atishpatra.org> wrote:
> >
> > On Mon, Dec 6, 2021 at 7:28 AM Rob Herring <robh@kernel.org> wrote:
> > >
> > > We already have a couple of architectures with logical to physical CPU
> > > id maps. See cpu_logical_map. Can we make that common and use it here?
> >
> > Yes. We can move the cpu_logical_map(which is a macro) &
> > __cpu_logical_map(actual array with NR_CPUS size)
> > to common code so that all the architecture can use it instead of
> > defining it separately.
>
> IIRC, the macro is what varies by arch and I would move to static
> inlines rather than supporting:
>
> cpu_logical_map(cpu) = 0xdeadbeef;
>

Sounds good.


> >
> > > That would also possibly allow for common populating the map from DT.
> > >
> >
> > I didn't understand this part. The mapping is populated at run time
> > [1] as the boot cpu can be any hart in RISC-V.
> > That booting hart will be mapped to cpu 0. All others will be mapped
> > based on how the cpu node is laid out in the DT.
> > Do you mean we can move the 2nd part to common code as well ?
>
> Yes, as the DT platforms just loop thru the cpu nodes and fill the
> logical map based on 'reg', I don't think that needs to be per arch
> once we have a common map. But not asking for that now.
>

It would make sense to keep them together in a series. I can take a stab
at it once this series is merged so that we don't end up with
conflicting changes between the two series.

> Rob