Message ID | 20250217-nova_timer-v1-0-78c5ace2d987@nvidia.com (mailing list archive)
---|---
Series | gpu: nova-core: add basic timer subdevice implementation
On Mon, Feb 17, 2025 at 11:04:45PM +0900, Alexandre Courbot wrote: > Hi everyone, > > This short RFC is based on top of Danilo's initial driver stub series > [1] and has for goal to initiate discussions and hopefully some design > decisions using the simplest subdevice of the GPU (the timer) as an > example, before implementing more devices allowing the GPU > initialization sequence to progress (Falcon being the logical next step > so we can get the GSP rolling). > > It is kept simple and short for that purpose, and to avoid bumping into > a wall with much more device code because my assumptions were incorrect. > > This is my first time trying to write Rust kernel code, and some of my > questions below are probably due to me not understanding yet how to use > the core kernel interfaces. So before going further I thought it would > make sense to raise the most obvious questions that came to my mind > while writing this draft: > > - Where and how to store subdevices. The timer device is currently a > direct member of the GPU structure. It might work for GSP devices > which are IIUC supposed to have at least a few fixed devices required > to bring the GSP up ; but as a general rule this probably won't scale > as not all subdevices are present on all GPU variants, or in the same > numbers. So we will probably need to find an equivalent to the > `subdev` linked list in Nouveau. > > - BAR sharing between subdevices. Right now each subdevice gets access > to the full BAR range. I am wondering whether we could not split it > into the relevant slices for each-subdevice, and transfer ownership of > each slice to the device that is supposed to use it. That way each > register would have a single owner, which is arguably safer - but > maybe not as flexible as we will need down the road? > > - On a related note, since the BAR is behind a Devres its availability > must first be secured before any hardware access using try_access(). > Doing this on a per-register or per-operation basis looks overkill, so > all methods that access the BAR take a reference to it, allowing to > call try_access() from the highest-level caller and thus reducing the > number of times this needs to be performed. Doing so comes at the cost > of an extra argument to most subdevice methods ; but also with the > benefit that we don't need to put the BAR behind another Arc and share > it across all subdevices. I don't know which design is better here, > and input would be very welcome. > > - We will probably need sometime like a `Subdevice` trait or something > down the road, but I'll wait until we have more than one subdevice to > think about it. It might make sense to go with a full-blown aux bus. Gives you a lot of structures and answers to these questions, but also might be way too much. So just throwing this a consideration in here. -Sima > > The first 2 patches are small additions to the core Rust modules, that > the following patches make use of and which might be useful for other > drivers as well. The last patch is the naive implementation of the timer > device. I don't expect it to stay this way at all, so please point out > all the deficiencies in this very early code! 
:) > > [1] https://lore.kernel.org/nouveau/20250209173048.17398-1-dakr@kernel.org/ > > Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> > --- > Alexandre Courbot (3): > rust: add useful ops for u64 > rust: make ETIMEDOUT error available > gpu: nova-core: add basic timer device > > drivers/gpu/nova-core/driver.rs | 4 +- > drivers/gpu/nova-core/gpu.rs | 35 ++++++++++++++- > drivers/gpu/nova-core/nova_core.rs | 1 + > drivers/gpu/nova-core/regs.rs | 43 ++++++++++++++++++ > drivers/gpu/nova-core/timer.rs | 91 ++++++++++++++++++++++++++++++++++++++ > rust/kernel/error.rs | 1 + > rust/kernel/lib.rs | 1 + > rust/kernel/num.rs | 32 ++++++++++++++ > 8 files changed, 206 insertions(+), 2 deletions(-) > --- > base-commit: 6484e46f33eac8dd42aa36fa56b51d8daa5ae1c1 > change-id: 20250216-nova_timer-c69430184f54 > > Best regards, > -- > Alexandre Courbot <acourbot@nvidia.com> >
Hi Alex, On Mon, Feb 17, 2025 at 11:04:45PM +0900, Alexandre Courbot wrote: > Hi everyone, > > This short RFC is based on top of Danilo's initial driver stub series > [1] and has for goal to initiate discussions and hopefully some design > decisions using the simplest subdevice of the GPU (the timer) as an > example, before implementing more devices allowing the GPU > initialization sequence to progress (Falcon being the logical next step > so we can get the GSP rolling). > > It is kept simple and short for that purpose, and to avoid bumping into > a wall with much more device code because my assumptions were incorrect. > > This is my first time trying to write Rust kernel code, and some of my > questions below are probably due to me not understanding yet how to use > the core kernel interfaces. So before going further I thought it would > make sense to raise the most obvious questions that came to my mind > while writing this draft: Thanks for sending this RFC, that makes a lot of sense. It's great to see you picking up work on Nova and Rust in the kernel in general! One nit: For the future, please make sure to copy in the folks listed under the RUST entry in the maintainers file explicitly. > > - Where and how to store subdevices. The timer device is currently a > direct member of the GPU structure. It might work for GSP devices > which are IIUC supposed to have at least a few fixed devices required > to bring the GSP up ; but as a general rule this probably won't scale > as not all subdevices are present on all GPU variants, or in the same > numbers. So we will probably need to find an equivalent to the > `subdev` linked list in Nouveau. Hm...I think a Vec should probably do the job for this. Once we know the chipset, we know the exact topology of subdevices too. > > - BAR sharing between subdevices. Right now each subdevice gets access > to the full BAR range. I am wondering whether we could not split it > into the relevant slices for each-subdevice, and transfer ownership of > each slice to the device that is supposed to use it. That way each > register would have a single owner, which is arguably safer - but > maybe not as flexible as we will need down the road? I think for self-contained subdevices we can easily add an abstraction for pci_iomap_range() to pci::Device. I considered doing that from the get-go, but then decided to wait until we have some actual use for that. For where we have to share a mapping of the same set of registers between multiple structures, I think we have to embedd in into an Arc (unfortunately, we can't re-use the inner Arc of Devres for that). An alternative would be to request a whole new mapping, i.e. Devres<pci::Bar> instance, but that includes an inner Arc anyways and, hence, is more costly. > > - On a related note, since the BAR is behind a Devres its availability > must first be secured before any hardware access using try_access(). > Doing this on a per-register or per-operation basis looks overkill, so > all methods that access the BAR take a reference to it, allowing to > call try_access() from the highest-level caller and thus reducing the > number of times this needs to be performed. Doing so comes at the cost > of an extra argument to most subdevice methods ; but also with the > benefit that we don't need to put the BAR behind another Arc and share > it across all subdevices. I don't know which design is better here, > and input would be very welcome. 
I'm not sure I understand you correctly, because what you describe here seems to be two different things to me. 1. How to avoid unnecessary calls to try_access(). This is why I made Boot0.read() take a &RevocableGuard<'_, Bar0> as argument. I think we can just call try_access() once and then propagate the guard through the callchain, where necessary. 2. Share the MMIO mapping between subdevices. This is where I can't follow. How does 1. help with that? How are 1. and 2. related? > > - We will probably need sometime like a `Subdevice` trait or something > down the road, but I'll wait until we have more than one subdevice to > think about it. Yeah, that sounds reasonable. > > The first 2 patches are small additions to the core Rust modules, that > the following patches make use of and which might be useful for other > drivers as well. The last patch is the naive implementation of the timer > device. I don't expect it to stay this way at all, so please point out > all the deficiencies in this very early code! :) > > [1] https://lore.kernel.org/nouveau/20250209173048.17398-1-dakr@kernel.org/ > > Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> > --- > Alexandre Courbot (3): > rust: add useful ops for u64 > rust: make ETIMEDOUT error available > gpu: nova-core: add basic timer device > > drivers/gpu/nova-core/driver.rs | 4 +- > drivers/gpu/nova-core/gpu.rs | 35 ++++++++++++++- > drivers/gpu/nova-core/nova_core.rs | 1 + > drivers/gpu/nova-core/regs.rs | 43 ++++++++++++++++++ > drivers/gpu/nova-core/timer.rs | 91 ++++++++++++++++++++++++++++++++++++++ > rust/kernel/error.rs | 1 + > rust/kernel/lib.rs | 1 + > rust/kernel/num.rs | 32 ++++++++++++++ > 8 files changed, 206 insertions(+), 2 deletions(-) > --- > base-commit: 6484e46f33eac8dd42aa36fa56b51d8daa5ae1c1 > change-id: 20250216-nova_timer-c69430184f54 > > Best regards, > -- > Alexandre Courbot <acourbot@nvidia.com> >
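A rough sketch of the pattern Danilo describes above (call try_access() once at the top level and hand the result down to the helpers doing the actual I/O) could look like the following. The Bar0 alias, the Boot0 type and the register offset are placeholders rather than code from the series, and the helper here takes a &Bar0 through the guard's Deref instead of the guard itself, which is one possible variant of the same idea:

    use kernel::{devres::Devres, pci, prelude::*};

    // Placeholder BAR alias; the real driver defines its own type and size.
    type Bar0 = pci::Bar<0x1000000>;

    // Illustrative register wrapper in the spirit of the Boot0.read()
    // mentioned above; the offset is made up.
    struct Boot0(u32);

    impl Boot0 {
        fn read(bar: &Bar0) -> Self {
            Self(bar.read32(0x0))
        }
    }

    fn identify(bar: &Devres<Bar0>) -> Result {
        // Enter the revocable (RCU read-side) section once at the top level...
        let bar = bar.try_access().ok_or(ENXIO)?;

        // ...and hand the mapped BAR down to the helpers that do the I/O.
        // Passing &Bar0 (through the guard's Deref) keeps the helpers
        // oblivious to Devres.
        let boot0 = Boot0::read(&bar);
        pr_info!("boot0: {:#x}\n", boot0.0);

        Ok(())
    }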
On Tue, 18 Feb 2025 at 00:04, Alexandre Courbot <acourbot@nvidia.com> wrote: > > Hi everyone, > > This short RFC is based on top of Danilo's initial driver stub series > [1] and has for goal to initiate discussions and hopefully some design > decisions using the simplest subdevice of the GPU (the timer) as an > example, before implementing more devices allowing the GPU > initialization sequence to progress (Falcon being the logical next step > so we can get the GSP rolling). > > It is kept simple and short for that purpose, and to avoid bumping into > a wall with much more device code because my assumptions were incorrect. > > This is my first time trying to write Rust kernel code, and some of my > questions below are probably due to me not understanding yet how to use > the core kernel interfaces. So before going further I thought it would > make sense to raise the most obvious questions that came to my mind > while writing this draft: > > - Where and how to store subdevices. The timer device is currently a > direct member of the GPU structure. It might work for GSP devices > which are IIUC supposed to have at least a few fixed devices required > to bring the GSP up ; but as a general rule this probably won't scale > as not all subdevices are present on all GPU variants, or in the same > numbers. So we will probably need to find an equivalent to the > `subdev` linked list in Nouveau. I deliberately avoided doing this. My reasoning is this isn't like nouveau, where we control a bunch of devices, we have one mission, bring up GSP, if that entails a bunch of fixed function blocks being setup in a particular order then let's just deal with that. If things become optional later we can move to Option<> or just have a completely new path. But in those cases I'd make the Option <TuringGSPBootDevices> rather than Option<Sec2>, Option<NVDEC>, etc, but I think we need to look at the boot requirements on other GSP devices to know. I just don't see any case where we need to iterate over the subdevices in any form of list that makes sense and doesn't lead to the nouveau design which is a pain in the ass to tease out any sense of ordering or hierarchy. Just be explicit, boot the devices you need in the order you need to get GSP up and running. > > - BAR sharing between subdevices. Right now each subdevice gets access > to the full BAR range. I am wondering whether we could not split it > into the relevant slices for each-subdevice, and transfer ownership of > each slice to the device that is supposed to use it. That way each > register would have a single owner, which is arguably safer - but > maybe not as flexible as we will need down the road? This could be useful, again the mission is mostly not to be hitting registers since GSP will deal with it, the only case I know that it won't work is, the GSP CPU sequencer code gets a script from the device, the script tells you to directly hit registers, with no real bounds checking, so I don't know if this is practical. > > - On a related note, since the BAR is behind a Devres its availability > must first be secured before any hardware access using try_access(). > Doing this on a per-register or per-operation basis looks overkill, so > all methods that access the BAR take a reference to it, allowing to > call try_access() from the highest-level caller and thus reducing the > number of times this needs to be performed. 
Doing so comes at the cost > of an extra argument to most subdevice methods ; but also with the > benefit that we don't need to put the BAR behind another Arc and share > it across all subdevices. I don't know which design is better here, > and input would be very welcome. We can't pass down the bar, because it takes a devres lock and it interferes with lockdep ordering, even hanging onto devres too long caused me lockdep issues. We should only be doing try access on registers that are runtime sized, is this a lot of them? Do we expect to be hitting a lot of registers in an actual fast path? > - We will probably need sometime like a `Subdevice` trait or something > down the road, but I'll wait until we have more than one subdevice to > think about it. Again I'm kinda opposed to this idea, I don't think it buys anything, with GSP we just want to boot, after that we never touch most of the subdevices ever again. Dave. > > The first 2 patches are small additions to the core Rust modules, that > the following patches make use of and which might be useful for other > drivers as well. The last patch is the naive implementation of the timer > device. I don't expect it to stay this way at all, so please point out > all the deficiencies in this very early code! :) > > [1] https://lore.kernel.org/nouveau/20250209173048.17398-1-dakr@kernel.org/ > > Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> > --- > Alexandre Courbot (3): > rust: add useful ops for u64 > rust: make ETIMEDOUT error available > gpu: nova-core: add basic timer device > > drivers/gpu/nova-core/driver.rs | 4 +- > drivers/gpu/nova-core/gpu.rs | 35 ++++++++++++++- > drivers/gpu/nova-core/nova_core.rs | 1 + > drivers/gpu/nova-core/regs.rs | 43 ++++++++++++++++++ > drivers/gpu/nova-core/timer.rs | 91 ++++++++++++++++++++++++++++++++++++++ > rust/kernel/error.rs | 1 + > rust/kernel/lib.rs | 1 + > rust/kernel/num.rs | 32 ++++++++++++++ > 8 files changed, 206 insertions(+), 2 deletions(-) > --- > base-commit: 6484e46f33eac8dd42aa36fa56b51d8daa5ae1c1 > change-id: 20250216-nova_timer-c69430184f54 > > Best regards, > -- > Alexandre Courbot <acourbot@nvidia.com> >
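A minimal sketch of the explicit bring-up Dave describes above, with the chip-specific engines grouped behind a single Option as he suggests; every type name and constructor below is a hypothetical stand-in, not code from the series:

    use kernel::{devres::Devres, pci, prelude::*};

    type Bar0 = pci::Bar<0x1000000>; // placeholder size

    // Hypothetical stand-ins for the fixed-function blocks needed to boot GSP.
    struct Timer;
    struct Falcon;
    struct Gsp;
    struct Sec2;
    struct Nvdec;

    // Engines only some chips need for GSP boot, grouped as one optional
    // bundle rather than a per-engine Option each.
    struct TuringGspBootDevices {
        sec2: Sec2,
        nvdec: Nvdec,
    }

    struct Gpu {
        timer: Timer,
        falcon: Falcon,
        boot_devs: Option<TuringGspBootDevices>,
        gsp: Gsp,
    }

    impl Gpu {
        fn new(bar: &Devres<Bar0>) -> Result<Self> {
            // Explicit, fixed bring-up order; no subdevice list to iterate.
            let timer = Timer::probe(bar)?;
            let falcon = Falcon::probe(bar)?;
            let boot_devs = None; // filled in on chips that require it
            let gsp = Gsp::boot(&timer, &falcon, bar)?;
            Ok(Self { timer, falcon, boot_devs, gsp })
        }
    }

    impl Timer {
        fn probe(_bar: &Devres<Bar0>) -> Result<Self> { Ok(Self) }
    }
    impl Falcon {
        fn probe(_bar: &Devres<Bar0>) -> Result<Self> { Ok(Self) }
    }
    impl Gsp {
        fn boot(_timer: &Timer, _falcon: &Falcon, _bar: &Devres<Bar0>) -> Result<Self> { Ok(Self) }
    }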
> 1. How to avoid unnecessary calls to try_access().
>
> This is why I made Boot0.read() take a &RevocableGuard<'_, Bar0> as argument. I
> think we can just call try_access() once and then propage the guard through the
> callchain, where necessary.

Nope, you can't do that, RevocableGuard holds a lock and things explode badly in lockdep if you do.

[ 39.960247] =============================
[ 39.960265] [ BUG: Invalid wait context ]
[ 39.960282] 6.12.0-rc2+ #151 Not tainted
[ 39.960298] -----------------------------
[ 39.960316] modprobe/2006 is trying to lock:
[ 39.960335] ffffa08dd7783a68 (drivers/gpu/nova-core/gsp/sharedq.rs:259){....}-{3:3}, at: _RNvMs0_NtNtCs6v51TV2h8sK_6nova_c3gsp7sharedqNtB5_26GSPSharedQueuesr535_113_018rpc_push+0x34/0x4c0 [nova_core]
[ 39.960413] other info that might help us debug this:
[ 39.960434] context-{4:4}
[ 39.960447] 2 locks held by modprobe/2006:
[ 39.960465] #0: ffffa08dc27581b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x111/0x260
[ 39.960505] #1: ffffffffad55ac10 (rcu_read_lock){....}-{1:2}, at: rust_helper_rcu_read_lock+0x11/0x80
[ 39.960545] stack backtrace:
[ 39.960559] CPU: 8 UID: 0 PID: 2006 Comm: modprobe Not tainted 6.12.0-rc2+ #151
[ 39.960586] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 6231 08/31/2024
[ 39.960618] Call Trace:
[ 39.960632] <TASK>

That was one time I didn't drop a revocable before proceeding to do other things,

Dave.
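In other words, the guard must not be held across anything that sleeps or takes another lock. A sketch of the scoping that avoids the splat above (helper names, offsets and values are made up):

    use kernel::{devres::Devres, pci, prelude::*};

    type Bar0 = pci::Bar<0x1000000>; // placeholder size

    fn kick_queue(bar: &Devres<Bar0>) -> Result {
        {
            // Keep the guard scope as small as possible: only the MMIO write
            // happens inside the RCU read-side section opened by try_access().
            let bar = bar.try_access().ok_or(ENXIO)?;
            bar.write32(1, 0x0); // value/offset are illustrative
        } // guard dropped here, the read-side section ends

        // Anything that might sleep or take another lock (e.g. waiting for
        // the GSP to answer on a shared queue) happens outside the guard.
        wait_for_completion()
    }

    // Hypothetical stand-in for the sleeping part of the operation.
    fn wait_for_completion() -> Result {
        Ok(())
    }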
On Mon, Feb 17, 2025 at 04:48:13PM +0100, Simona Vetter wrote: > On Mon, Feb 17, 2025 at 11:04:45PM +0900, Alexandre Courbot wrote: > > Hi everyone, > > > > This short RFC is based on top of Danilo's initial driver stub series > > [1] and has for goal to initiate discussions and hopefully some design > > decisions using the simplest subdevice of the GPU (the timer) as an > > example, before implementing more devices allowing the GPU > > initialization sequence to progress (Falcon being the logical next step > > so we can get the GSP rolling). > > > > It is kept simple and short for that purpose, and to avoid bumping into > > a wall with much more device code because my assumptions were incorrect. > > > > This is my first time trying to write Rust kernel code, and some of my > > questions below are probably due to me not understanding yet how to use > > the core kernel interfaces. So before going further I thought it would > > make sense to raise the most obvious questions that came to my mind > > while writing this draft: > > > > - Where and how to store subdevices. The timer device is currently a > > direct member of the GPU structure. It might work for GSP devices > > which are IIUC supposed to have at least a few fixed devices required > > to bring the GSP up ; but as a general rule this probably won't scale > > as not all subdevices are present on all GPU variants, or in the same > > numbers. So we will probably need to find an equivalent to the > > `subdev` linked list in Nouveau. > > > > - BAR sharing between subdevices. Right now each subdevice gets access > > to the full BAR range. I am wondering whether we could not split it > > into the relevant slices for each-subdevice, and transfer ownership of > > each slice to the device that is supposed to use it. That way each > > register would have a single owner, which is arguably safer - but > > maybe not as flexible as we will need down the road? > > > > - On a related note, since the BAR is behind a Devres its availability > > must first be secured before any hardware access using try_access(). > > Doing this on a per-register or per-operation basis looks overkill, so > > all methods that access the BAR take a reference to it, allowing to > > call try_access() from the highest-level caller and thus reducing the > > number of times this needs to be performed. Doing so comes at the cost > > of an extra argument to most subdevice methods ; but also with the > > benefit that we don't need to put the BAR behind another Arc and share > > it across all subdevices. I don't know which design is better here, > > and input would be very welcome. > > > > - We will probably need sometime like a `Subdevice` trait or something > > down the road, but I'll wait until we have more than one subdevice to > > think about it. > > It might make sense to go with a full-blown aux bus. Gives you a lot of > structures and answers to these questions, but also might be way too much. No, it's not too much, that's exactly what the auxbus code is for (splitting a real device into child ones where they all share the same physical resources.) So good suggestion. thanks, greg k-h
On Tue, Feb 18, 2025 at 11:46:26AM +1000, Dave Airlie wrote: > > 1. How to avoid unnecessary calls to try_access(). > > > > This is why I made Boot0.read() take a &RevocableGuard<'_, Bar0> as argument. I > > think we can just call try_access() once and then propage the guard through the > > callchain, where necessary. > > Nope, you can't do that, RevocableGuard holds a lock and things > explode badly in lockdep if you do. Yes, try_access() marks the beginning of an RCU read-side critical section. Hence, sections holding the guard should be kept as short as possible. What I meant is that for a series of I/O operations we can still pass the guard to subsequent functions doing the actual I/O ops. More generally, I also thought about whether we should provide an SRCU variant of Revocable and hence Devres. Maybe we even want to replace it with SRCU entirely to ensure that drivers can't stall the RCU grace period for too long by accident.
On Tue Feb 18, 2025 at 5:07 PM JST, Greg KH wrote: > On Mon, Feb 17, 2025 at 04:48:13PM +0100, Simona Vetter wrote: >> On Mon, Feb 17, 2025 at 11:04:45PM +0900, Alexandre Courbot wrote: >> > Hi everyone, >> > >> > This short RFC is based on top of Danilo's initial driver stub series >> > [1] and has for goal to initiate discussions and hopefully some design >> > decisions using the simplest subdevice of the GPU (the timer) as an >> > example, before implementing more devices allowing the GPU >> > initialization sequence to progress (Falcon being the logical next step >> > so we can get the GSP rolling). >> > >> > It is kept simple and short for that purpose, and to avoid bumping into >> > a wall with much more device code because my assumptions were incorrect. >> > >> > This is my first time trying to write Rust kernel code, and some of my >> > questions below are probably due to me not understanding yet how to use >> > the core kernel interfaces. So before going further I thought it would >> > make sense to raise the most obvious questions that came to my mind >> > while writing this draft: >> > >> > - Where and how to store subdevices. The timer device is currently a >> > direct member of the GPU structure. It might work for GSP devices >> > which are IIUC supposed to have at least a few fixed devices required >> > to bring the GSP up ; but as a general rule this probably won't scale >> > as not all subdevices are present on all GPU variants, or in the same >> > numbers. So we will probably need to find an equivalent to the >> > `subdev` linked list in Nouveau. >> > >> > - BAR sharing between subdevices. Right now each subdevice gets access >> > to the full BAR range. I am wondering whether we could not split it >> > into the relevant slices for each-subdevice, and transfer ownership of >> > each slice to the device that is supposed to use it. That way each >> > register would have a single owner, which is arguably safer - but >> > maybe not as flexible as we will need down the road? >> > >> > - On a related note, since the BAR is behind a Devres its availability >> > must first be secured before any hardware access using try_access(). >> > Doing this on a per-register or per-operation basis looks overkill, so >> > all methods that access the BAR take a reference to it, allowing to >> > call try_access() from the highest-level caller and thus reducing the >> > number of times this needs to be performed. Doing so comes at the cost >> > of an extra argument to most subdevice methods ; but also with the >> > benefit that we don't need to put the BAR behind another Arc and share >> > it across all subdevices. I don't know which design is better here, >> > and input would be very welcome. >> > >> > - We will probably need sometime like a `Subdevice` trait or something >> > down the road, but I'll wait until we have more than one subdevice to >> > think about it. >> >> It might make sense to go with a full-blown aux bus. Gives you a lot of >> structures and answers to these questions, but also might be way too much. > > No, it's not too much, that's exactly what the auxbus code is for > (splitting a real device into child ones where they all share the same > physical resources.) So good suggestion. Dave's comments have somehow convinced me that we probably won't need to do something as complex as I initially planned, so hopefully it won't come to that. :) But thanks for the suggestion, I'll keep it in mind just in case.
Hi Danilo, On Tue Feb 18, 2025 at 6:33 AM JST, Danilo Krummrich wrote: > Hi Alex, > > On Mon, Feb 17, 2025 at 11:04:45PM +0900, Alexandre Courbot wrote: >> Hi everyone, >> >> This short RFC is based on top of Danilo's initial driver stub series >> [1] and has for goal to initiate discussions and hopefully some design >> decisions using the simplest subdevice of the GPU (the timer) as an >> example, before implementing more devices allowing the GPU >> initialization sequence to progress (Falcon being the logical next step >> so we can get the GSP rolling). >> >> It is kept simple and short for that purpose, and to avoid bumping into >> a wall with much more device code because my assumptions were incorrect. >> >> This is my first time trying to write Rust kernel code, and some of my >> questions below are probably due to me not understanding yet how to use >> the core kernel interfaces. So before going further I thought it would >> make sense to raise the most obvious questions that came to my mind >> while writing this draft: > > Thanks for sending this RFC, that makes a lot of sense. > > It's great to see you picking up work on Nova and Rust in the kernel in general! > > One nit: For the future, please make sure to copy in the folks listed under the > RUST entry in the maintainers file explicitly. Ack. I tend to get nervous as the list of recipients increases and reduce it to the people I believe will be strictly interested, but will refrain from doing that in the future. > >> >> - Where and how to store subdevices. The timer device is currently a >> direct member of the GPU structure. It might work for GSP devices >> which are IIUC supposed to have at least a few fixed devices required >> to bring the GSP up ; but as a general rule this probably won't scale >> as not all subdevices are present on all GPU variants, or in the same >> numbers. So we will probably need to find an equivalent to the >> `subdev` linked list in Nouveau. > > Hm...I think a Vec should probably do the job for this. Once we know the > chipset, we know the exact topology of subdevices too. > >> >> - BAR sharing between subdevices. Right now each subdevice gets access >> to the full BAR range. I am wondering whether we could not split it >> into the relevant slices for each-subdevice, and transfer ownership of >> each slice to the device that is supposed to use it. That way each >> register would have a single owner, which is arguably safer - but >> maybe not as flexible as we will need down the road? > > I think for self-contained subdevices we can easily add an abstraction for > pci_iomap_range() to pci::Device. I considered doing that from the get-go, but > then decided to wait until we have some actual use for that. Yeah, my comment was just to check whether this makes sense at all, we can definitely live without it for now. Would be a nice safety addition though IMHO. > > For where we have to share a mapping of the same set of registers between > multiple structures, I think we have to embedd in into an Arc (unfortunately, > we can't re-use the inner Arc of Devres for that). > > An alternative would be to request a whole new mapping, i.e. Devres<pci::Bar> > instance, but that includes an inner Arc anyways and, hence, is more costly. Another way could be to make the owning subdevice itself implement the required functionality through a method that other devices could call as needed. 
> >> >> - On a related note, since the BAR is behind a Devres its availability >> must first be secured before any hardware access using try_access(). >> Doing this on a per-register or per-operation basis looks overkill, so >> all methods that access the BAR take a reference to it, allowing to >> call try_access() from the highest-level caller and thus reducing the >> number of times this needs to be performed. Doing so comes at the cost >> of an extra argument to most subdevice methods ; but also with the >> benefit that we don't need to put the BAR behind another Arc and share >> it across all subdevices. I don't know which design is better here, >> and input would be very welcome. > > I'm not sure I understand you correctly, because what you describe here seem to > be two different things to me. > > 1. How to avoid unnecessary calls to try_access(). > > This is why I made Boot0.read() take a &RevocableGuard<'_, Bar0> as argument. I > think we can just call try_access() once and then propage the guard through the > callchain, where necessary. > > 2. Share the MMIO mapping between subdevices. > > This is where I can follow. How does 1. help with that? How are 1. and 2. > related? The idea was that by having the Gpu device secure access to the Bar and pass a reference to the guard (or to the Bar itself since the guard implements Deref, as I mentioned in [1]), we can avoid the repeated calls to try_access() AND "share" the Bar between all the subdevices through an argument instead of e.g. wrapping it into another Arc that each subdevice would store. But as Dave pointed out, it looks like this won't work in practice anyway, so we can probably drop that idea... [1] https://lore.kernel.org/nouveau/D7Q79WJZSFEK.R9BX1V85SV1Z@nvidia.com/
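For reference, the Arc-based sharing Danilo mentions above could look roughly like the following sketch, where the mapping stays revocable through Devres and each interested subdevice holds a clone of the Arc; the Bar0 alias and the subdevice names are placeholders, not code from the series:

    use kernel::{devres::Devres, pci, prelude::*, sync::Arc};

    type Bar0 = pci::Bar<0x1000000>; // placeholder size

    // Two structures that need to touch the same registers each hold a clone
    // of the Arc; the mapping itself remains revocable through Devres.
    struct Timer {
        bar: Arc<Devres<Bar0>>,
    }

    struct Falcon {
        bar: Arc<Devres<Bar0>>,
    }

    fn build(bar: Devres<Bar0>) -> Result<(Timer, Falcon)> {
        let bar = Arc::new(bar, GFP_KERNEL)?;
        Ok((Timer { bar: bar.clone() }, Falcon { bar }))
    }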
Hi Dave, On Tue Feb 18, 2025 at 10:42 AM JST, Dave Airlie wrote: > On Tue, 18 Feb 2025 at 00:04, Alexandre Courbot <acourbot@nvidia.com> wrote: >> >> Hi everyone, >> >> This short RFC is based on top of Danilo's initial driver stub series >> [1] and has for goal to initiate discussions and hopefully some design >> decisions using the simplest subdevice of the GPU (the timer) as an >> example, before implementing more devices allowing the GPU >> initialization sequence to progress (Falcon being the logical next step >> so we can get the GSP rolling). >> >> It is kept simple and short for that purpose, and to avoid bumping into >> a wall with much more device code because my assumptions were incorrect. >> >> This is my first time trying to write Rust kernel code, and some of my >> questions below are probably due to me not understanding yet how to use >> the core kernel interfaces. So before going further I thought it would >> make sense to raise the most obvious questions that came to my mind >> while writing this draft: >> >> - Where and how to store subdevices. The timer device is currently a >> direct member of the GPU structure. It might work for GSP devices >> which are IIUC supposed to have at least a few fixed devices required >> to bring the GSP up ; but as a general rule this probably won't scale >> as not all subdevices are present on all GPU variants, or in the same >> numbers. So we will probably need to find an equivalent to the >> `subdev` linked list in Nouveau. > > I deliberately avoided doing this. > > My reasoning is this isn't like nouveau, where we control a bunch of > devices, we have one mission, bring up GSP, if that entails a bunch of > fixed function blocks being setup in a particular order then let's > just deal with that. > > If things become optional later we can move to Option<> or just have a > completely new path. But in those cases I'd make the Option > <TuringGSPBootDevices> rather than Option<Sec2>, Option<NVDEC>, etc, > but I think we need to look at the boot requirements on other GSP > devices to know. > > I just don't see any case where we need to iterate over the subdevices > in any form of list that makes sense and doesn't lead to the nouveau > design which is a pain in the ass to tease out any sense of ordering > or hierarchy. > > Just be explicit, boot the devices you need in the order you need to > get GSP up and running. Right, I was looking too far ahead and lost track of the fact that our main purpose is to get the GSP to run. For that a fixed set of devices should do the job just fine. I still suspect that at some point we will need to keep some kernel-side state for the functions supported by the GSP thus will have to introduce more flexibility, but let's think about it when we arrive there. Core GSP boot + communication is fixed. > >> >> - BAR sharing between subdevices. Right now each subdevice gets access >> to the full BAR range. I am wondering whether we could not split it >> into the relevant slices for each-subdevice, and transfer ownership of >> each slice to the device that is supposed to use it. That way each >> register would have a single owner, which is arguably safer - but >> maybe not as flexible as we will need down the road? 
> > This could be useful, again the mission is mostly not to be hitting > registers since GSP will deal with it, the only case I know that it > won't work is, the GSP CPU sequencer code gets a script from the > device, the script tells you to directly hit registers, with no real > bounds checking, so I don't know if this is practical. > >> >> - On a related note, since the BAR is behind a Devres its availability >> must first be secured before any hardware access using try_access(). >> Doing this on a per-register or per-operation basis looks overkill, so >> all methods that access the BAR take a reference to it, allowing to >> call try_access() from the highest-level caller and thus reducing the >> number of times this needs to be performed. Doing so comes at the cost >> of an extra argument to most subdevice methods ; but also with the >> benefit that we don't need to put the BAR behind another Arc and share >> it across all subdevices. I don't know which design is better here, >> and input would be very welcome. > > We can't pass down the bar, because it takes a devres lock and it > interferes with lockdep ordering, even hanging onto devres too long > caused me lockdep issues. > > We should only be doing try access on registers that are runtime > sized, is this a lot of them? Do we expect to be hitting a lot of > registers in an actual fast path? For this particular driver, I don't think we do. But other drivers will probably also have this issue, and the challenge for them will be to find the right granularity at which to invoke try_access(). My worry is that this can explode down the road without warning if invoked at the wrong place, which is the kind of behavior we are trying to avoid by introducing Rust. > >> - We will probably need sometime like a `Subdevice` trait or something >> down the road, but I'll wait until we have more than one subdevice to >> think about it. > > Again I'm kinda opposed to this idea, I don't think it buys anything, > with GSP we just want to boot, after that we never touch most of the > subdevices ever again. Yep, it's definitely not the thing we need to worry about at the moment. Minimal static set of devices, and let's get the GSP running. :) Thanks, Alex.
On Tue, Feb 18, 2025 at 11:26:01AM +0100, Danilo Krummrich wrote: > On Tue, Feb 18, 2025 at 11:46:26AM +1000, Dave Airlie wrote: > > > 1. How to avoid unnecessary calls to try_access(). > > > > > > This is why I made Boot0.read() take a &RevocableGuard<'_, Bar0> as argument. I > > > think we can just call try_access() once and then propage the guard through the > > > callchain, where necessary. > > > > Nope, you can't do that, RevocableGuard holds a lock and things > > explode badly in lockdep if you do. > > Yes, try_access() marks the begin of an RCU read side critical section. Hence, > sections holding the guard should be kept as short as possible. > > What I meant is that for a series of I/O operations we can still pass the guard > to subsequent functions doing the actual I/O ops. > > More generally, I also thought about whether we should also provide an SRCU > variant of Revocable and hence Devres. Maybe we even want to replace it with > SRCU entirely to ensure that drivers can't stall the RCU grace period for too > long by accident. The issue with srcu is that the revocation on driver unload or hotunplug becomes unbounded, which is very, very uncool and the fundamental issue that drm_dev_enter/exit() also struggles with. So unless the kernel has a concept of "bound-time mutex only, but not unbounded sleeps of any sort" I think we should try really, really hard to not introduce srcu revocables on the rust side. And at least for mmio I don't think any driver needs more than maybe some retry loops while holding a spinlock, which is fine under standard rcu. It does mean that drivers need to have fairly fine-grained fallible paths for dealing with revocable resources, but I think that's also a good thing - mmio to an unplugged device times out and is really slow, so doing too many of those isn't a great idea either. Ofc on the C side of things the balance shifts a lot, and we just want to at least make "no uaf on driver hotunplug" something achievable and hence are much more ok with the risk that it's just stuck forever or driver calls take an undue amount of time until they've finished timing out everything. Cheers, Sima
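The "fine-grained fallible paths" point can be illustrated with a small sketch: each MMIO helper returns a Result, so a revoked mapping (unplugged device) surfaces as an error the caller handles immediately instead of waiting on hardware that is no longer there. Register names and offsets below are made up:

    use kernel::{devres::Devres, pci, prelude::*};

    type Bar0 = pci::Bar<0x1000000>; // placeholder size

    // Each access re-checks that the BAR is still alive; once the device is
    // unplugged and the mapping revoked, callers get ENODEV instead of
    // blocking on dead hardware.
    fn read_status(bar: &Devres<Bar0>) -> Result<u32> {
        let bar = bar.try_access().ok_or(ENODEV)?;
        Ok(bar.read32(0x100)) // illustrative register offset
    }

    fn handle_irq(bar: &Devres<Bar0>) -> Result {
        let status = match read_status(bar) {
            Ok(status) => status,
            // Device went away: bail out quickly rather than retrying MMIO
            // against a revoked mapping.
            Err(e) => return Err(e),
        };
        pr_debug!("status: {:#x}\n", status);
        Ok(())
    }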
Hi everyone,

This short RFC is based on top of Danilo's initial driver stub series [1] and aims to initiate discussions and hopefully some design decisions using the simplest subdevice of the GPU (the timer) as an example, before implementing more devices allowing the GPU initialization sequence to progress (Falcon being the logical next step so we can get the GSP rolling).

It is kept simple and short for that purpose, and to avoid bumping into a wall with much more device code built on incorrect assumptions.

This is my first time trying to write Rust kernel code, and some of my questions below are probably due to me not understanding yet how to use the core kernel interfaces. So before going further I thought it would make sense to raise the most obvious questions that came to my mind while writing this draft:

- Where and how to store subdevices. The timer device is currently a direct member of the GPU structure. It might work for GSP devices, which are IIUC supposed to have at least a few fixed devices required to bring the GSP up; but as a general rule this probably won't scale, as not all subdevices are present on all GPU variants, or in the same numbers. So we will probably need to find an equivalent to the `subdev` linked list in Nouveau.

- BAR sharing between subdevices. Right now each subdevice gets access to the full BAR range. I am wondering whether we could not split it into the relevant slices for each subdevice, and transfer ownership of each slice to the device that is supposed to use it. That way each register would have a single owner, which is arguably safer - but maybe not as flexible as we will need down the road?

- On a related note, since the BAR is behind a Devres its availability must first be secured before any hardware access using try_access(). Doing this on a per-register or per-operation basis looks like overkill, so all methods that access the BAR take a reference to it, allowing try_access() to be called from the highest-level caller and thus reducing the number of times this needs to be performed. Doing so comes at the cost of an extra argument to most subdevice methods; but also with the benefit that we don't need to put the BAR behind another Arc and share it across all subdevices. I don't know which design is better here, and input would be very welcome.

- We will probably need something like a `Subdevice` trait down the road, but I'll wait until we have more than one subdevice to think about it.

The first 2 patches are small additions to the core Rust modules that the following patches make use of and which might be useful for other drivers as well. The last patch is the naive implementation of the timer device. I don't expect it to stay this way at all, so please point out all the deficiencies in this very early code! :)
[1] https://lore.kernel.org/nouveau/20250209173048.17398-1-dakr@kernel.org/

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
Alexandre Courbot (3):
      rust: add useful ops for u64
      rust: make ETIMEDOUT error available
      gpu: nova-core: add basic timer device

 drivers/gpu/nova-core/driver.rs    |  4 +-
 drivers/gpu/nova-core/gpu.rs       | 35 ++++++++++++++-
 drivers/gpu/nova-core/nova_core.rs |  1 +
 drivers/gpu/nova-core/regs.rs      | 43 ++++++++++++++++++
 drivers/gpu/nova-core/timer.rs     | 91 ++++++++++++++++++++++++++++++++++++++
 rust/kernel/error.rs               |  1 +
 rust/kernel/lib.rs                 |  1 +
 rust/kernel/num.rs                 | 32 ++++++++++++++
 8 files changed, 206 insertions(+), 2 deletions(-)
---
base-commit: 6484e46f33eac8dd42aa36fa56b51d8daa5ae1c1
change-id: 20250216-nova_timer-c69430184f54

Best regards,
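Since the timer patch itself is not quoted in this thread, the following is only an assumption of the kind of helper it enables: a poll-with-timeout loop combining the try_access() pattern discussed above with the newly exported ETIMEDOUT. The register offset, the bit being tested and the bounded retry count (standing in for a real time-based deadline) are all illustrative:

    use kernel::{devres::Devres, pci, prelude::*};

    type Bar0 = pci::Bar<0x1000000>; // placeholder size

    /// Polls a register at `offset` until `cond` returns true, or fails with
    /// ETIMEDOUT. A bounded iteration count stands in for a proper time-based
    /// deadline to keep the sketch self-contained.
    fn wait_on(bar: &Bar0, offset: usize, cond: impl Fn(u32) -> bool) -> Result {
        for _ in 0..1_000_000 {
            // try_read32() is used because the offset is only known at runtime.
            if cond(bar.try_read32(offset)?) {
                return Ok(());
            }
        }
        Err(ETIMEDOUT)
    }

    fn wait_for_scratch_ready(bar: &Devres<Bar0>) -> Result {
        let bar = bar.try_access().ok_or(ENODEV)?;
        // Illustrative register: wait until bit 0 of a made-up scratch
        // register becomes set.
        wait_on(&bar, 0x140, |v| v & 0x1 != 0)
    }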