[v5,00/21] Host1x/TegraDRM UAPI

Message ID 20210111130019.3515669-1-mperttunen@nvidia.com

Message

Mikko Perttunen Jan. 11, 2021, 12:59 p.m. UTC
Hi all,

here's the fifth revision of the Host1x/TegraDRM UAPI proposal,
containing primarily small bug fixes. It has also been
rebased on top of recent linux-next.

vaapi-tegra-driver has been updated to support the new UAPI
as well as Tegra186:

  https://github.com/cyndis/vaapi-tegra-driver

The `putsurface` program has been tested to work.

The test suite for the new UAPI is available at
https://github.com/cyndis/uapi-test

The series can also be found at
https://github.com/cyndis/linux/commits/work/host1x-uapi-v5.

Older versions:
v1: https://www.spinics.net/lists/linux-tegra/msg51000.html
v2: https://www.spinics.net/lists/linux-tegra/msg53061.html
v3: https://www.spinics.net/lists/linux-tegra/msg54370.html
v4: https://www.spinics.net/lists/dri-devel/msg279897.html

Thank you,
Mikko

Mikko Perttunen (21):
  gpu: host1x: Use different lock classes for each client
  gpu: host1x: Allow syncpoints without associated client
  gpu: host1x: Show number of pending waiters in debugfs
  gpu: host1x: Remove cancelled waiters immediately
  gpu: host1x: Use HW-equivalent syncpoint expiration check
  gpu: host1x: Cleanup and refcounting for syncpoints
  gpu: host1x: Introduce UAPI header
  gpu: host1x: Implement /dev/host1x device node
  gpu: host1x: DMA fences and userspace fence creation
  gpu: host1x: Add no-recovery mode
  gpu: host1x: Add job release callback
  gpu: host1x: Add support for syncpoint waits in CDMA pushbuffer
  gpu: host1x: Reset max value when freeing a syncpoint
  gpu: host1x: Reserve VBLANK syncpoints at initialization
  drm/tegra: Add new UAPI to header
  drm/tegra: Boot VIC during runtime PM resume
  drm/tegra: Set resv fields when importing/exporting GEMs
  drm/tegra: Allocate per-engine channel in core code
  drm/tegra: Implement new UAPI
  drm/tegra: Implement job submission part of new UAPI
  drm/tegra: Add job firewall

 drivers/gpu/drm/tegra/Makefile         |   4 +
 drivers/gpu/drm/tegra/dc.c             |  10 +-
 drivers/gpu/drm/tegra/drm.c            |  69 ++--
 drivers/gpu/drm/tegra/drm.h            |   9 +
 drivers/gpu/drm/tegra/gem.c            |   2 +
 drivers/gpu/drm/tegra/gr2d.c           |   4 +-
 drivers/gpu/drm/tegra/gr3d.c           |   4 +-
 drivers/gpu/drm/tegra/uapi.h           |  63 ++++
 drivers/gpu/drm/tegra/uapi/firewall.c  | 221 +++++++++++++
 drivers/gpu/drm/tegra/uapi/gather_bo.c |  86 +++++
 drivers/gpu/drm/tegra/uapi/gather_bo.h |  22 ++
 drivers/gpu/drm/tegra/uapi/submit.c    | 436 +++++++++++++++++++++++++
 drivers/gpu/drm/tegra/uapi/submit.h    |  21 ++
 drivers/gpu/drm/tegra/uapi/uapi.c      | 307 +++++++++++++++++
 drivers/gpu/drm/tegra/vic.c            | 118 ++++---
 drivers/gpu/host1x/Makefile            |   2 +
 drivers/gpu/host1x/bus.c               |   7 +-
 drivers/gpu/host1x/cdma.c              |  69 +++-
 drivers/gpu/host1x/debug.c             |  14 +-
 drivers/gpu/host1x/dev.c               |  15 +
 drivers/gpu/host1x/dev.h               |  16 +-
 drivers/gpu/host1x/fence.c             | 208 ++++++++++++
 drivers/gpu/host1x/fence.h             |  13 +
 drivers/gpu/host1x/hw/cdma_hw.c        |   2 +-
 drivers/gpu/host1x/hw/channel_hw.c     |  63 ++--
 drivers/gpu/host1x/hw/debug_hw.c       |  11 +-
 drivers/gpu/host1x/intr.c              |  32 +-
 drivers/gpu/host1x/intr.h              |   6 +-
 drivers/gpu/host1x/job.c               |  79 +++--
 drivers/gpu/host1x/job.h               |  14 +
 drivers/gpu/host1x/syncpt.c            | 187 ++++++-----
 drivers/gpu/host1x/syncpt.h            |  16 +-
 drivers/gpu/host1x/uapi.c              | 385 ++++++++++++++++++++++
 drivers/gpu/host1x/uapi.h              |  22 ++
 drivers/staging/media/tegra-video/vi.c |   4 +-
 include/linux/host1x.h                 |  47 ++-
 include/uapi/drm/tegra_drm.h           | 338 +++++++++++++++++--
 include/uapi/linux/host1x.h            | 134 ++++++++
 38 files changed, 2771 insertions(+), 289 deletions(-)
 create mode 100644 drivers/gpu/drm/tegra/uapi.h
 create mode 100644 drivers/gpu/drm/tegra/uapi/firewall.c
 create mode 100644 drivers/gpu/drm/tegra/uapi/gather_bo.c
 create mode 100644 drivers/gpu/drm/tegra/uapi/gather_bo.h
 create mode 100644 drivers/gpu/drm/tegra/uapi/submit.c
 create mode 100644 drivers/gpu/drm/tegra/uapi/submit.h
 create mode 100644 drivers/gpu/drm/tegra/uapi/uapi.c
 create mode 100644 drivers/gpu/host1x/fence.c
 create mode 100644 drivers/gpu/host1x/fence.h
 create mode 100644 drivers/gpu/host1x/uapi.c
 create mode 100644 drivers/gpu/host1x/uapi.h
 create mode 100644 include/uapi/linux/host1x.h

Comments

Dmitry Osipenko Jan. 19, 2021, 10:29 p.m. UTC | #1
11.01.2021 15:59, Mikko Perttunen writes:
> Hi all,
> 
> here's the fifth revision of the Host1x/TegraDRM UAPI proposal,
> containing primarily small bug fixes. It has also been
> rebased on top of recent linux-next.
> 
> vaapi-tegra-driver has been updated to support the new UAPI
> as well as Tegra186:
> 
>   https://github.com/cyndis/vaapi-tegra-driver
> 
> The `putsurface` program has been tested to work.
> 
> The test suite for the new UAPI is available at
> https://github.com/cyndis/uapi-test
> 
> The series can be also found in
> https://github.com/cyndis/linux/commits/work/host1x-uapi-v5.
> 
> Older versions:
> v1: https://www.spinics.net/lists/linux-tegra/msg51000.html
> v2: https://www.spinics.net/lists/linux-tegra/msg53061.html
> v3: https://www.spinics.net/lists/linux-tegra/msg54370.html
> v4: https://www.spinics.net/lists/dri-devel/msg279897.html
> 
> Thank you,
> Mikko

Basic support for the v5 UAPI has now been added to the Opentegra driver.
Overall the UAPI works, but there are a couple of things we need to
improve; I'll focus on them here.

Problems
========

1. The channel map/unmap API needs some more thought.

The main problem is the difficulty of tracking the liveness of BOs and
mappings.  The kernel driver takes a reference on the BO for each mapping,
so userspace needs to track both the BO and its mappings together; it's
too easy to make a mistake and leak BOs without noticing.

2. Host1x sync point UAPI should not be used for tracking DRM jobs.  I
remember we discussed this previously, but it has popped up again and I
don't remember where we ended up last time.

This creates unnecessary complexity for userspace.  Userspace needs to
go through a lot of chores just to get a sync point and then to manage
it for its jobs.

Nothing stops two different channels from reusing a single sync point
for a job; fixing this will only add more complexity to the kernel
driver instead of removing it.

3. The signalling of DMA fences doesn't work properly in v5, apparently
because of the host1x waiter bug.  I see that the sync point interrupt
happens, but the waiter callback isn't invoked.

4. The sync_file API is not very suitable for DRM purposes because of
-EMFILE "Too many open files", which I saw while running x11perf.  It
also adds complexity to userspace instead of removing it.  This approach
is not suitable for the DRM scheduler either.

5. Sync points have dirty hardware state when allocated/requested, which
makes the special sync point reservation meaningless.

6. I found that the need to chop the cmdstream into gathers is a bit
cumbersome for userspace on older SoCs, which don't have a h/w firewall.
Can we support an option where all commands are collected into a single
dedicated cmdstream for a job?

Possible solutions for the above problems
=========================================

1. Stop using the concept of "channels".  Switch to DRM contexts only.

Each DRM context should get access to all engines once the DRM context
is created.  Think of it like "when a DRM context is opened, it opens a
channel for each engine".

Then each DRM context will get one mapping instance per engine for each
BO.

enum tegra_engine {
	TEGRA_GR2D,
	TEGRA_GR3D,
	TEGRA_VIC,
	...
	NUM_ENGINES
};

struct tegra_bo_mapping {
	dma_addr_t ioaddr;
	...
};

struct tegra_bo {
	...
	struct tegra_bo_mapping *hw_maps[NUM_ENGINES];
};

Instead of DRM_IOCTL_TEGRA_CHANNEL_MAP we should have
DRM_IOCTL_TEGRA_GEM_MAP_TO_ENGINE, which will create a BO mapping for a
specified h/w engine.

Once a BO is closed, all its mappings should be closed too.  This way
userspace doesn't need to track both BOs and mappings.

Everything that userspace needs to do is:

	1) Open DRM context
	2) Create GEM
	3) Map GEM to required hardware engines
	4) Submit job that uses GEM handle
	5) Close GEM

If a GEM wasn't mapped prior to the job's submission, the job will fail
because the kernel driver won't be able to resolve the GEM's IO mapping.
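
For illustration, here is a rough userspace sketch of that flow.  The GEM
create ioctl and struct drm_tegra_gem_create already exist in tegra_drm.h;
DRM_IOCTL_TEGRA_GEM_MAP_TO_ENGINE, its struct and its request number are
hypothetical and only follow the naming proposed above:

#include <sys/ioctl.h>
#include <drm/drm.h>
#include <drm/tegra_drm.h>

/* Hypothetical ioctl following the proposal above; the struct layout and
 * the request number 0x20 are illustrative only. */
struct drm_tegra_gem_map_to_engine {
	__u32 handle;	/* GEM handle to map */
	__u32 engine;	/* enum tegra_engine value, e.g. TEGRA_VIC */
};

#define DRM_IOCTL_TEGRA_GEM_MAP_TO_ENGINE \
	DRM_IOWR(DRM_COMMAND_BASE + 0x20, struct drm_tegra_gem_map_to_engine)

static int create_and_map_bo(int drm_fd, __u32 engine, __u32 *handle)
{
	struct drm_tegra_gem_create create = { .size = 0x10000 };
	struct drm_tegra_gem_map_to_engine map;

	/* 2) create the GEM */
	if (ioctl(drm_fd, DRM_IOCTL_TEGRA_GEM_CREATE, &create) < 0)
		return -1;

	/* 3) map it to the required engine; the kernel keeps the mapping
	 * alive until the GEM itself is closed */
	map.handle = create.handle;
	map.engine = engine;
	if (ioctl(drm_fd, DRM_IOCTL_TEGRA_GEM_MAP_TO_ENGINE, &map) < 0)
		return -1;

	*handle = create.handle;
	return 0;	/* 4) submit jobs using *handle, then 5) close the GEM */
}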

2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
increments.  The job's sync point will be allocated dynamically when the
job is submitted.  We will need a flag for the sync_incr and wait_syncpt
commands, saying "this is the job's sync point increment/wait".

3. We should use the dma-fence API directly and the waiter shim should
be removed.  It's great that you're already working on this.

4. Sync files shouldn't be needed for the parts of the DRM API which
don't interact with external non-DRM devices.  We should use DRM syncobjs
for everything related to DRM; it's a superior API to sync files and it's
suitable for the DRM scheduler.

5. The hardware state of sync points should be reset when a sync point
is requested, not when the host1x driver is initialized.

6. We will need to allocate a host1x BO for the job's cmdstream and add
a restart command to the end of the job's stream.  CDMA will jump into
the job's stream from the push buffer.

We could add a flag for that to drm_tegra_submit_cmd_gather, saying that
the gather should be inlined into the job's main cmdstream.

This removes the need for a large push buffer that can easily overflow;
that's a real problem, and the upstream driver even has a bug where it
locks up on overflow.

How it will look from the CDMA perspective:

  PUSHBUF                    JOB CMDSTREAM              JOB GATHER
  ---------                  -------------              ----------
  ...                        ...
  RESTART (to job) --------> GATHER ------------------> DATA
  ...       <--------------- RESTART (back to pushbuf)


What's missing
==============

1. Explicit and implicit fencing aren't there yet; we need to support the
DRM scheduler for that.

2. The "wait" command probably should be taught to take a syncobj handle
in order to populate it with a dma-fence by kernel driver once job is
submitted.  This will give us intermediate fences of a job and allow
utilize the syncobj features like "wait until job is submitted to h/w
before starting to wait for timeout", which will be needed by userspace
when DRM scheduler will be supported.

Miscellaneous
=============

1. Please don't forget to bump the driver version.  This is important
for userspace.

2. Please use proper kernel coding style and run checkpatch:

   # git format-patch -v5 ...
   # ./scripts/checkpatch.pl --strict v5*

3. The kernel driver needs rich error messages for each error condition,
and it should dump the submitted job when the firewall fails.  It's very
tedious to re-add all of that each time something doesn't work.

4. Previously the firewall used the client's class if is_valid_class
wasn't specified in tegra_drm_client_ops; you changed that, and now the
firewall fails for GR3D because it doesn't set is_valid_class() in its
driver.  See [1].

5. The CDMA class should be restored to the class of the previous gather
after the wait command in submit_gathers(), not to the client's class.
GR2D supports multiple classes.  See [1].

[1]
https://github.com/grate-driver/linux/commit/024cba369c9c0e2762e9890068ff9944cb10c44f
Mikko Perttunen Jan. 26, 2021, 2:45 a.m. UTC | #2
On 1/20/21 12:29 AM, Dmitry Osipenko wrote:
> 11.01.2021 15:59, Mikko Perttunen пишет:
>> Hi all,
>>
>> here's the fifth revision of the Host1x/TegraDRM UAPI proposal,
>> containing primarily small bug fixes. It has also been
>> rebased on top of recent linux-next.
>>
>> vaapi-tegra-driver has been updated to support the new UAPI
>> as well as Tegra186:
>>
>>    https://github.com/cyndis/vaapi-tegra-driver
>>
>> The `putsurface` program has been tested to work.
>>
>> The test suite for the new UAPI is available at
>> https://github.com/cyndis/uapi-test
>>
>> The series can be also found in
>> https://github.com/cyndis/linux/commits/work/host1x-uapi-v5.
>>
>> Older versions:
>> v1: https://www.spinics.net/lists/linux-tegra/msg51000.html
>> v2: https://www.spinics.net/lists/linux-tegra/msg53061.html
>> v3: https://www.spinics.net/lists/linux-tegra/msg54370.html
>> v4: https://www.spinics.net/lists/dri-devel/msg279897.html
>>
>> Thank you,
>> Mikko
> 
> The basic support for the v5 UAPI is added now to the Opentegra driver.
>   In overall UAPI works, but there are couple things that we need to
> improve, I'll focus on them here.
> 
> Problems
> ========
> 
> 1. The channel map/unmap API needs some more thought.
> 
> The main problem is a difficulty to track liveness of BOs and mappings.
>   The kernel driver refs BO for each mapping and userspace needs to track
> both BO and its mappings together, it's too easy to make mistake and
> leak BOs without noticing.
> 
> 2. Host1x sync point UAPI should not be used for tracking DRM jobs.  I
> remember we discussed this previously, but this pops up again and I
> don't remember where we ended previously.
> 
> This creates unnecessary complexity for userspace.  Userspace needs to
> go through a lot of chores just to get a sync point and then to manage
> it for the jobs.
> 
> Nothing stops two different channels to reuse a single sync point and
> use it for a job, fixing this will only add more complexity to the
> kernel driver instead of removing it.
> 
> 3. The signalling of DMA fences doesn't work properly in v5 apparently
> because of the host1x waiter bug.  I see that sync point interrupt
> happens, but waiter callback isn't invoked.
> 
> 4. The sync_file API is not very suitable for DRM purposes because of
> -EMFILE "Too many open files", which I saw while was running x11perf.
> It also adds complexity to userspace, instead of removing it.  This
> approach not suitable for DRM scheduler as well.
> 
> 5. Sync points have a dirty hardware state when allocated / requested.
> The special sync point reservation is meaningless in this case.
> 
> 6. I found that the need to chop cmdstream into gathers is a bit
> cumbersome for userspace of older SoCs which don't have h/w firewall.
> Can we support option where all commands are collected into a single
> dedicated cmdstream for a job?
> 
> Possible solutions for the above problems
> =========================================
> 
> 1. Stop to use concept of "channels". Switch to DRM context-only.
> 
> Each DRM context should get access to all engines once DRM context is
> created.  Think of it like "when DRM context is opened, it opens a
> channel for each engine".
> 
> Then each DRM context will get one instance of mapping per-engine for
> each BO.
> 
> enum tegra_engine {
> 	TEGRA_GR2D,
> 	TEGRA_GR3D,
> 	TEGRA_VIC,
> 	...
> 	NUM_ENGINES
> };
> 
> struct tegra_bo_mapping {
> 	dma_addr_t ioaddr;
> 	...
> };
> 
> struct tegra_bo {
> 	...
> 	struct tegra_bo_mapping *hw_maps[NUM_ENGINES];
> };
> 
> Instead of DRM_IOCTL_TEGRA_CHANNEL_MAP we should have
> DRM_IOCTL_TEGRA_GEM_MAP_TO_ENGINE, which will create a BO mapping for a
> specified h/w engine.
> 
> Once BO is closed, all its mappings should be closed too.  This way
> userspace doesn't need to track both BOs and mappings.
> 
> Everything that userspace needs to do is:
> 
> 	1) Open DRM context
> 	2) Create GEM
> 	3) Map GEM to required hardware engines
> 	4) Submit job that uses GEM handle
> 	5) Close GEM
> 
> If GEM wasn't mapped prior to the job's submission, then job will fail
> because kernel driver won't resolve the IO mapping of the GEM.
Perhaps we can instead change the reference counting such that when the 
GEM object is closed, its mappings are automatically dropped as well.
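
As a rough sketch of what I mean, reusing the tegra_bo/hw_maps layout from 
your proposal above (tegra_bo_mapping_put() is a hypothetical helper, not 
existing code):

static void tegra_bo_release_mappings(struct tegra_bo *bo)
{
	unsigned int i;

	/* called from the BO's free path, once the last GEM handle and
	 * the last dma-buf reference are gone */
	for (i = 0; i < NUM_ENGINES; i++) {
		if (bo->hw_maps[i]) {
			/* hypothetical helper: unpins the IOVA and drops
			 * the mapping's reference on the BO */
			tegra_bo_mapping_put(bo->hw_maps[i]);
			bo->hw_maps[i] = NULL;
		}
	}
}

That way userspace only ever needs to track the GEM handle.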

> 
> 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
> increments.  The job's sync point will be allocated dynamically when job
> is submitted.  We will need a flag for the sync_incr and wait_syncpt
> commands, saying "it's a job's sync point increment/wait"

Negative. As I have explained in previous discussions, with the current 
approach the usage of hardware resources is much more deterministic and 
obvious. I disagree that this is much more complicated for userspace. 
Separating syncpoint and channel allocation is one of the primary 
motivations of this series for me.

> 
> 3. We should use dma-fence API directly and waiter-shim should be
> removed.  It's great that you're already working on this.
> 
> 4. Sync file shouldn't be needed for the part of DRM API which doesn't
> interact with external non-DRM devices.  We should use DRM syncobj for
> everything related to DRM, it's a superior API over sync file, it's
> suitable for DRM scheduler.

Considering the issues with fileno limits, I suppose there is no other 
choice. Given the recent NTSYNC proposal by the Wine developers, maybe we 
should also have NTHANDLEs to get rid of the restrictions on file 
descriptors. DRM syncobjs may have some advantages over sync files, but 
also disadvantages. They cannot be poll()ed, so they cannot be combined 
with waits for other resources.

I'll look into this for v6.

> 
> 5. The hardware state of sync points should be reset when sync point is
> requested, not when host1x driver is initialized.

This may be doable, but I don't think it is critical for this UAPI, so 
let's consider it after this series.

Userspace should in any case not be able to assume the initial value of 
the syncpoint upon allocation. The kernel should set it to some high 
value to catch any issues related to wraparound.
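
(For reference, the expiration check this series moves to is already 
wraparound-safe; roughly the following kind of comparison, shown here only 
as a sketch of the idea rather than the literal driver code:)

#include <linux/types.h>

/* a threshold counts as reached once the 32-bit value has advanced past
 * it, even if the counter has wrapped in between */
static bool syncpt_value_reached(u32 value, u32 threshold)
{
	return (s32)(value - threshold) >= 0;
}

With a check like that, starting the counter near the wrap point is 
harmless and quickly exposes any caller that still compares raw values 
directly.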

Also, this makes the code more complicated, since it now needs to ensure 
that all waits on the syncpoint have completed before freeing the 
syncpoint, which can be nontrivial, e.g. if the waiter is in a different 
virtual machine or some other device connected via PCIe (a real use case).

> 
> 6. We will need to allocate a host1x BO for a job's cmdstream and add a
> restart command to the end of the job's stream.  CDMA will jump into the
> job's stream from push buffer.
> 
> We could add a flag for that to drm_tegra_submit_cmd_gather, saying that
> gather should be inlined into job's main cmdstream.
> 
> This will remove a need to have a large push buffer that will easily
> overflow, it's a real problem and upstream driver even has a bug where
> it locks up on overflow.
> 
> How it will look from CDMA perspective:
> 
> PUSHBUF |
> ---------
> ...     |      | JOB   |
>          |      ---------       | JOB GATHER |
> RESTART	------> CMD    |       --------------
>          |      |GATHER -------> DATA        |
> ... <---------- RESTART|       |            |
>          |      |       |
> 

Let me check if I understood you correctly:
- You would like the job's cmdbuf to contain further GATHER opcodes that 
jump into smaller gathers?

I assume this is needed because currently WAITs are placed into the 
pushbuffer, so the job will take a lot of space in the pushbuffer if 
there are a lot of waits (and GATHERs in between these waits)?

If so, perhaps as a simpler alternative we could change the firewall to 
allow SETCLASS into HOST1X specifically for waiting; then userspace could 
just submit one big cmdbuf that takes up only a little space in the 
pushbuffer. Although that would only allow direct ID/threshold waits.

In any case, it seems that this can be added in a later patch, so we 
should omit it from this series for simplicity. If it is impossible for 
userspace to deal with it, we could disable the firewall temporarily, or 
implement the above change in the firewall.

> 
> What's missing
> ==============
> 
> 1. Explicit and implicit fencing isn't there yet, we need to support DRM
> scheduler for that.
> 
> 2. The "wait" command probably should be taught to take a syncobj handle
> in order to populate it with a dma-fence by kernel driver once job is
> submitted.  This will give us intermediate fences of a job and allow
> utilize the syncobj features like "wait until job is submitted to h/w
> before starting to wait for timeout", which will be needed by userspace
> when DRM scheduler will be supported.
> 
> Miscellaneous
> =============
> 
> 1. Please don't forget to bump driver version.  This is important for
> userspace.

Sure. I didn't do it this time since it's backwards compatible and it's 
easy to detect if the new UAPI is available by checking for /dev/host1x. 
I can bump it in v6 if necessary.

> 
> 2. Please use a proper kernel coding style, use checkpatch.
> 
>     # git format-patch -v5 ...
>     # ./scripts/checkpatch.pl --strict v5*

Looks like I accidentally placed some spaces into firewall.c. Otherwise 
the warnings it prints do not look critical.

> 
> 3. Kernel driver needs a rich error messages for each error condition
> and it should dump submitted job when firewall fails.  It's very tedious
> to re-add it all each time when something doesn't work.

Yes, that's true. Will take a look for v6.

> 
> 4. Previously firewall was using the client's class if is_valid_class
> wasn't specified in tegra_drm_client_ops, you changed it and now
> firewall fails for GR3D because it doesn't have the is_valid_class() set
> in the driver.  See [1].
> 
> 5. The CDMA class should be restored to the class of a previous gather
> after the wait command in submit_gathers() and not to the class of the
> client.  GR2D supports multiple classes.  See [1].

Will take a look at these two for v6.

> 
> [1]
> https://github.com/grate-driver/linux/commit/024cba369c9c0e2762e9890068ff9944cb10c44f
> 

Mikko
Dmitry Osipenko Jan. 27, 2021, 9:20 p.m. UTC | #3
26.01.2021 05:45, Mikko Perttunen writes:
>> 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
>> increments.  The job's sync point will be allocated dynamically when job
> is submitted.  We will need a flag for the sync_incr and wait_syncpt
>> commands, saying "it's a job's sync point increment/wait"
> 
> Negative. Like I have explained in previous discussions, with the
> current way the usage of hardware resources is much more deterministic
> and obvious. I disagree on the point that this is much more complicated
> for the userspace. Separating syncpoint and channel allocation is one of
> the primary motivations of this series for me.

Sync points are a limited resource. The most sensible way to work around
that is to keep sync points within the kernel as much as possible. This is
not only much simpler for userspace, but also allows us to utilize the DRM
API properly without re-inventing what already exists, and it makes it
easier to keep the hardware in a good state.

If you need to use a dedicated sync point for VMs, then just allocate that
special sync point and use it. But this sync point won't be used for job
tracking by the kernel driver. Is there any problem with this?

The primary motivation for me is to get a practically usable kernel
driver for userspace.
Dmitry Osipenko Jan. 27, 2021, 9:26 p.m. UTC | #4
26.01.2021 05:45, Mikko Perttunen writes:
>> 5. The hardware state of sync points should be reset when sync point is
>> requested, not when host1x driver is initialized.
> 
> This may be doable, but I don't think it is critical for this UAPI, so
> let's consider it after this series.
> 
> The userspace should anyway not be able to assume the initial value of
> the syncpoint upon allocation. The kernel should set it to some high
> value to catch any issues related to wraparound.

This is critical because min != max when the sync point is requested.

> Also, this makes code more complicated since it now needs to ensure all
> waits on the syncpoint have completed before freeing the syncpoint,
> which can be nontrivial e.g. if the waiter is in a different virtual
> machine or some other device connected via PCIe (a real usecase).

It sounds to me like these VM sync points should be treated quite
separately from generic sync points, don't you think? Let's not mix them;
let's get the generic sync points usable first.
Dmitry Osipenko Jan. 27, 2021, 9:35 p.m. UTC | #5
26.01.2021 05:45, Mikko Perttunen writes:
>> 4. Sync file shouldn't be needed for the part of DRM API which doesn't
>> interact with external non-DRM devices.  We should use DRM syncobj for
>> everything related to DRM, it's a superior API over sync file, it's
>> suitable for DRM scheduler.
> 
> Considering the issues with fileno limits, I suppose there is no other
> choice. Considering the recent NTSYNC proposal by Wine developers, maybe
> we should also have NTHANDLEs to get rid of restrictions of file
> descriptors.

It's odd to me that you're trying to avoid the existing DRM API. This was
all solved in DRM a long time ago, and the grate drivers have no problems
using the DRM APIs. Even if something is really missing, you should add
the missing features instead of re-inventing everything from scratch.

> DRM syncobjs may have some advantages over sync files, but
> also disadvantages. They cannot be poll()ed, so they cannot be combined
> with waits for other resources.

I'm not sure what you mean by "poll". Sync objects support polling very well.
Dmitry Osipenko Jan. 27, 2021, 9:52 p.m. UTC | #6
26.01.2021 05:45, Mikko Perttunen writes:
>> 6. We will need to allocate a host1x BO for a job's cmdstream and add a
>> restart command to the end of the job's stream.  CDMA will jump into the
>> job's stream from push buffer.
>>
>> We could add a flag for that to drm_tegra_submit_cmd_gather, saying that
>> gather should be inlined into job's main cmdstream.
>>
>> This will remove a need to have a large push buffer that will easily
>> overflow, it's a real problem and upstream driver even has a bug where
>> it locks up on overflow.
>>
>> How it will look from CDMA perspective:
>>
>> PUSHBUF |
>> ---------
>> ...     |      | JOB   |
>>          |      ---------       | JOB GATHER |
>> RESTART    ------> CMD    |       --------------
>>          |      |GATHER -------> DATA        |
>> ... <---------- RESTART|       |            |
>>          |      |       |
>>
> 
> Let me check if I understood you correctly:
> - You would like to have the job's cmdbuf have further GATHER opcodes
> that jump into smaller gathers?

I want jobs to be self-contained. Instead of pushing commands to the
kernel driver's push buffer, we'll push them to the job's cmdstream. This
means that we'll need to allocate a host1x buffer for each new job.

> I assume this is needed because currently WAITs are placed into the
> pushbuffer, so the job will take a lot of space in the pushbuffer if
> there are a lot of waits (and GATHERs in between these waits)?

Yes, and with drm-sched we will just need to limit the maximum number of
jobs in the h/w queue (i.e. the push buffer), and then the push buffer
won't ever overflow. Problem solved.
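
For illustration, a sketch of what that amounts to, assuming the
drm_sched_init() signature of this kernel generation (the exact parameter
list varies between versions, and the limit of 16 is an arbitrary example):

#include <linux/jiffies.h>
#include <drm/gpu_scheduler.h>

/* one scheduler per channel; hw_submission caps how many jobs can sit in
 * the hardware queue (and therefore in the push buffer) at any time */
static int tegra_channel_sched_init(struct drm_gpu_scheduler *sched,
				    const struct drm_sched_backend_ops *ops)
{
	return drm_sched_init(sched, ops,
			      16,			/* hw_submission limit */
			      0,			/* hang_limit */
			      msecs_to_jiffies(10000),	/* job timeout */
			      "tegra");
}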

> If so, perhaps as a simpler alternative we could change the firewall to
> allow SETCLASS into HOST1X for waiting specifically, then userspace
> could just submit one big cmdbuf taking only little space in the
> pushbuffer? Although that would only allow direct ID/threshold waits.

My solution doesn't require changes to the firewall; I'm not sure whether
it's easier.

> In any case, it seems that this can be added in a later patch, so we
> should omit it from this series for simplicity. If it is impossible for
> the userspace to deal with it, we could disable the firewall
> temporarily, or implement the above change in the firewall.

I won't be able to test the UAPI fully until all features are at least on
par with the experimental driver in the grate kernel.
Mikko Perttunen Jan. 27, 2021, 9:53 p.m. UTC | #7
On 1/27/21 11:35 PM, Dmitry Osipenko wrote:
> 26.01.2021 05:45, Mikko Perttunen пишет:
>>> 4. Sync file shouldn't be needed for the part of DRM API which doesn't
>>> interact with external non-DRM devices.  We should use DRM syncobj for
>>> everything related to DRM, it's a superior API over sync file, it's
>>> suitable for DRM scheduler.
>>
>> Considering the issues with fileno limits, I suppose there is no other
>> choice. Considering the recent NTSYNC proposal by Wine developers, maybe
>> we should also have NTHANDLEs to get rid of restrictions of file
>> descriptors.
> 
> It's odd to me that you trying to avoid the existing DRM API. This all
> was solved in DRM long time ago and grate drivers have no problems with
> using the DRM APIs. Even if something is really missing, then you should
> add the missing features instead of re-inventing everything from scratch.
> 

DRM is only one of many subsystems that will have to deal with 
syncpoints, so I have wanted to have a central solution instead of 
reimplementing the same stuff everywhere. sync_files seem like the 
"missing feature", but they are difficult to use it with the fileno 
limits. But as has been said many times, they are intended only to 
transfer fences between the implementations in individual drivers, so I 
guess I will have to abandon this dream.

>> DRM syncobjs may have some advantages over sync files, but
>> also disadvantages. They cannot be poll()ed, so they cannot be combined
>> with waits for other resources.
> 
> I'm not sure do you mean by "poll". Sync object supports polling very well.
> 

I mean the poll/select etc. series of functions, which wait for file 
descriptors to become ready. If there's some trick that allows syncobjs 
to be used for that, then please tell.

Mikko
Mikko Perttunen Jan. 27, 2021, 9:57 p.m. UTC | #8
On 1/27/21 11:26 PM, Dmitry Osipenko wrote:
> 26.01.2021 05:45, Mikko Perttunen пишет:
>>> 5. The hardware state of sync points should be reset when sync point is
>>> requested, not when host1x driver is initialized.
>>
>> This may be doable, but I don't think it is critical for this UAPI, so
>> let's consider it after this series.
>>
>> The userspace should anyway not be able to assume the initial value of
>> the syncpoint upon allocation. The kernel should set it to some high
>> value to catch any issues related to wraparound.
> 
> This is critical because min != max when sync point is requested.

That I would just consider a bug, and it can be fixed. But it's 
orthogonal to whether the value gets reset every time the syncpoint is 
allocated.

> 
>> Also, this makes code more complicated since it now needs to ensure all
>> waits on the syncpoint have completed before freeing the syncpoint,
>> which can be nontrivial e.g. if the waiter is in a different virtual
>> machine or some other device connected via PCIe (a real usecase).
> 
> It sounds to me that these VM sync points should be treated very
> separately from a generic sync points, don't you think so? Let's not mix
> them and get the generic sync points usable first.
> 

They are not special in any way, I'm just referring to cases where the 
waiter (consumer) is remote. The allocator of the syncpoint (producer) 
doesn't necessarily even need to know about it. The same concern is 
applicable within a single VM, or a single application, as well. I'm just 
pointing out that this is something that needs to be taken care of if we 
were to reset the value.

Mikko
Dmitry Osipenko Jan. 27, 2021, 10:06 p.m. UTC | #9
28.01.2021 00:57, Mikko Perttunen writes:
> 
> 
> On 1/27/21 11:26 PM, Dmitry Osipenko wrote:
>> 26.01.2021 05:45, Mikko Perttunen пишет:
>>>> 5. The hardware state of sync points should be reset when sync point is
>>>> requested, not when host1x driver is initialized.
>>>
>>> This may be doable, but I don't think it is critical for this UAPI, so
>>> let's consider it after this series.
>>>
>>> The userspace should anyway not be able to assume the initial value of
>>> the syncpoint upon allocation. The kernel should set it to some high
>>> value to catch any issues related to wraparound.
>>
>> This is critical because min != max when sync point is requested.
> 
> That I would just consider a bug, and it can be fixed. But it's
> orthogonal to whether the value gets reset every time the syncpoint is
> allocated.
> 
>>
>>> Also, this makes code more complicated since it now needs to ensure all
>>> waits on the syncpoint have completed before freeing the syncpoint,
>>> which can be nontrivial e.g. if the waiter is in a different virtual
>>> machine or some other device connected via PCIe (a real usecase).
>>
>> It sounds to me that these VM sync points should be treated very
>> separately from a generic sync points, don't you think so? Let's not mix
>> them and get the generic sync points usable first.
>>
> 
> They are not special in any way, I'm just referring to cases where the
> waiter (consumer) is remote. The allocator of the syncpoint (producer)
> doesn't necessarily even need to know about it. The same concern is
> applicable within a single VM, or single application as well. Just
> putting out the point that this is something that needs to be taken care
> of if we were to reset the value.

Will the kernel driver know that it is dealing with a VM sync point?

Will it be possible to get a non-VM sync point explicitly?

If the driver knows that it is dealing with a VM sync point, then we can
treat it specially, avoiding the reset etc.
Dmitry Osipenko Jan. 27, 2021, 10:26 p.m. UTC | #10
28.01.2021 00:53, Mikko Perttunen writes:
> On 1/27/21 11:35 PM, Dmitry Osipenko wrote:
>> 26.01.2021 05:45, Mikko Perttunen пишет:
>>>> 4. Sync file shouldn't be needed for the part of DRM API which doesn't
>>>> interact with external non-DRM devices.  We should use DRM syncobj for
>>>> everything related to DRM, it's a superior API over sync file, it's
>>>> suitable for DRM scheduler.
>>>
>>> Considering the issues with fileno limits, I suppose there is no other
>>> choice. Considering the recent NTSYNC proposal by Wine developers, maybe
>>> we should also have NTHANDLEs to get rid of restrictions of file
>>> descriptors.
>>
>> It's odd to me that you trying to avoid the existing DRM API. This all
>> was solved in DRM long time ago and grate drivers have no problems with
>> using the DRM APIs. Even if something is really missing, then you should
>> add the missing features instead of re-inventing everything from scratch.
>>
> 
> DRM is only one of many subsystems that will have to deal with
> syncpoints, so I have wanted to have a central solution instead of
> reimplementing the same stuff everywhere. sync_files seem like the
> "missing feature", but they are difficult to use it with the fileno
> limits. But as has been said many times, they are intended only to
> transfer fences between the implementations in individual drivers, so I
> guess I will have to abandon this dream.

Let's focus on finishing the basics first, using what we already have.
Sync file + syncobj should be good enough for what we need right now.
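
For instance, when a pollable fd really is needed, the fence attached to a
syncobj can be exported as a sync_file through existing libdrm calls; a
minimal sketch (error handling kept short):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#include <xf86drm.h>

static int wait_job_pollable(int drm_fd, uint32_t syncobj)
{
	struct pollfd pfd;
	int sync_file_fd, ret;

	/* export the fence currently attached to the syncobj as an fd */
	if (drmSyncobjExportSyncFile(drm_fd, syncobj, &sync_file_fd))
		return -1;

	pfd.fd = sync_file_fd;
	pfd.events = POLLIN;
	ret = poll(&pfd, 1, -1);	/* readable once the fence signals */

	close(sync_file_fd);
	return ret;
}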

>>> DRM syncobjs may have some advantages over sync files, but
>>> also disadvantages. They cannot be poll()ed, so they cannot be combined
>>> with waits for other resources.
>>
>> I'm not sure do you mean by "poll". Sync object supports polling very
>> well.
>>
> 
> I mean the poll/select etc. series of functions, which wait for file
> descriptors to become ready. If there's some trick that allows syncobjs
> to be used for that, then please tell.

Please explain in detail what problem you need to solve, and give an example.
Mikko Perttunen Jan. 28, 2021, 11:08 a.m. UTC | #11
On 1/27/21 11:20 PM, Dmitry Osipenko wrote:
> 26.01.2021 05:45, Mikko Perttunen пишет:
>>> 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
>>> increments.  The job's sync point will be allocated dynamically when job
>>> is submitted.  We will need a flag for the sync_incr and wait_syncpt
>>> commands, saying "it's a job's sync point increment/wait"
>>
>> Negative. Like I have explained in previous discussions, with the
>> current way the usage of hardware resources is much more deterministic
>> and obvious. I disagree on the point that this is much more complicated
>> for the userspace. Separating syncpoint and channel allocation is one of
>> the primary motivations of this series for me.
> 
> Sync points are a limited resource. The most sensible way to work around
> it is to keep sync points within kernel as much as possible. This is not
> only much simpler for user space, but also allows to utilize DRM API
> properly without re-inventing what already exists and it's easier to
> maintain hardware in a good state.

I've spent the last few years designing for automotive and industrial 
products, where we don't want to notice at runtime that the system is 
out of free syncpoints and, because of that, can only process the next 
camera frame in a second or two instead of 16 milliseconds. We need to 
know that once we have allocated the resource, it is there. The newer 
chips are also designed to support this.

Considering Linux is increasingly being used for such applications, and 
they are important target markets for NVIDIA, these need to be supported.

Because of the above design constraint the userspace software that runs 
in these environments also expects resources to be allocated up front. 
This isn't a matter of having to design that software according to what 
kind of allocation API we decide do at Linux level -- it's no use 
designing for dynamic allocation if it leads to you not meeting the 
safety requirement of needing to ensure you have all resources allocated 
up front.

This is a good design feature not just in a car, but in anything that 
needs to be reliable. However, it does pose some tradeoffs, and if you 
think that running out of syncpoints on T20-T114 because of upfront 
allocation is an actual problem, I'm not opposed to having both options 
available.

> 
> If you need to use a dedicated sync point for VMs, then just allocate
> that special sync point and use it. But this sync point won't be used
> for jobs tracking by kernel driver. Is there any problem with this?

In addition to the above, it would increase the number of syncpoints 
required. The number of syncpoints supported by hardware has been 
calculated for specific use cases, and increasing the number of required 
syncpoints risks not being able to support those use cases.

> 
> The primary motivation for me is to get a practically usable kernel
> driver for userspace.
> 

Me too. For the traditional "tablet chips" the task is quite well 
defined and supported. But my goal is to also get rid of the jank in 
downstream and allow fully-featured use of Tegra devices on upstream 
kernels and for that, the driver needs to be usable for the whole range 
of use cases.

Mikko
Mikko Perttunen Jan. 28, 2021, 11:46 a.m. UTC | #12
On 1/28/21 12:06 AM, Dmitry Osipenko wrote:
> 28.01.2021 00:57, Mikko Perttunen пишет:
>>
>>
>> On 1/27/21 11:26 PM, Dmitry Osipenko wrote:
>>> 26.01.2021 05:45, Mikko Perttunen пишет:
>>>>> 5. The hardware state of sync points should be reset when sync point is
>>>>> requested, not when host1x driver is initialized.
>>>>
>>>> This may be doable, but I don't think it is critical for this UAPI, so
>>>> let's consider it after this series.
>>>>
>>>> The userspace should anyway not be able to assume the initial value of
>>>> the syncpoint upon allocation. The kernel should set it to some high
>>>> value to catch any issues related to wraparound.
>>>
>>> This is critical because min != max when sync point is requested.
>>
>> That I would just consider a bug, and it can be fixed. But it's
>> orthogonal to whether the value gets reset every time the syncpoint is
>> allocated.
>>
>>>
>>>> Also, this makes code more complicated since it now needs to ensure all
>>>> waits on the syncpoint have completed before freeing the syncpoint,
>>>> which can be nontrivial e.g. if the waiter is in a different virtual
>>>> machine or some other device connected via PCIe (a real usecase).
>>>
>>> It sounds to me that these VM sync points should be treated very
>>> separately from a generic sync points, don't you think so? Let's not mix
>>> them and get the generic sync points usable first.
>>>
>>
>> They are not special in any way, I'm just referring to cases where the
>> waiter (consumer) is remote. The allocator of the syncpoint (producer)
>> doesn't necessarily even need to know about it. The same concern is
>> applicable within a single VM, or single application as well. Just
>> putting out the point that this is something that needs to be taken care
>> of if we were to reset the value.
> 
> Will kernel driver know that it deals with a VM sync point? >
> Will it be possible to get a non-VM sync point explicitly?
> 
> If driver knows that it deals with a VM sync point, then we can treat it
> specially, avoiding the reset and etc.
> 

There is no distinction between a "VM syncpoint" and a "non-VM 
syncpoint". To provide an example on the issue, consider we have VM1 and 
VM2. VM1 is running some camera software that produces frames. VM2 runs 
some analysis software that consumes those frames through shared memory. 
In between there is some software that takes the postfences of the 
camera software and transmits them to the analysis software to be used 
as prefences. Only this transmitting software needs to know anything 
about multiple VMs being in use.

At any time, if we want to reset the value of the syncpoint in question, 
we must ensure that all fences waiting on that syncpoint have observed 
the fence's threshold first.

Consider an interleaving like this:

VM1 (Camera)				VM2 (Analysis)
-------------------------------------------------------
Send postfence (threshold=X)
					Recv postfence (threshold=X)
Increment syncpoint value to X
Free syncpoint
Reset syncpoint value to Y
					Wait on postfence
-------------------------------------------------------

Now depending on the relative values of X and Y, either VM2 progresses 
(correctly), or gets stuck. If we didn't reset the syncpoint, the race 
could not occur (unless VM1 managed to increment the syncpoint 2^31 
times before VM2's wait starts, which is very unrealistic).

We can remove "VM1" and "VM2" everywhere here, and just consider two 
applications in one VM, too, or two parts of one application. Within one 
VM the issue is of course easier because the driver can have knowledge 
about fences and solve the race internally, but I'd always prefer not 
having such special cases.

Now, admittedly this is probably not a huge problem unless we are 
freeing syncpoints all the time, but nevertheless something to consider.

Mikko
Thierry Reding Jan. 28, 2021, 4:58 p.m. UTC | #13
On Thu, Jan 28, 2021 at 01:08:54PM +0200, Mikko Perttunen wrote:
> On 1/27/21 11:20 PM, Dmitry Osipenko wrote:
> > 26.01.2021 05:45, Mikko Perttunen пишет:
> > > > 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
> > > > increments.  The job's sync point will be allocated dynamically when job
> > > > is submitted.  We will need a flag for the sync_incr and wait_syncpt
> > > > commands, saying "it's a job's sync point increment/wait"
> > > 
> > > Negative. Like I have explained in previous discussions, with the
> > > current way the usage of hardware resources is much more deterministic
> > > and obvious. I disagree on the point that this is much more complicated
> > > for the userspace. Separating syncpoint and channel allocation is one of
> > > the primary motivations of this series for me.
> > 
> > Sync points are a limited resource. The most sensible way to work around
> > it is to keep sync points within kernel as much as possible. This is not
> > only much simpler for user space, but also allows to utilize DRM API
> > properly without re-inventing what already exists and it's easier to
> > maintain hardware in a good state.
> 
> I've spent the last few years designing for automotive and industrial
> products, where we don't want to at runtime notice that the system is out of
> free syncpoints and because of that we can only process the next camera
> frame in a second or two instead of 16 milliseconds. We need to know that
> once we have allocated the resource, it is there. The newer chips are also
> designed to support this.
> 
> Considering Linux is increasingly being used for such applications, and they
> are important target markets for NVIDIA, these need to be supported.
> 
> Because of the above design constraint the userspace software that runs in
> these environments also expects resources to be allocated up front. This
> isn't a matter of having to design that software according to what kind of
> allocation API we decide do at Linux level -- it's no use designing for
> dynamic allocation if it leads to you not meeting the safety requirement of
> needing to ensure you have all resources allocated up front.
> 
> This isn't a good design feature just in a car, but in anything that needs
> to be reliable. However, it does pose some tradeoffs, and if you think that
> running out of syncpoints on T20-T114 because of upfront allocation is an
> actual problem, I'm not opposed to having both options available.

I think that's a fair point. I don't see why we can't support both
implicit and explicit syncpoint requests. If most of the use-cases can
work with implicit syncpoints and let the kernel handle all aspects of
it, that's great. But there's no reason we can't provide more explicit
controls for cases where it's really important that all the resources
are allocated upfront.

Ultimately this is very specific to each use-case, so I think having
both options will provide us with the most flexibility.

Thierry
Dmitry Osipenko Jan. 29, 2021, 5:30 p.m. UTC | #14
28.01.2021 19:58, Thierry Reding writes:
> On Thu, Jan 28, 2021 at 01:08:54PM +0200, Mikko Perttunen wrote:
>> On 1/27/21 11:20 PM, Dmitry Osipenko wrote:
>>> 26.01.2021 05:45, Mikko Perttunen пишет:
>>>>> 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
>>>>> increments.  The job's sync point will be allocated dynamically when job
>>>>> is submitted.  We will need a flag for the sync_incr and wait_syncpt
>>>>> commands, saying "it's a job's sync point increment/wait"
>>>>
>>>> Negative. Like I have explained in previous discussions, with the
>>>> current way the usage of hardware resources is much more deterministic
>>>> and obvious. I disagree on the point that this is much more complicated
>>>> for the userspace. Separating syncpoint and channel allocation is one of
>>>> the primary motivations of this series for me.
>>>
>>> Sync points are a limited resource. The most sensible way to work around
>>> it is to keep sync points within kernel as much as possible. This is not
>>> only much simpler for user space, but also allows to utilize DRM API
>>> properly without re-inventing what already exists and it's easier to
>>> maintain hardware in a good state.
>>
>> I've spent the last few years designing for automotive and industrial
>> products, where we don't want to at runtime notice that the system is out of
>> free syncpoints and because of that we can only process the next camera
>> frame in a second or two instead of 16 milliseconds. We need to know that
>> once we have allocated the resource, it is there. The newer chips are also
>> designed to support this.
>>
>> Considering Linux is increasingly being used for such applications, and they
>> are important target markets for NVIDIA, these need to be supported.
>>
>> Because of the above design constraint the userspace software that runs in
>> these environments also expects resources to be allocated up front. This
>> isn't a matter of having to design that software according to what kind of
>> allocation API we decide do at Linux level -- it's no use designing for
>> dynamic allocation if it leads to you not meeting the safety requirement of
>> needing to ensure you have all resources allocated up front.
>>
>> This isn't a good design feature just in a car, but in anything that needs
>> to be reliable. However, it does pose some tradeoffs, and if you think that
>> running out of syncpoints on T20-T114 because of upfront allocation is an
>> actual problem, I'm not opposed to having both options available.

The word "reliable" contradicts to the error-prone approach. On the
other hand, it should be very difficult to change the stubborn
downstream firmware and we want driver to be usable as much as possible,
so in reality not much can be done about it.

> I think that's a fair point. I don't see why we can't support both
> implicit and explicit syncpoint requests. If most of the use-cases can
> work with implicit syncpoints and let the kernel handle all aspects of
> it, that's great. But there's no reason we can't provide more explicit
> controls for cases where it's really important that all the resources
> are allocated upfront.
> 
> Ultimately this is very specific to each use-case, so I think having
> both options will provide us with the most flexibility.
It should be fine to support both. This will add complexity to the
driver, thus it needs to be done wisely.

I'll need more time to think about it.
Mikko Perttunen Feb. 3, 2021, 11:18 a.m. UTC | #15
On 1/29/21 7:30 PM, Dmitry Osipenko wrote:
> 28.01.2021 19:58, Thierry Reding пишет:
>> On Thu, Jan 28, 2021 at 01:08:54PM +0200, Mikko Perttunen wrote:
>>> On 1/27/21 11:20 PM, Dmitry Osipenko wrote:
>>>> 26.01.2021 05:45, Mikko Perttunen пишет:
>>>>>> 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
>>>>>> increments.  The job's sync point will be allocated dynamically when job
>>>>>> is submitted.  We will need a flag for the sync_incr and wait_syncpt
>>>>>> commands, saying "it's a job's sync point increment/wait"
>>>>>
>>>>> Negative. Like I have explained in previous discussions, with the
>>>>> current way the usage of hardware resources is much more deterministic
>>>>> and obvious. I disagree on the point that this is much more complicated
>>>>> for the userspace. Separating syncpoint and channel allocation is one of
>>>>> the primary motivations of this series for me.
>>>>
>>>> Sync points are a limited resource. The most sensible way to work around
>>>> it is to keep sync points within kernel as much as possible. This is not
>>>> only much simpler for user space, but also allows to utilize DRM API
>>>> properly without re-inventing what already exists and it's easier to
>>>> maintain hardware in a good state.
>>>
>>> I've spent the last few years designing for automotive and industrial
>>> products, where we don't want to at runtime notice that the system is out of
>>> free syncpoints and because of that we can only process the next camera
>>> frame in a second or two instead of 16 milliseconds. We need to know that
>>> once we have allocated the resource, it is there. The newer chips are also
>>> designed to support this.
>>>
>>> Considering Linux is increasingly being used for such applications, and they
>>> are important target markets for NVIDIA, these need to be supported.
>>>
>>> Because of the above design constraint the userspace software that runs in
>>> these environments also expects resources to be allocated up front. This
>>> isn't a matter of having to design that software according to what kind of
>>> allocation API we decide do at Linux level -- it's no use designing for
>>> dynamic allocation if it leads to you not meeting the safety requirement of
>>> needing to ensure you have all resources allocated up front.
>>>
>>> This isn't a good design feature just in a car, but in anything that needs
>>> to be reliable. However, it does pose some tradeoffs, and if you think that
>>> running out of syncpoints on T20-T114 because of upfront allocation is an
>>> actual problem, I'm not opposed to having both options available.
> 
> The word "reliable" contradicts to the error-prone approach. On the
> other hand, it should be very difficult to change the stubborn
> downstream firmware and we want driver to be usable as much as possible,
> so in reality not much can be done about it.

Depends on the perspective.

> 
>> I think that's a fair point. I don't see why we can't support both
>> implicit and explicit syncpoint requests. If most of the use-cases can
>> work with implicit syncpoints and let the kernel handle all aspects of
>> it, that's great. But there's no reason we can't provide more explicit
>> controls for cases where it's really important that all the resources
>> are allocated upfront.
>>
>> Ultimately this is very specific to each use-case, so I think having
>> both options will provide us with the most flexibility.
> It should be fine to support both. This will add complexity to the
> driver, thus it needs to be done wisely.
> 
> I'll need more time to think about it.
> 

How about something like this:

Turn the syncpt_incr field back into an array of structs like

#define DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ		(1<<0)
#define DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT	(1<<1)

struct drm_tegra_submit_syncpt_incr {
	/* can be left as zero if using dynamic syncpt */
	__u32 syncpt_id;
	__u32 flags;

	struct {
		__u32 syncobj;
		__u32 value;
	} fence;

	/* patch word as such:
	 * *word = *word | (syncpt_id << shift)
	 */
	struct {
		__u32 gather_offset_words;
		__u32 shift;
	} patch;
};

So this will work similarly to the buffer reloc system; the kernel 
driver will allocate a job syncpoint and patch in the syncpoint ID if 
requested, and allow outputting syncobjs for each increment.
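
As a usage example, a single increment on a dynamically allocated job 
syncpoint, with the ID patched into word 12 of a gather and the resulting 
fence stored into an existing syncobj handle, would look roughly like this 
(all values illustrative, job_syncobj created earlier via 
DRM_IOCTL_SYNCOBJ_CREATE):

struct drm_tegra_submit_syncpt_incr incr = {
	.syncpt_id = 0,		/* dynamic: the kernel picks the syncpoint */
	.flags = DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT |
		 DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ,
	.fence = {
		.syncobj = job_syncobj,
		.value = 0,
	},
	.patch = {
		.gather_offset_words = 12,	/* word in the gather to patch */
		.shift = 0,			/* *word |= syncpt_id << 0 */
	},
};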

Mikko
Dmitry Osipenko Feb. 27, 2021, 11:19 a.m. UTC | #16
03.02.2021 14:18, Mikko Perttunen writes:
...
>> I'll need more time to think about it.
>>
> 
> How about something like this:
> 
> Turn the syncpt_incr field back into an array of structs like
> 
> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ        (1<<0)
> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT    (1<<1)
> 
> struct drm_tegra_submit_syncpt_incr {
>     /* can be left as zero if using dynamic syncpt */
>     __u32 syncpt_id;
>     __u32 flags;
> 
>     struct {
>         __u32 syncobj;
>         __u32 value;
>     } fence;
> 
>     /* patch word as such:
>          * *word = *word | (syncpt_id << shift)
>          */
>     struct {
>         __u32 gather_offset_words;
>         __u32 shift;
>     } patch;
> };
> 
> So this will work similarly to the buffer reloc system; the kernel
> driver will allocate a job syncpoint and patch in the syncpoint ID if
> requested, and allows outputting syncobjs for each increment.

I haven't got any great ideas so far, but it feels like it will be easier
and cleaner if we could have separate job paths (and job IOCTLs) based on
hardware generation, since the workloads are too different. The needs of
the newer h/w are too obscure to me, and the absence of userspace code,
firmware sources and full h/w documentation does not help.

There should still be quite a lot to share, but things like
mapping-to-channel and VM sync points are too far removed from the older
h/w, IMO. This means that the code path before drm-sched and the path for
job-timeout handling should be separate.

Maybe later on it will become clear that we actually could unify it all
nicely, but for now it doesn't look like a good idea to me.

Mikko, do you have any objections to trying out a variant with separate
paths?
Mikko Perttunen March 1, 2021, 8:19 a.m. UTC | #17
On 2/27/21 1:19 PM, Dmitry Osipenko wrote:
> 03.02.2021 14:18, Mikko Perttunen пишет:
> ...
>>> I'll need more time to think about it.
>>>
>>
>> How about something like this:
>>
>> Turn the syncpt_incr field back into an array of structs like
>>
>> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ        (1<<0)
>> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT    (1<<1)
>>
>> struct drm_tegra_submit_syncpt_incr {
>>      /* can be left as zero if using dynamic syncpt */
>>      __u32 syncpt_id;
>>      __u32 flags;
>>
>>      struct {
>>          __u32 syncobj;
>>          __u32 value;
>>      } fence;
>>
>>      /* patch word as such:
>>           * *word = *word | (syncpt_id << shift)
>>           */
>>      struct {
>>          __u32 gather_offset_words;
>>          __u32 shift;
>>      } patch;
>> };
>>
>> So this will work similarly to the buffer reloc system; the kernel
>> driver will allocate a job syncpoint and patch in the syncpoint ID if
>> requested, and allows outputting syncobjs for each increment.
> 
> I haven't got any great ideas so far, but it feels like it will be
> easier and cleaner if we could have separate job paths (and job
> IOCTLs) based on hardware generation, since the workloads are too
> different. The needs of the newer h/w are too obscure to me, and the
> absence of userspace code, firmware sources and full h/w
> documentation does not help.
>
> There should still be quite a lot to share, but things like
> mapping-to-channel and VM sync points are too far removed from the
> older h/w, IMO. This means that the code path before drm-sched and
> the path for job-timeout handling should be separate.
>
> Maybe later on it will become clear that we actually could unify it
> all nicely, but for now it doesn't look like a good idea to me.
> 
> Mikko, do you have any objections to trying out a variant with the
> separate paths?
> 

I'm on vacation for the next two weeks. I'll think about it and post a 
proposal once I'm back to work.

Mikko
Thierry Reding March 23, 2021, 6:21 p.m. UTC | #18
On Sat, Feb 27, 2021 at 02:19:39PM +0300, Dmitry Osipenko wrote:
> 03.02.2021 14:18, Mikko Perttunen wrote:
> ...
> >> I'll need more time to think about it.
> >>
> > 
> > How about something like this:
> > 
> > Turn the syncpt_incr field back into an array of structs like
> > 
> > #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ        (1<<0)
> > #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT    (1<<1)
> > 
> > struct drm_tegra_submit_syncpt_incr {
> >     /* can be left as zero if using dynamic syncpt */
> >     __u32 syncpt_id;
> >     __u32 flags;
> > 
> >     struct {
> >         __u32 syncobj;
> >         __u32 value;
> >     } fence;
> > 
> >     /* patch word as such:
> >          * *word = *word | (syncpt_id << shift)
> >          */
> >     struct {
> >         __u32 gather_offset_words;
> >         __u32 shift;
> >     } patch;
> > };
> > 
> > So this will work similarly to the buffer reloc system; the kernel
> > driver will allocate a job syncpoint and patch in the syncpoint ID if
> > requested, and allows outputting syncobjs for each increment.
> 
> I haven't got any great ideas so far, but it feels like it will be
> easier and cleaner if we could have separate job paths (and job
> IOCTLs) based on hardware generation, since the workloads are too
> different. The needs of the newer h/w are too obscure to me, and the
> absence of userspace code, firmware sources and full h/w
> documentation does not help.
>
> There should still be quite a lot to share, but things like
> mapping-to-channel and VM sync points are too far removed from the
> older h/w, IMO. This means that the code path before drm-sched and
> the path for job-timeout handling should be separate.
>
> Maybe later on it will become clear that we actually could unify it
> all nicely, but for now it doesn't look like a good idea to me.

Sorry for jumping in rather randomly here and elsewhere, but it's been a
long time since the discussion and I just want to share my thoughts on a
couple of topics in order to hopefully help move this forward somehow.

For UAPI, "unifying it later" doesn't really work. So I think the only
realistic option is to make a best attempt at getting the UABI right so
that it works for all existing use-cases and ideally perhaps even as of
yet unknown use-cases in the future. As with all APIs this means that
there's going to be a need to abstract away some of the hardware details
so that we don't have to deal with too many low-level details in
userspace, because otherwise the UAPI is just going to be a useless
mess.

I think a proposal such as the above to allow both implicit and explicit
syncpoints makes sense. For what it's worth, it's fairly similar to what
we had come up with last time we tried destaging the ABI, although back
at the time I'm not sure we had even considered explicit syncpoint usage
yet. I think where reasonably possible this kind of optional behaviour
is acceptable, but I don't think having two completely separate paths is
going to help in any way. If anything it's just going to make it more
difficult to maintain the code and get it to a usable state in the first
place.
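
To illustrate that optionality, a hedged sketch of how userspace might
fill the proposed struct for the two cases. The struct and flag names
are taken from the proposal quoted above; the concrete syncpoint ID,
syncobj handles and gather offset are made up, and the exact semantics
of the fence.value field are an assumption on my part.

/* Explicit syncpoint: userspace manages syncpoint 27 itself and asks
 * for syncobj handle 5 to be attached to the increment. */
struct drm_tegra_submit_syncpt_incr explicit_incr = {
        .syncpt_id = 27,
        .flags = DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ,
        .fence = { .syncobj = 5, .value = 0 },
};

/* Implicit/dynamic syncpoint: the kernel allocates a job syncpoint and
 * ORs its ID into gather word 42 at bit offset 0. */
struct drm_tegra_submit_syncpt_incr dynamic_incr = {
        .syncpt_id = 0, /* left as zero when using a dynamic syncpt */
        .flags = DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT |
                 DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ,
        .fence = { .syncobj = 6, .value = 0 },
        .patch = { .gather_offset_words = 42, .shift = 0 },
};

Either way userspace ends up with a plain DRM syncobj, so the wait side
does not need to care which flavour was used.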

Like I said elsewhere, the programming model for host1x hasn't changed
since Tegra20. It's rather evolved and gained a couple more features,
but that doesn't change anything about how userspace uses it.

Thierry
Dmitry Osipenko March 23, 2021, 7:57 p.m. UTC | #19
23.03.2021 21:21, Thierry Reding wrote:
> On Sat, Feb 27, 2021 at 02:19:39PM +0300, Dmitry Osipenko wrote:
>> 03.02.2021 14:18, Mikko Perttunen wrote:
>> ...
>>>> I'll need more time to think about it.
>>>>
>>>
>>> How about something like this:
>>>
>>> Turn the syncpt_incr field back into an array of structs like
>>>
>>> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ        (1<<0)
>>> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT    (1<<1)
>>>
>>> struct drm_tegra_submit_syncpt_incr {
>>>     /* can be left as zero if using dynamic syncpt */
>>>     __u32 syncpt_id;
>>>     __u32 flags;
>>>
>>>     struct {
>>>         __u32 syncobj;
>>>         __u32 value;
>>>     } fence;
>>>
>>>     /* patch word as such:
>>>          * *word = *word | (syncpt_id << shift)
>>>          */
>>>     struct {
>>>         __u32 gather_offset_words;
>>>         __u32 shift;
>>>     } patch;
>>> };
>>>
>>> So this will work similarly to the buffer reloc system; the kernel
>>> driver will allocate a job syncpoint and patch in the syncpoint ID if
>>> requested, and allows outputting syncobjs for each increment.
>>
>> I haven't got any great ideas so far, but it feels like it will be
>> easier and cleaner if we could have separate job paths (and job
>> IOCTLs) based on hardware generation, since the workloads are too
>> different. The needs of the newer h/w are too obscure to me, and the
>> absence of userspace code, firmware sources and full h/w
>> documentation does not help.
>>
>> There should still be quite a lot to share, but things like
>> mapping-to-channel and VM sync points are too far removed from the
>> older h/w, IMO. This means that the code path before drm-sched and
>> the path for job-timeout handling should be separate.
>>
>> Maybe later on it will become clear that we actually could unify it
>> all nicely, but for now it doesn't look like a good idea to me.
> 
> Sorry for jumping in rather randomly here and elsewhere, but it's been a
> long time since the discussion and I just want to share my thoughts on a
> couple of topics in order to hopefully help move this forward somehow.
> 
> For UAPI, "unifying it later" doesn't really work.

Of course I meant a "later version of this series" :) Sorry for not
making it clear.

> So I think the only
> realistic option is to make a best attempt at getting the UABI right so
> that it works for all existing use-cases and ideally perhaps even as of
> yet unknown use-cases in the future. As with all APIs this means that
> there's going to be a need to abstract away some of the hardware details
> so that we don't have to deal with too many low-level details in
> userspace, because otherwise the UAPI is just going to be a useless
> mess.
> 
> I think a proposal such as the above to allow both implicit and explicit
> syncpoints makes sense. For what it's worth, it's fairly similar to what
> we had come up with last time we tried destaging the ABI, although back
> at the time I'm not sure we had even considered explicit syncpoint usage
> yet. I think where reasonably possible this kind of optional behaviour
> is acceptable, but I don't think having two completely separate paths is
> going to help in any way. If anything it's just going to make it more
> difficult to maintain the code and get it to a usable state in the first
> place.
> 
> Like I said elsewhere, the programming model for host1x hasn't changed
> since Tegra20. It's rather evolved and gained a couple more features,
> but that doesn't change anything about how userspace uses it.

Not having complete control over sync point state is a radical change,
IMO. I prefer not to use this legacy and error-prone way of handling
sync points, at least on older h/w where avoiding it is possible. This
is what the downstream driver did 10 years ago and what it still
continues to do. I was very happy to move away from this unnecessary
complication in the experimental grate-kernel driver and I think it
would be great to do the same in mainline as well.

The need to map buffers explicitly is also a big difference. Having to
map a BO for each channel is quite a big over-complication, as we
already found out in the current version of the UAPI.

Alright, perhaps the mapping could be improved for older userspace if
we move away from per-channel contexts to a single DRM context, like I
was already suggesting to Mikko before. I.e. instead of mapping a BO
"to a channel", we would map the BO "to h/w clients" within the DRM
context. This should allow older userspace to create a single mapping
for all channels/clients using a single IOCTL and then have a single
"mapping handle" to care about. Objections?
Dmitry Osipenko March 23, 2021, 8:13 p.m. UTC | #20
23.03.2021 22:57, Dmitry Osipenko wrote:
> 23.03.2021 21:21, Thierry Reding wrote:
>> On Sat, Feb 27, 2021 at 02:19:39PM +0300, Dmitry Osipenko wrote:
>>> 03.02.2021 14:18, Mikko Perttunen wrote:
>>> ...
>>>>> I'll need more time to think about it.
>>>>>
>>>>
>>>> How about something like this:
>>>>
>>>> Turn the syncpt_incr field back into an array of structs like
>>>>
>>>> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_REPLACE_SYNCOBJ        (1<<0)
>>>> #define DRM_TEGRA_SUBMIT_SYNCPT_INCR_PATCH_DYNAMIC_SYNCPT    (1<<1)
>>>>
>>>> struct drm_tegra_submit_syncpt_incr {
>>>>     /* can be left as zero if using dynamic syncpt */
>>>>     __u32 syncpt_id;
>>>>     __u32 flags;
>>>>
>>>>     struct {
>>>>         __u32 syncobj;
>>>>         __u32 value;
>>>>     } fence;
>>>>
>>>>     /* patch word as such:
>>>>          * *word = *word | (syncpt_id << shift)
>>>>          */
>>>>     struct {
>>>>         __u32 gather_offset_words;
>>>>         __u32 shift;
>>>>     } patch;
>>>> };
>>>>
>>>> So this will work similarly to the buffer reloc system; the kernel
>>>> driver will allocate a job syncpoint and patch in the syncpoint ID if
>>>> requested, and allows outputting syncobjs for each increment.
>>>
>>> I haven't got any great ideas so far, but it feels like it will be
>>> easier and cleaner if we could have separate job paths (and job
>>> IOCTLs) based on hardware generation, since the workloads are too
>>> different. The needs of the newer h/w are too obscure to me, and the
>>> absence of userspace code, firmware sources and full h/w
>>> documentation does not help.
>>>
>>> There should still be quite a lot to share, but things like
>>> mapping-to-channel and VM sync points are too far removed from the
>>> older h/w, IMO. This means that the code path before drm-sched and
>>> the path for job-timeout handling should be separate.
>>>
>>> Maybe later on it will become clear that we actually could unify it
>>> all nicely, but for now it doesn't look like a good idea to me.
>>
>> Sorry for jumping in rather randomly here and elsewhere, but it's been a
>> long time since the discussion and I just want to share my thoughts on a
>> couple of topics in order to hopefully help move this forward somehow.
>>
>> For UAPI, "unifying it later" doesn't really work.
> 
> Of course I meant a "later version of this series" :) Sorry for not
> making it clear.
> 
>> So I think the only
>> realistic option is to make a best attempt at getting the UABI right so
>> that it works for all existing use-cases and ideally perhaps even as of
>> yet unknown use-cases in the future. As with all APIs this means that
>> there's going to be a need to abstract away some of the hardware details
>> so that we don't have to deal with too many low-level details in
>> userspace, because otherwise the UAPI is just going to be a useless
>> mess.
>>
>> I think a proposal such as the above to allow both implicit and explicit
>> syncpoints makes sense. For what it's worth, it's fairly similar to what
>> we had come up with last time we tried destaging the ABI, although back
>> at the time I'm not sure we had even considered explicit syncpoint usage
>> yet. I think where reasonably possible this kind of optional behaviour
>> is acceptable, but I don't think having two completely separate paths is
>> going to help in any way. If anything it's just going to make it more
>> difficult to maintain the code and get it to a usable state in the first
>> place.
>>
>> Like I said elsewhere, the programming model for host1x hasn't changed
>> since Tegra20. It's rather evolved and gained a couple more features,
>> but that doesn't change anything about how userspace uses it.
> 
> Not having complete control over sync point state is a radical change,
> IMO. I prefer not to use this legacy and error-prone way of handling
> sync points, at least on older h/w where avoiding it is possible. This
> is what the downstream driver did 10 years ago and what it still
> continues to do. I was very happy to move away from this unnecessary
> complication in the experimental grate-kernel driver and I think it
> would be great to do the same in mainline as well.
>
> The need to map buffers explicitly is also a big difference. Having to
> map a BO for each channel is quite a big over-complication, as we
> already found out in the current version of the UAPI.
>
> Alright, perhaps the mapping could be improved for older userspace if
> we move away from per-channel contexts to a single DRM context, like I
> was already suggesting to Mikko before. I.e. instead of mapping a BO
> "to a channel", we would map the BO "to h/w clients" within the DRM
> context. This should allow older userspace to create a single mapping
> for all channels/clients using a single IOCTL and then have a single
> "mapping handle" to care about. Objections?
> 

I just recalled that Mikko didn't like *not* having the per-channel
contexts, saying that they should be good to have for later SoCs.

We could have a DRM context and channel contexts within that DRM
context. The mapping would then be done to the channel contexts of the
DRM context; this should allow us to retain the per-channel contexts
and still have a single mapping for older userspace.
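
Roughly, the relationship I have in mind looks like the sketch below
(illustrative structs only, not code from the series): the DRM context
owns the channel contexts, and each mapping hangs off the channel
context it was created for.

#include <linux/list.h>
#include <linux/types.h>

struct sketch_mapping {
        struct list_head node;       /* on channel_ctx->mappings */
        u32 gem_handle;
        dma_addr_t iova;             /* address seen by the h/w client */
};

struct sketch_channel_ctx {
        struct list_head node;       /* on drm_ctx->channel_ctxs */
        unsigned int client;         /* h/w client this context drives */
        struct list_head mappings;   /* mappings created for this context */
};

struct sketch_drm_ctx {
        struct list_head channel_ctxs;
};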