[v2,02/15] drm/panthor: Add uAPI

Message ID	20230809165330.2451699-3-boris.brezillon@collabora.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> sender: bbrezillon) by madras.collabora.co.uk (Postfix) with ESMTPSA id 8F5206607208; Wed, 9 Aug 2023 17:53:34 +0100 (BST) From: Boris Brezillon <boris.brezillon@collabora.com> To: dri-devel@lists.freedesktop.org Subject: [PATCH v2 02/15] drm/panthor: Add uAPI Date: Wed, 9 Aug 2023 18:53:15 +0200 Message-ID: <20230809165330.2451699-3-boris.brezillon@collabora.com> In-Reply-To: <20230809165330.2451699-1-boris.brezillon@collabora.com> References: <20230809165330.2451699-1-boris.brezillon@collabora.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Cc: Nicolas Boichat <drinkcat@chromium.org>, Daniel Stone <daniels@collabora.com>, Neil Armstrong <neil.armstrong@linaro.org>, Liviu Dudau <Liviu.Dudau@arm.com>, Steven Price <steven.price@arm.com>, Boris Brezillon <boris.brezillon@collabora.com>, =?utf-8?b?Q2zDqW1lbnQgUMOp?= =?utf-8?b?cm9u?= <peron.clem@gmail.com>, "Marty E . Plummer" <hanetzer@startmail.com>, Robin Murphy <robin.murphy@arm.com>, Faith Ekstrand <faith.ekstrand@collabora.com> Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm: Add a driver for FW-based Mali GPUs \| expand [v2,00/15] drm: Add a driver for FW-based Mali GPUs [v2,01/15] drm/shmem-helper: Make pages_use_count an atomic_t [v2,02/15] drm/panthor: Add uAPI [v2,03/15] drm/panthor: Add GPU register definitions [v2,04/15] drm/panthor: Add the device logical block [v2,05/15] drm/panthor: Add the GPU logical block [v2,06/15] drm/panthor: Add GEM logical block [v2,07/15] drm/panthor: Add the devfreq logical block [v2,08/15] drm/panthor: Add the MMU/VM logical block [v2,09/15] drm/panthor: Add the FW logical block [v2,10/15] drm/panthor: Add the heap logical block [v2,11/15] drm/panthor: Add the scheduler logical block [v2,12/15] drm/panthor: Add the driver frontend block [v2,13/15] drm/panthor: Allow driver compilation [v2,14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver [v2,15/15] drm/panthor: Add an entry to MAINTAINERS

Boris Brezillon Aug. 9, 2023, 4:53 p.m. UTC

Panthor follows the lead of other recently submitted drivers with
ioctls allowing us to support modern Vulkan features, like sparse memory
binding:

- Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
  with the 'exclusive-VM' bit to speed-up BO reservation on job submission
- VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
  ioctl is loosely based on the Xe model, and can handle both
  asynchronous and synchronous requests
- GPU execution context creation/destruction, tiler heap context creation
  and job submission. Those ioctls reflect how the hardware/scheduler
  works and are thus driver specific.

We also have a way to expose IO regions, such that the usermode driver
can directly access specific/well-isolate registers, like the
LATEST_FLUSH register used to implement cache-flush reduction.

This uAPI intentionally keeps usermode queues out of the scope, which
explains why doorbell registers and command stream ring-buffers are not
directly exposed to userspace.

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
- Add the concept of exclusive_vm at BO creation time
- Add missing padding fields
- Add documentation

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 Documentation/gpu/driver-uapi.rst |   5 +
 include/uapi/drm/panthor_drm.h    | 862 ++++++++++++++++++++++++++++++
 2 files changed, 867 insertions(+)
 create mode 100644 include/uapi/drm/panthor_drm.h

Steven Price Aug. 11, 2023, 2:13 p.m. UTC | #1

On 09/08/2023 17:53, Boris Brezillon wrote:
> Panthor follows the lead of other recently submitted drivers with
> ioctls allowing us to support modern Vulkan features, like sparse memory
> binding:
> 
> - Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
>   with the 'exclusive-VM' bit to speed-up BO reservation on job submission
> - VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
>   ioctl is loosely based on the Xe model, and can handle both
>   asynchronous and synchronous requests
> - GPU execution context creation/destruction, tiler heap context creation
>   and job submission. Those ioctls reflect how the hardware/scheduler
>   works and are thus driver specific.
> 
> We also have a way to expose IO regions, such that the usermode driver
> can directly access specific/well-isolate registers, like the
> LATEST_FLUSH register used to implement cache-flush reduction.
> 
> This uAPI intentionally keeps usermode queues out of the scope, which
> explains why doorbell registers and command stream ring-buffers are not
> directly exposed to userspace.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
> - Add the concept of exclusive_vm at BO creation time
> - Add missing padding fields
> - Add documentation
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Looks good, just documentation typos/corrections below. With those fixed

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  Documentation/gpu/driver-uapi.rst |   5 +
>  include/uapi/drm/panthor_drm.h    | 862 ++++++++++++++++++++++++++++++
>  2 files changed, 867 insertions(+)
>  create mode 100644 include/uapi/drm/panthor_drm.h
> 
> diff --git a/Documentation/gpu/driver-uapi.rst b/Documentation/gpu/driver-uapi.rst
> index c08bcbb95fb3..7a667901830f 100644
> --- a/Documentation/gpu/driver-uapi.rst
> +++ b/Documentation/gpu/driver-uapi.rst
> @@ -17,3 +17,8 @@ VM_BIND / EXEC uAPI
>      :doc: Overview
>  
>  .. kernel-doc:: include/uapi/drm/nouveau_drm.h
> +
> +drm/panthor uAPI
> +================
> +
> +.. kernel-doc:: include/uapi/drm/panthor_drm.h
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> new file mode 100644
> index 000000000000..e217eb5ad198
> --- /dev/null
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -0,0 +1,862 @@
> +/* SPDX-License-Identifier: MIT */
> +/* Copyright (C) 2023 Collabora ltd. */
> +#ifndef _PANTHOR_DRM_H_
> +#define _PANTHOR_DRM_H_
> +
> +#include "drm.h"
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/**
> + * DOC: Introduction
> + *
> + * This documentation decribes the Panthor IOCTLs.
                         ^^^^^^^^ describes
> + *
> + * Just a few generic rules about the data passed to the Panthor IOCTLs:
> + *
> + * - Structures must be aligned on 64-bit/8-byte. If the object is not
> + *   naturally aligned, a padding field must be added.
> + * - Fields must be explicity aligned to their natural type alignment with
                       ^^^^^^^^^ explicitly

> + *   pad[0..N] fields.
> + * - All padding fields will be checked by the driver to make sure they are
> + *   zeroed.
> + * - Flags can be added, but not removed/replaced.
> + * - New fields can be added to the main structures (the structures
> + *   directly passed to the ioctl). Those fiels can be added at the end of
					     ^^^^^ fields

> + *   the structure, or replace existing padding fields. Any new field being
> + *   added must preserve the behavior that existed before those fields were
> + *   added when a value of zero is passed.
> + * - New fields can be added to indirect objects (objects pointed by the
> + *   main structure), iff those objects are passed a size to reflect the
> + *   size known by the userspace driver (see drm_panthor_obj_array::stride
> + *   or drm_panthor_dev_query::size).
> + * - If the kernel driver is too old to know some fields, those will
> + *   be ignored (input) and set back to zero (output).

I presume this should be "will be ignored if zero (input)" and rejected
if non-zero?

> + * - If userspace is too old to know some fields, those will be zeroed
> + *   (input) before the structure is parsed by the kernel driver.
> + * - Each new flag/field addition must come with a driver version update so
> + *   the userspace driver doesn't have to trial and error to know which
> + *   flags are supported.
> + * - Structures should not contain unions, as this would defeat the
> + *   extensibility of such structures.
> + * - IOCTLs can't be removed or replaced. New IOCTL IDs should be placed
> + *   at the end of the drm_panthor_ioctl_id enum.
> + */
> +
> +/**
> + * DOC: MMIO regions exposed to userspace.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> + *
> + * File offset for all MMIO regions being exposed to userspace. Don't use
> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> + *
> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> + * GPU cache flushling through CS instructions, but the flush reduction
                ^^^^^^^^^ flushing

> + * mechanism requires a flush_id. This flush_id could be queried with an
> + * ioctl, but Arm provides a well-isolated register page containing only this
> + * read-only register, so let's expose this page through a static mmap offset
> + * and allow direct mapping of this MMIO region so we can avoid the
> + * user <-> kernel round-trip.
> + */
> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)
> +
> +/**
> + * DOC: IOCTL IDs
> + *
> + * enum drm_panthor_ioctl_id - IOCTL IDs
> + *
> + * Place new ioctls at the end, don't re-oder, don't replace or remove entries.
                                         ^^^^^^^ re-order

> + *
> + * These IDs are not meant to be used directly. Use the DRM_IOCTL_PANTHOR_xxx
> + * definitions instead.
> + */
> +enum drm_panthor_ioctl_id {
> +	/** @DRM_PANTHOR_DEV_QUERY: Query device information. */
> +	DRM_PANTHOR_DEV_QUERY = 0,
> +
> +	/** @DRM_PANTHOR_VM_CREATE: Create a VM. */
> +	DRM_PANTHOR_VM_CREATE,
> +
> +	/** @DRM_PANTHOR_VM_DESTROY: Destroy a VM. */
> +	DRM_PANTHOR_VM_DESTROY,
> +
> +	/** @DRM_PANTHOR_VM_BIND: Bind/unbind memory to a VM. */
> +	DRM_PANTHOR_VM_BIND,
> +
> +	/** @DRM_PANTHOR_BO_CREATE: Create a buffer object. */
> +	DRM_PANTHOR_BO_CREATE,
> +
> +	/**
> +	 * @DRM_PANTHOR_BO_MMAP_OFFSET: Get the file offset to pass to
> +	 * mmap to map a GEM object.
> +	 */
> +	DRM_PANTHOR_BO_MMAP_OFFSET,
> +
> +	/** @DRM_PANTHOR_GROUP_CREATE: Create a scheduling group. */
> +	DRM_PANTHOR_GROUP_CREATE,
> +
> +	/** @DRM_PANTHOR_GROUP_DESTROY: Destroy a scheduling group. */
> +	DRM_PANTHOR_GROUP_DESTROY,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_SUBMIT: Submit jobs to queues belonging
> +	 * to a specific scheduling group.
> +	 */
> +	DRM_PANTHOR_GROUP_SUBMIT,
> +
> +	/** @DRM_PANTHOR_GROUP_GET_STATE: Get the state of a scheduling group. */
> +	DRM_PANTHOR_GROUP_GET_STATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_CREATE: Create a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_CREATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_DESTROY,
> +};
> +
> +/**
> + * DRM_IOCTL_PANTHOR() - Build a Panthor IOCTL number
> + * @__access: Access type. Must be R, W or RW.
> + * @__id: One of the DRM_PANTHOR_xxx id.
> + * @__type: Suffix of the type being passed to the IOCTL.
> + *
> + * Don't use this macro directly, use the DRM_IOCTL_PANTHOR_xxx
> + * values instead.
> + *
> + * Return: An IOCTL number to be passed to ioctl() from userspace.
> + */
> +#define DRM_IOCTL_PANTHOR(__access, __id, __type) \
> +	DRM_IO ## __access(DRM_COMMAND_BASE + DRM_PANTHOR_ ## __id, \
> +			   struct drm_panthor_ ## __type)
> +
> +#define DRM_IOCTL_PANTHOR_DEV_QUERY \
> +	DRM_IOCTL_PANTHOR(WR, DEV_QUERY, dev_query)
> +#define DRM_IOCTL_PANTHOR_VM_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, VM_CREATE, vm_create)
> +#define DRM_IOCTL_PANTHOR_VM_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, VM_DESTROY, vm_destroy)
> +#define DRM_IOCTL_PANTHOR_VM_BIND \
> +	DRM_IOCTL_PANTHOR(WR, VM_BIND, vm_bind)
> +#define DRM_IOCTL_PANTHOR_BO_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, BO_CREATE, bo_create)
> +#define DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET \
> +	DRM_IOCTL_PANTHOR(WR, BO_MMAP_OFFSET, bo_mmap_offset)
> +#define DRM_IOCTL_PANTHOR_GROUP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_CREATE, group_create)
> +#define DRM_IOCTL_PANTHOR_GROUP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_DESTROY, group_destroy)
> +#define DRM_IOCTL_PANTHOR_GROUP_SUBMIT \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_SUBMIT, group_submit)
> +#define DRM_IOCTL_PANTHOR_GROUP_GET_STATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_GET_STATE, group_get_state)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
> +
> +/**
> + * DOC: IOCTL arguments
> + */
> +
> +/**
> + * struct drm_panthor_obj_array - Object array.
> + *
> + * This object is used to pass an array of objects whose size it subject to changes in
> + * future versions of the driver. In order to support this mutability, we pass a stride
> + * describing the size of the object as known by userspace.
> + *
> + * You shouldn't fill drm_panthor_obj_array fields directly. You should instead use
> + * the DRM_PANTHOR_OBJ_ARRAY() macro that takes care of initializing the stride to
> + * the object size.
> + */
> +struct drm_panthor_obj_array {
> +	/** @stride: Stride of object struct. Used for versioning. */
> +	__u32 stride;
> +
> +	/** @count: Number of objects in the array. */
> +	__u32 count;
> +
> +	/** @array: User pointer to an array of objects. */
> +	__u64 array;
> +};
> +
> +/**
> + * DRM_PANTHOR_OBJ_ARRAY() - Initialize a drm_panthor_obj_array field.
> + * @cnt: Number of elements in the array.
> + * @ptr: Pointer to the array to pass to the kernel.
> + *
> + * Macro initializing a drm_panthor_obj_array based on the object size as known
> + * by userspace.
> + */
> +#define DRM_PANTHOR_OBJ_ARRAY(cnt, ptr) \
> +	{ .stride = sizeof((ptr)[0]), .count = (cnt), .array = (__u64)(uintptr_t)(ptr) }
> +
> +/**
> + * enum drm_panthor_sync_op_flags - Synchronization operation flags.
> + */
> +enum drm_panthor_sync_op_flags {
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
> +	 * object type.
> +	 */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
> +	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
> +	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,
> +};
> +
> +/**
> + * struct drm_panthor_sync_op - Synchronization operation.
> + */
> +struct drm_panthor_sync_op {
> +	/** @flags: Synchronization operation flags. Combination of DRM_PANTHOR_SYNC_OP values. */
> +	__u32 flags;
> +
> +	/** @handle: Sync handle. */
> +	__u32 handle;
> +
> +	/**
> +	 * @timeline_value: MBZ if
> +	 * (flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK) !=
> +	 * DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ.
> +	 */
> +	__u64 timeline_value;
> +};
> +
> +/**
> + * enum drm_panthor_dev_query_type - Query type
> + *
> + * Place new types at the end, don't re-oder, don't remove or replace.
s/re-oder/re-order/

> + */
> +enum drm_panthor_dev_query_type {
> +	/** @DRM_PANTHOR_DEV_QUERY_GPU_INFO: Query GPU information. */
> +	DRM_PANTHOR_DEV_QUERY_GPU_INFO = 0,
> +
> +	/** @DRM_PANTHOR_DEV_QUERY_CSIF_INFO: Query command-stream interface information. */
> +	DRM_PANTHOR_DEV_QUERY_CSIF_INFO,
> +};
> +
> +/**
> + * struct drm_panthor_gpu_info - GPU information
> + *
> + * Structure grouping all queryable information relating to the GPU.
> + */
> +struct drm_panthor_gpu_info {
> +	/** @gpu_id : GPU ID. */
> +	__u32 gpu_id;
> +#define DRM_PANTHOR_ARCH_MAJOR(x)		((x) >> 28)
> +#define DRM_PANTHOR_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> +#define DRM_PANTHOR_ARCH_REV(x)			(((x) >> 20) & 0xf)
> +#define DRM_PANTHOR_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> +#define DRM_PANTHOR_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> +#define DRM_PANTHOR_VERSION_STATUS(x)		((x) & 0xf)
> +
> +	/** @gpu_rev: GPU revision. */
> +	__u32 gpu_rev;
> +
> +	/** @csf_id: Command stream frontend ID. */
> +	__u32 csf_id;
> +#define DRM_PANTHOR_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> +#define DRM_PANTHOR_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> +#define DRM_PANTHOR_CSHW_REV(x)			(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_MCU_MAJOR(x)		(((x) >> 10) & 0x3f)
> +#define DRM_PANTHOR_MCU_MINOR(x)		(((x) >> 4) & 0x3f)
> +#define DRM_PANTHOR_MCU_REV(x)			((x) & 0xf)
> +
> +	/** @l2_features: L2-cache features. */
> +	__u32 l2_features;
> +
> +	/** @tiler_features: Tiler features. */
> +	__u32 tiler_features;
> +
> +	/** @mem_features: Memory features. */
> +	__u32 mem_features;
> +
> +	/** @mmu_features: MMU features. */
> +	__u32 mmu_features;
> +#define DRM_PANTHOR_MMU_VA_BITS(x)		((x) & 0xff)
> +
> +	/** @thread_features: Thread features. */
> +	__u32 thread_features;
> +
> +	/** @max_threads: Maximum number of threads. */
> +	__u32 max_threads;
> +
> +	/** @thread_max_workgroup_size: Maximum workgroup size. */
> +	__u32 thread_max_workgroup_size;
> +
> +	/**
> +	 * @thread_max_barrier_size: Maximum number of threads that can wait
> +	 * simultaneously on a barrier.
> +	 */
> +	__u32 thread_max_barrier_size;
> +
> +	/** @coherency_features: Coherency features. */
> +	__u32 coherency_features;
> +
> +	/** @texture_features: Texture features. */
> +	__u32 texture_features[4];
> +
> +	/** @as_present: Bitmask encoding the number of address-space exposed by the MMU. */
> +	__u32 as_present;
> +
> +	/** @core_group_count: Number of core groups. */
> +	__u32 core_group_count;
> +
> +	/** @pad: Zero on return. */
> +	__u32 pad;
> +
> +	/** @shader_present: Bitmask encoding the shader cores exposed by the GPU. */
> +	__u64 shader_present;
> +
> +	/** @l2_present: Bitmask encoding the L2 caches exposed by the GPU. */
> +	__u64 l2_present;
> +
> +	/** @tiler_present: Bitmask encoding the tiler unit exposed by the GPU. */
s/unit/units/

> +	__u64 tiler_present;
> +};
> +
> +/**
> + * struct drm_panthor_csif_info - Command stream interface information
> + *
> + * Structure grouping all queryable information relating to the command stream interface.
> + */
> +struct drm_panthor_csif_info {
> +	/** @csg_slot_count: Number of command stream group slots exposed by the firmware. */
> +	__u32 csg_slot_count;
> +
> +	/** @cs_slot_count: Number of command stream slot per group. */
s/slot/slots/

> +	__u32 cs_slot_count;
> +
> +	/** @cs_reg_count: Number of command stream register. */
s/register/registers/

> +	__u32 cs_reg_count;
> +
> +	/** @scoreboard_slot_count: Number of scoreboard slot. */
s/slot/slots/

> +	__u32 scoreboard_slot_count;
> +
> +	/**
> +	 * @unpreserved_cs_reg_count: Number of command stream registers reserved by
> +	 * the kernel driver to call a userspace command stream.
> +	 *
> +	 * All registers can be used by a userspace command stream, but the
> +	 * [cs_slot_count - unpreserved_cs_reg_count .. cs_slot_count] registers are
> +	 * used by the kernel when DRM_PANTHOR_IOCTL_GROUP_SUBMIT is called.
> +	 */
> +	__u32 unpreserved_cs_reg_count;
> +
> +	/**
> +	 * @pad: Padding field, set to zero.
> +	 */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_dev_query - Arguments passed to DRM_PANTHOR_IOCTL_DEV_QUERY
> + */
> +struct drm_panthor_dev_query {
> +	/** @type: the query type (see drm_panthor_dev_query_type). */
> +	__u32 type;
> +
> +	/**
> +	 * @size: size of the type being queried.
> +	 *
> +	 * If pointer is NULL, size is updated by the driver to provide the
> +	 * output structure size. If pointer is not NULL, the driver will
> +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> +	 * and update the size accordingly. This allows us to extend query
> +	 * types without breaking userspace.
> +	 */
> +	__u32 size;
> +
> +	/**
> +	 * @pointer: user pointer to a query type struct.
> +	 *
> +	 * Pointer can be NULL, in which case, nothing is copied, but the
> +	 * actual structure size is returned. If not NULL, it must point to
> +	 * a location that's large enough to hold size bytes.
> +	 */
> +	__u64 pointer;
> +};
> +
> +/**
> + * struct drm_panthor_vm_create - Arguments passed to DRM_PANTHOR_IOCTL_VM_CREATE
> + */
> +struct drm_panthor_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID. */
> +	__u32 id;
> +
> +	/**
> +	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
> +	 *
> +	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
> +	 *
> +	 * Kernel VA space is always placed at the top of the supported VA range.
> +	 */
> +	__u64 kernel_va_range;
> +};
> +
> +/**
> + * struct drm_panthor_vm_destroy - Arguments passed to DRM_PANTHOR_IOCTL_VM_DESTROY
> + */
> +struct drm_panthor_vm_destroy {
> +	/** @id: ID of the VM to destroy. */
> +	__u32 id;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_op_flags - VM bind operation flags
> + */
> +enum drm_panthor_vm_bind_op_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_READONLY: Map the memory read-only.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_READONLY = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC: Map the memory not-executable.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC = 1 << 1,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED: Map the memory uncached.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED = 1 << 2,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,
> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_MAP: Map operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MAP = 0 << 28,
> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP: Unmap operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP = 1 << 28,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind_op - VM bind operation
> + */
> +struct drm_panthor_vm_bind_op {
> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags flags. */
> +	__u32 flags;
> +
> +	/**
> +	 * @bo_handle: Handle of the buffer object to map.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u32 bo_handle;
> +
> +	/**
> +	 * @bo_offset: Buffer object offset.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u64 bo_offset;
> +
> +	/**
> +	 * @va: Virtual address to map/unmap.
> +	 */
> +	__u64 va;
> +
> +	/** @size: Size to map/unmap. */
> +	__u64 size;
> +
> +	/**
> +	 * @syncs: Array of synchronization operations.
> +	 *
> +	 * This array must be empty if %DRM_PANTHOR_VM_BIND_ASYNC is not set on
> +	 * the drm_panthor_vm_bind object containing this VM bind operation.

You should state this is an array of struct drm_panthor_sync_op.

> +	 */
> +	struct drm_panthor_obj_array syncs;
> +
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_flags - VM bind flags
> + */
> +enum drm_panthor_vm_bind_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_ASYNC: VM bind operations are queued to the VM
> +	 * queue instead of being executed synchronously.
> +	 */
> +	DRM_PANTHOR_VM_BIND_ASYNC = 1 << 0,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
> + */
> +struct drm_panthor_vm_bind {
> +	/** @vm_id: VM targeted by the bind request. */
> +	__u32 vm_id;
> +
> +	/** @flags: Combination of drm_panthor_vm_bind_flags flags. */
> +	__u32 flags;
> +
> +	/** @ops: Array of bind operations. */

Array of struct drm_panthor_vm_bind_op

> +	struct drm_panthor_obj_array ops;
> +};
> +
> +/**
> + * enum drm_panthor_bo_flags - Buffer object flags, passed at creation time.
> + */
> +enum drm_panthor_bo_flags {
> +	/** @DRM_PANTHOR_BO_NO_MMAP: The buffer object will never be CPU-mapped in userspace. */
> +	DRM_PANTHOR_BO_NO_MMAP = (1 << 0),
> +};
> +
> +/**
> + * struct drm_panthor_bo_create - Arguments passed to DRM_IOCTL_PANTHOR_BO_CREATE.
> + */
> +struct drm_panthor_bo_create {
> +	/**
> +	 * @size: Requested size for the object
> +	 *
> +	 * The (page-aligned) allocated size for the object will be returned.
> +	 */
> +	__u64 size;
> +
> +	/**
> +	 * @flags: Flags. Must be a combination of drm_panthor_bo_flags flags.
> +	 */
> +	__u32 flags;
> +
> +	/**
> +	 * @exclusive_vm_id: Exclusive VM this buffer object will be mapped to.
> +	 *
> +	 * If not zero, the field must refer to a valid VM ID, and implies that:
> +	 *  - the buffer object will only ever be bound to that VM
> +	 *  - cannot be exported as a PRIME fd
> +	 */
> +	__u32 exclusive_vm_id;
> +
> +	/**
> +	 * @handle: Returned handle for the object.
> +	 *
> +	 * Object handles are nonzero.
> +	 */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_bo_mmap_offset - Arguments passed to DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET.
> + */
> +struct drm_panthor_bo_mmap_offset {
> +	/** @handle: Handle of the object we want an mmap offset for. */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @offset: The fake offset to use for subsequent mmap calls. */
> +	__u64 offset;
> +};
> +
> +/**
> + * struct drm_panthor_queue_create - Queue creation arguments.
> + */
> +struct drm_panthor_queue_create {
> +	/**
> +	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
> +	 * 15 being the highest priority.
> +	 */
> +	__u8 priority;
> +
> +	/** @pad: Padding fields, MBZ. */
> +	__u8 pad[3];
> +
> +	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
> +	__u32 ringbuf_size;
> +};
> +
> +/**
> + * enum drm_panthor_group_priority - Scheduling group priority
> + */
> +enum drm_panthor_group_priority {
> +	/** @PANTHOR_GROUP_PRIORITY_LOW: Low priority group. */
> +	PANTHOR_GROUP_PRIORITY_LOW = 0,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
> +	PANTHOR_GROUP_PRIORITY_MEDIUM,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
> +	PANTHOR_GROUP_PRIORITY_HIGH,
> +};
> +
> +/**
> + * struct drm_panthor_group_create - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_CREATE
> + */
> +struct drm_panthor_group_create {
> +	/** @queues: Array of drm_panthor_create_cs_queue elements. */

s/drm_panthor_create_cs_queue/drm_panthor_queue_create/

> +	struct drm_panthor_obj_array queues;
> +
> +	/**
> +	 * @max_compute_cores: Maximum number of cores that can be used by compute
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @compute_core_mask.
> +	 */
> +	__u8 max_compute_cores;
> +
> +	/**
> +	 * @max_fragment_cores: Maximum number of cores that can be used by fragment
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @fragment_core_mask.
> +	 */
> +	__u8 max_fragment_cores;
> +
> +	/**
> +	 * @max_tiler_cores: Maximum number of tilers that can be used by tiler jobs
> +	 * across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @tiler_core_mask.
> +	 */
> +	__u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (see drm_drm_panthor_cs_group_priority). */

s/drm_drm_panthor_cs_group_priority/enum drm_panthor_group_priority/

> +	__u8 priority;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +
> +	/**
> +	 * @compute_core_mask: Mask encoding cores that can be used for compute jobs.
> +	 *
> +	 * This field must have at least @max_compute_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 compute_core_mask;
> +
> +	/**
> +	 * @fragment_core_mask: Mask encoding cores that can be used for fragment jobs.
> +	 *
> +	 * This field must have at least @max_fragment_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 fragment_core_mask;
> +
> +	/**
> +	 * @tiler_core_mask: Mask encoding cores that can be used for tiler jobs.
> +	 *
> +	 * This field must have at least @max_tiler_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::tiler_present.
> +	 */
> +	__u64 tiler_core_mask;
> +
> +	/**
> +	 * @vm_id: VM ID to bind this group to.
> +	 *
> +	 * All submission to queues bound to this group will use this VM.
> +	 */
> +	__u32 vm_id;
> +
> +	/**
> +	 * @group_handle: Returned group handle. Passed back when submitting jobs or
> +	 * destroying a group.
> +	 */
> +	__u32 group_handle;
> +};
> +
> +/**
> + * struct drm_panthor_group_destroy - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_DESTROY
> + */
> +struct drm_panthor_group_destroy {
> +	/** @group_handle: Group to destroy */
> +	__u32 group_handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_queue_submit - Job submission arguments.
> + *
> + * This is describing the userspace command stream to call from the kernel
> + * command stream ring-buffer. Queue submission is always part of a group
> + * submission, taking one or more jobs to submit to the underlying queues.
> + */
> +struct drm_panthor_queue_submit {
> +	/** @queue_index: Index of the queue inside a group. */
> +	__u32 queue_index;
> +
> +	/**
> +	 * @stream_size: Size of the command stream to execute.
> +	 *
> +	 * Must be 64-bit/8-byte aligned (the size of a CS instruction)
> +	 *
> +	 * Can be zero if stream_addr is zero too.
> +	 */
> +	__u32 stream_size;
> +
> +	/**
> +	 * @stream_addr: GPU address of the command stream to execute.
> +	 *
> +	 * Must be aligned on 64-byte.
> +	 *
> +	 * Can be zero is stream_size is zero too.
> +	 */
> +	__u64 stream_addr;
> +
> +	/**
> +	 * @latest_flush: FLUSH_ID read at the time the stream was built.
> +	 *
> +	 * This allows cache flush elimination for the automatic
> +	 * flush+invalidate(all) done at submission time, which is needed to
> +	 * ensure the GPU doesn't get garbage when reading the indirect command
> +	 * stream buffers. If you want the cache flush to happen
> +	 * unconditionally, pass a zero here.
> +	 */
> +	__u32 latest_flush;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @syncs: Array of sync operations. */

Array of struct drm_panthor_sync_op.

Steve

> +	struct drm_panthor_obj_array syncs;
> +};
> +
> +/**
> + * struct drm_panthor_group_submit - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
> + */
> +struct drm_panthor_group_submit {
> +	/** @group_handle: Handle of the group to queue jobs to. */
> +	__u32 group_handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @queue_submits: Array of drm_panthor_queue_submit objects. */
> +	struct drm_panthor_obj_array queue_submits;
> +};
> +
> +/**
> + * enum drm_panthor_group_state_flags - Group state flags
> + */
> +enum drm_panthor_group_state_flags {
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_TIMEDOUT: Group had unfinished jobs.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_TIMEDOUT = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_FATAL_FAULT: Group had fatal faults.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> +};
> +
> +/**
> + * struct drm_panthor_group_get_state - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_GET_STATE
> + *
> + * Used to query the state of a group and decide whether a new group should be created to
> + * replace it.
> + */
> +struct drm_panthor_group_get_state {
> +	/** @group_handle: Handle of the group to query state on */
> +	__u32 group_handle;
> +
> +	/**
> +	 * @state: Combination of DRM_PANTHOR_GROUP_STATE_* flags encoding the
> +	 * group state.
> +	 */
> +	__u32 state;
> +
> +	/** @fatal_queues: Bitmask of queues that faced fatal faults. */
> +	__u32 fatal_queues;
> +
> +	/** @pad: MBZ */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_create - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE
> + */
> +struct drm_panthor_tiler_heap_create {
> +	/** @vm_id: VM ID the tiler heap should be mapped to */
> +	__u32 vm_id;
> +
> +	/** @initial_chunk_count: Initial number of chunks to allocate. */
> +	__u32 initial_chunk_count;
> +
> +	/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
> +	__u32 chunk_size;
> +
> +	/** @max_chunks: Maximum number of chunks that can be allocated. */
> +	__u32 max_chunks;
> +
> +	/**
> +	 * @target_in_flight: Maximum number of in-flight render passes.
> +	 *
> +	 * If the heap has more than tiler jobs in-flight, the FW will wait for render
> +	 * passes to finish before queuing new tiler jobs.
> +	 */
> +	__u32 target_in_flight;
> +
> +	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
> +	__u32 handle;
> +
> +	/** @tiler_heap_ctx_gpu_va: Returned heap GPU virtual address returned */
> +	__u64 tiler_heap_ctx_gpu_va;
> +
> +	/**
> +	 * @first_heap_chunk_gpu_va: First heap chunk.
> +	 *
> +	 * The tiler heap is formed of heap chunks forming a single-link list. This
> +	 * is the first element in the list.
> +	 */
> +	__u64 first_heap_chunk_gpu_va;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
> + */
> +struct drm_panthor_tiler_heap_destroy {
> +	/** @handle: Handle of the tiler heap to destroy */
> +	__u32 handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif /* _PANTHOR_DRM_H_ */

Liviu Dudau Sept. 1, 2023, 1:59 p.m. UTC | #2

Hi Boris,

On Wed, Aug 09, 2023 at 06:53:15PM +0200, Boris Brezillon wrote:
> Panthor follows the lead of other recently submitted drivers with
> ioctls allowing us to support modern Vulkan features, like sparse memory
> binding:
> 
> - Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
>   with the 'exclusive-VM' bit to speed-up BO reservation on job submission
> - VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
>   ioctl is loosely based on the Xe model, and can handle both
>   asynchronous and synchronous requests
> - GPU execution context creation/destruction, tiler heap context creation
>   and job submission. Those ioctls reflect how the hardware/scheduler
>   works and are thus driver specific.
> 
> We also have a way to expose IO regions, such that the usermode driver
> can directly access specific/well-isolate registers, like the
> LATEST_FLUSH register used to implement cache-flush reduction.
> 
> This uAPI intentionally keeps usermode queues out of the scope, which
> explains why doorbell registers and command stream ring-buffers are not
> directly exposed to userspace.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
> - Add the concept of exclusive_vm at BO creation time
> - Add missing padding fields
> - Add documentation
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Minor fixes in addition to what Steve has alread flagged.

> ---
>  Documentation/gpu/driver-uapi.rst |   5 +
>  include/uapi/drm/panthor_drm.h    | 862 ++++++++++++++++++++++++++++++
>  2 files changed, 867 insertions(+)
>  create mode 100644 include/uapi/drm/panthor_drm.h
> 
> diff --git a/Documentation/gpu/driver-uapi.rst b/Documentation/gpu/driver-uapi.rst
> index c08bcbb95fb3..7a667901830f 100644
> --- a/Documentation/gpu/driver-uapi.rst
> +++ b/Documentation/gpu/driver-uapi.rst
> @@ -17,3 +17,8 @@ VM_BIND / EXEC uAPI
>      :doc: Overview
>  
>  .. kernel-doc:: include/uapi/drm/nouveau_drm.h
> +
> +drm/panthor uAPI
> +================
> +
> +.. kernel-doc:: include/uapi/drm/panthor_drm.h
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> new file mode 100644
> index 000000000000..e217eb5ad198
> --- /dev/null
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -0,0 +1,862 @@
> +/* SPDX-License-Identifier: MIT */
> +/* Copyright (C) 2023 Collabora ltd. */
> +#ifndef _PANTHOR_DRM_H_
> +#define _PANTHOR_DRM_H_
> +
> +#include "drm.h"
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/**
> + * DOC: Introduction
> + *
> + * This documentation decribes the Panthor IOCTLs.
> + *
> + * Just a few generic rules about the data passed to the Panthor IOCTLs:
> + *
> + * - Structures must be aligned on 64-bit/8-byte. If the object is not
> + *   naturally aligned, a padding field must be added.
> + * - Fields must be explicity aligned to their natural type alignment with
> + *   pad[0..N] fields.
> + * - All padding fields will be checked by the driver to make sure they are
> + *   zeroed.
> + * - Flags can be added, but not removed/replaced.
> + * - New fields can be added to the main structures (the structures
> + *   directly passed to the ioctl). Those fiels can be added at the end of
> + *   the structure, or replace existing padding fields. Any new field being
> + *   added must preserve the behavior that existed before those fields were
> + *   added when a value of zero is passed.
> + * - New fields can be added to indirect objects (objects pointed by the
> + *   main structure), iff those objects are passed a size to reflect the
> + *   size known by the userspace driver (see drm_panthor_obj_array::stride
> + *   or drm_panthor_dev_query::size).
> + * - If the kernel driver is too old to know some fields, those will
> + *   be ignored (input) and set back to zero (output).
> + * - If userspace is too old to know some fields, those will be zeroed
> + *   (input) before the structure is parsed by the kernel driver.
> + * - Each new flag/field addition must come with a driver version update so
> + *   the userspace driver doesn't have to trial and error to know which
> + *   flags are supported.
> + * - Structures should not contain unions, as this would defeat the
> + *   extensibility of such structures.
> + * - IOCTLs can't be removed or replaced. New IOCTL IDs should be placed
> + *   at the end of the drm_panthor_ioctl_id enum.
> + */
> +
> +/**
> + * DOC: MMIO regions exposed to userspace.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> + *
> + * File offset for all MMIO regions being exposed to userspace. Don't use
> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> + *
> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> + * GPU cache flushling through CS instructions, but the flush reduction
> + * mechanism requires a flush_id. This flush_id could be queried with an
> + * ioctl, but Arm provides a well-isolated register page containing only this
> + * read-only register, so let's expose this page through a static mmap offset
> + * and allow direct mapping of this MMIO region so we can avoid the
> + * user <-> kernel round-trip.
> + */
> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)
> +
> +/**
> + * DOC: IOCTL IDs
> + *
> + * enum drm_panthor_ioctl_id - IOCTL IDs
> + *
> + * Place new ioctls at the end, don't re-oder, don't replace or remove entries.
> + *
> + * These IDs are not meant to be used directly. Use the DRM_IOCTL_PANTHOR_xxx
> + * definitions instead.
> + */
> +enum drm_panthor_ioctl_id {
> +	/** @DRM_PANTHOR_DEV_QUERY: Query device information. */
> +	DRM_PANTHOR_DEV_QUERY = 0,
> +
> +	/** @DRM_PANTHOR_VM_CREATE: Create a VM. */
> +	DRM_PANTHOR_VM_CREATE,
> +
> +	/** @DRM_PANTHOR_VM_DESTROY: Destroy a VM. */
> +	DRM_PANTHOR_VM_DESTROY,
> +
> +	/** @DRM_PANTHOR_VM_BIND: Bind/unbind memory to a VM. */
> +	DRM_PANTHOR_VM_BIND,
> +
> +	/** @DRM_PANTHOR_BO_CREATE: Create a buffer object. */
> +	DRM_PANTHOR_BO_CREATE,
> +
> +	/**
> +	 * @DRM_PANTHOR_BO_MMAP_OFFSET: Get the file offset to pass to
> +	 * mmap to map a GEM object.
> +	 */
> +	DRM_PANTHOR_BO_MMAP_OFFSET,
> +
> +	/** @DRM_PANTHOR_GROUP_CREATE: Create a scheduling group. */
> +	DRM_PANTHOR_GROUP_CREATE,
> +
> +	/** @DRM_PANTHOR_GROUP_DESTROY: Destroy a scheduling group. */
> +	DRM_PANTHOR_GROUP_DESTROY,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_SUBMIT: Submit jobs to queues belonging
> +	 * to a specific scheduling group.
> +	 */
> +	DRM_PANTHOR_GROUP_SUBMIT,
> +
> +	/** @DRM_PANTHOR_GROUP_GET_STATE: Get the state of a scheduling group. */
> +	DRM_PANTHOR_GROUP_GET_STATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_CREATE: Create a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_CREATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_DESTROY,
> +};
> +
> +/**
> + * DRM_IOCTL_PANTHOR() - Build a Panthor IOCTL number
> + * @__access: Access type. Must be R, W or RW.
> + * @__id: One of the DRM_PANTHOR_xxx id.
> + * @__type: Suffix of the type being passed to the IOCTL.
> + *
> + * Don't use this macro directly, use the DRM_IOCTL_PANTHOR_xxx
> + * values instead.
> + *
> + * Return: An IOCTL number to be passed to ioctl() from userspace.
> + */
> +#define DRM_IOCTL_PANTHOR(__access, __id, __type) \
> +	DRM_IO ## __access(DRM_COMMAND_BASE + DRM_PANTHOR_ ## __id, \
> +			   struct drm_panthor_ ## __type)
> +
> +#define DRM_IOCTL_PANTHOR_DEV_QUERY \
> +	DRM_IOCTL_PANTHOR(WR, DEV_QUERY, dev_query)
> +#define DRM_IOCTL_PANTHOR_VM_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, VM_CREATE, vm_create)
> +#define DRM_IOCTL_PANTHOR_VM_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, VM_DESTROY, vm_destroy)
> +#define DRM_IOCTL_PANTHOR_VM_BIND \
> +	DRM_IOCTL_PANTHOR(WR, VM_BIND, vm_bind)
> +#define DRM_IOCTL_PANTHOR_BO_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, BO_CREATE, bo_create)
> +#define DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET \
> +	DRM_IOCTL_PANTHOR(WR, BO_MMAP_OFFSET, bo_mmap_offset)
> +#define DRM_IOCTL_PANTHOR_GROUP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_CREATE, group_create)
> +#define DRM_IOCTL_PANTHOR_GROUP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_DESTROY, group_destroy)
> +#define DRM_IOCTL_PANTHOR_GROUP_SUBMIT \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_SUBMIT, group_submit)
> +#define DRM_IOCTL_PANTHOR_GROUP_GET_STATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_GET_STATE, group_get_state)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
> +
> +/**
> + * DOC: IOCTL arguments
> + */
> +
> +/**
> + * struct drm_panthor_obj_array - Object array.
> + *
> + * This object is used to pass an array of objects whose size it subject to changes in

s/it subject/is subject/

> + * future versions of the driver. In order to support this mutability, we pass a stride
> + * describing the size of the object as known by userspace.
> + *
> + * You shouldn't fill drm_panthor_obj_array fields directly. You should instead use
> + * the DRM_PANTHOR_OBJ_ARRAY() macro that takes care of initializing the stride to
> + * the object size.
> + */
> +struct drm_panthor_obj_array {
> +	/** @stride: Stride of object struct. Used for versioning. */
> +	__u32 stride;
> +
> +	/** @count: Number of objects in the array. */
> +	__u32 count;
> +
> +	/** @array: User pointer to an array of objects. */
> +	__u64 array;
> +};
> +
> +/**
> + * DRM_PANTHOR_OBJ_ARRAY() - Initialize a drm_panthor_obj_array field.
> + * @cnt: Number of elements in the array.
> + * @ptr: Pointer to the array to pass to the kernel.
> + *
> + * Macro initializing a drm_panthor_obj_array based on the object size as known
> + * by userspace.
> + */
> +#define DRM_PANTHOR_OBJ_ARRAY(cnt, ptr) \
> +	{ .stride = sizeof((ptr)[0]), .count = (cnt), .array = (__u64)(uintptr_t)(ptr) }
> +
> +/**
> + * enum drm_panthor_sync_op_flags - Synchronization operation flags.
> + */
> +enum drm_panthor_sync_op_flags {
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
> +	 * object type.
> +	 */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
> +	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
> +	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,

This gets flagged by GCC in pedantic mode as not an integer constant, see [1]. Fix is to use

	DRM_PANTHOR_SYNC_OP_SIGNAL = (int)(1u << 31),

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71803

> +};
> +
> +/**
> + * struct drm_panthor_sync_op - Synchronization operation.
> + */
> +struct drm_panthor_sync_op {
> +	/** @flags: Synchronization operation flags. Combination of DRM_PANTHOR_SYNC_OP values. */
> +	__u32 flags;
> +
> +	/** @handle: Sync handle. */
> +	__u32 handle;
> +
> +	/**
> +	 * @timeline_value: MBZ if
> +	 * (flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK) !=
> +	 * DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ.
> +	 */
> +	__u64 timeline_value;
> +};
> +
> +/**
> + * enum drm_panthor_dev_query_type - Query type
> + *
> + * Place new types at the end, don't re-oder, don't remove or replace.
> + */
> +enum drm_panthor_dev_query_type {
> +	/** @DRM_PANTHOR_DEV_QUERY_GPU_INFO: Query GPU information. */
> +	DRM_PANTHOR_DEV_QUERY_GPU_INFO = 0,
> +
> +	/** @DRM_PANTHOR_DEV_QUERY_CSIF_INFO: Query command-stream interface information. */
> +	DRM_PANTHOR_DEV_QUERY_CSIF_INFO,
> +};
> +
> +/**
> + * struct drm_panthor_gpu_info - GPU information
> + *
> + * Structure grouping all queryable information relating to the GPU.
> + */
> +struct drm_panthor_gpu_info {
> +	/** @gpu_id : GPU ID. */
> +	__u32 gpu_id;
> +#define DRM_PANTHOR_ARCH_MAJOR(x)		((x) >> 28)
> +#define DRM_PANTHOR_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> +#define DRM_PANTHOR_ARCH_REV(x)			(((x) >> 20) & 0xf)
> +#define DRM_PANTHOR_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> +#define DRM_PANTHOR_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> +#define DRM_PANTHOR_VERSION_STATUS(x)		((x) & 0xf)
> +
> +	/** @gpu_rev: GPU revision. */
> +	__u32 gpu_rev;
> +
> +	/** @csf_id: Command stream frontend ID. */
> +	__u32 csf_id;
> +#define DRM_PANTHOR_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> +#define DRM_PANTHOR_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> +#define DRM_PANTHOR_CSHW_REV(x)			(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_MCU_MAJOR(x)		(((x) >> 10) & 0x3f)
> +#define DRM_PANTHOR_MCU_MINOR(x)		(((x) >> 4) & 0x3f)
> +#define DRM_PANTHOR_MCU_REV(x)			((x) & 0xf)
> +
> +	/** @l2_features: L2-cache features. */
> +	__u32 l2_features;
> +
> +	/** @tiler_features: Tiler features. */
> +	__u32 tiler_features;
> +
> +	/** @mem_features: Memory features. */
> +	__u32 mem_features;
> +
> +	/** @mmu_features: MMU features. */
> +	__u32 mmu_features;
> +#define DRM_PANTHOR_MMU_VA_BITS(x)		((x) & 0xff)
> +
> +	/** @thread_features: Thread features. */
> +	__u32 thread_features;
> +
> +	/** @max_threads: Maximum number of threads. */
> +	__u32 max_threads;
> +
> +	/** @thread_max_workgroup_size: Maximum workgroup size. */
> +	__u32 thread_max_workgroup_size;
> +
> +	/**
> +	 * @thread_max_barrier_size: Maximum number of threads that can wait
> +	 * simultaneously on a barrier.
> +	 */
> +	__u32 thread_max_barrier_size;
> +
> +	/** @coherency_features: Coherency features. */
> +	__u32 coherency_features;
> +
> +	/** @texture_features: Texture features. */
> +	__u32 texture_features[4];
> +
> +	/** @as_present: Bitmask encoding the number of address-space exposed by the MMU. */
> +	__u32 as_present;
> +
> +	/** @core_group_count: Number of core groups. */
> +	__u32 core_group_count;
> +
> +	/** @pad: Zero on return. */
> +	__u32 pad;
> +
> +	/** @shader_present: Bitmask encoding the shader cores exposed by the GPU. */
> +	__u64 shader_present;
> +
> +	/** @l2_present: Bitmask encoding the L2 caches exposed by the GPU. */
> +	__u64 l2_present;
> +
> +	/** @tiler_present: Bitmask encoding the tiler unit exposed by the GPU. */
> +	__u64 tiler_present;
> +};
> +
> +/**
> + * struct drm_panthor_csif_info - Command stream interface information
> + *
> + * Structure grouping all queryable information relating to the command stream interface.
> + */
> +struct drm_panthor_csif_info {
> +	/** @csg_slot_count: Number of command stream group slots exposed by the firmware. */
> +	__u32 csg_slot_count;
> +
> +	/** @cs_slot_count: Number of command stream slot per group. */
> +	__u32 cs_slot_count;
> +
> +	/** @cs_reg_count: Number of command stream register. */
> +	__u32 cs_reg_count;
> +
> +	/** @scoreboard_slot_count: Number of scoreboard slot. */
> +	__u32 scoreboard_slot_count;
> +
> +	/**
> +	 * @unpreserved_cs_reg_count: Number of command stream registers reserved by
> +	 * the kernel driver to call a userspace command stream.
> +	 *
> +	 * All registers can be used by a userspace command stream, but the
> +	 * [cs_slot_count - unpreserved_cs_reg_count .. cs_slot_count] registers are
> +	 * used by the kernel when DRM_PANTHOR_IOCTL_GROUP_SUBMIT is called.
> +	 */
> +	__u32 unpreserved_cs_reg_count;
> +
> +	/**
> +	 * @pad: Padding field, set to zero.
> +	 */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_dev_query - Arguments passed to DRM_PANTHOR_IOCTL_DEV_QUERY
> + */
> +struct drm_panthor_dev_query {
> +	/** @type: the query type (see drm_panthor_dev_query_type). */
> +	__u32 type;
> +
> +	/**
> +	 * @size: size of the type being queried.
> +	 *
> +	 * If pointer is NULL, size is updated by the driver to provide the
> +	 * output structure size. If pointer is not NULL, the driver will
> +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> +	 * and update the size accordingly. This allows us to extend query
> +	 * types without breaking userspace.
> +	 */
> +	__u32 size;
> +
> +	/**
> +	 * @pointer: user pointer to a query type struct.
> +	 *
> +	 * Pointer can be NULL, in which case, nothing is copied, but the
> +	 * actual structure size is returned. If not NULL, it must point to
> +	 * a location that's large enough to hold size bytes.
> +	 */
> +	__u64 pointer;
> +};
> +
> +/**
> + * struct drm_panthor_vm_create - Arguments passed to DRM_PANTHOR_IOCTL_VM_CREATE
> + */
> +struct drm_panthor_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID. */
> +	__u32 id;
> +
> +	/**
> +	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
> +	 *
> +	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
> +	 *
> +	 * Kernel VA space is always placed at the top of the supported VA range.
> +	 */
> +	__u64 kernel_va_range;
> +};
> +
> +/**
> + * struct drm_panthor_vm_destroy - Arguments passed to DRM_PANTHOR_IOCTL_VM_DESTROY
> + */
> +struct drm_panthor_vm_destroy {
> +	/** @id: ID of the VM to destroy. */
> +	__u32 id;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_op_flags - VM bind operation flags
> + */
> +enum drm_panthor_vm_bind_op_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_READONLY: Map the memory read-only.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_READONLY = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC: Map the memory not-executable.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC = 1 << 1,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED: Map the memory uncached.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED = 1 << 2,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,

Same here for GCC being pedantic. Also, on 32 bits this is going to exceed UINT_MAX.

Rest of the file looks good to me.

Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>

Best regard,
Liviu

> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_MAP: Map operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MAP = 0 << 28,
> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP: Unmap operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP = 1 << 28,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind_op - VM bind operation
> + */
> +struct drm_panthor_vm_bind_op {
> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags flags. */
> +	__u32 flags;
> +
> +	/**
> +	 * @bo_handle: Handle of the buffer object to map.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u32 bo_handle;
> +
> +	/**
> +	 * @bo_offset: Buffer object offset.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u64 bo_offset;
> +
> +	/**
> +	 * @va: Virtual address to map/unmap.
> +	 */
> +	__u64 va;
> +
> +	/** @size: Size to map/unmap. */
> +	__u64 size;
> +
> +	/**
> +	 * @syncs: Array of synchronization operations.
> +	 *
> +	 * This array must be empty if %DRM_PANTHOR_VM_BIND_ASYNC is not set on
> +	 * the drm_panthor_vm_bind object containing this VM bind operation.
> +	 */
> +	struct drm_panthor_obj_array syncs;
> +
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_flags - VM bind flags
> + */
> +enum drm_panthor_vm_bind_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_ASYNC: VM bind operations are queued to the VM
> +	 * queue instead of being executed synchronously.
> +	 */
> +	DRM_PANTHOR_VM_BIND_ASYNC = 1 << 0,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
> + */
> +struct drm_panthor_vm_bind {
> +	/** @vm_id: VM targeted by the bind request. */
> +	__u32 vm_id;
> +
> +	/** @flags: Combination of drm_panthor_vm_bind_flags flags. */
> +	__u32 flags;
> +
> +	/** @ops: Array of bind operations. */
> +	struct drm_panthor_obj_array ops;
> +};
> +
> +/**
> + * enum drm_panthor_bo_flags - Buffer object flags, passed at creation time.
> + */
> +enum drm_panthor_bo_flags {
> +	/** @DRM_PANTHOR_BO_NO_MMAP: The buffer object will never be CPU-mapped in userspace. */
> +	DRM_PANTHOR_BO_NO_MMAP = (1 << 0),
> +};
> +
> +/**
> + * struct drm_panthor_bo_create - Arguments passed to DRM_IOCTL_PANTHOR_BO_CREATE.
> + */
> +struct drm_panthor_bo_create {
> +	/**
> +	 * @size: Requested size for the object
> +	 *
> +	 * The (page-aligned) allocated size for the object will be returned.
> +	 */
> +	__u64 size;
> +
> +	/**
> +	 * @flags: Flags. Must be a combination of drm_panthor_bo_flags flags.
> +	 */
> +	__u32 flags;
> +
> +	/**
> +	 * @exclusive_vm_id: Exclusive VM this buffer object will be mapped to.
> +	 *
> +	 * If not zero, the field must refer to a valid VM ID, and implies that:
> +	 *  - the buffer object will only ever be bound to that VM
> +	 *  - cannot be exported as a PRIME fd
> +	 */
> +	__u32 exclusive_vm_id;
> +
> +	/**
> +	 * @handle: Returned handle for the object.
> +	 *
> +	 * Object handles are nonzero.
> +	 */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_bo_mmap_offset - Arguments passed to DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET.
> + */
> +struct drm_panthor_bo_mmap_offset {
> +	/** @handle: Handle of the object we want an mmap offset for. */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @offset: The fake offset to use for subsequent mmap calls. */
> +	__u64 offset;
> +};
> +
> +/**
> + * struct drm_panthor_queue_create - Queue creation arguments.
> + */
> +struct drm_panthor_queue_create {
> +	/**
> +	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
> +	 * 15 being the highest priority.
> +	 */
> +	__u8 priority;
> +
> +	/** @pad: Padding fields, MBZ. */
> +	__u8 pad[3];
> +
> +	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
> +	__u32 ringbuf_size;
> +};
> +
> +/**
> + * enum drm_panthor_group_priority - Scheduling group priority
> + */
> +enum drm_panthor_group_priority {
> +	/** @PANTHOR_GROUP_PRIORITY_LOW: Low priority group. */
> +	PANTHOR_GROUP_PRIORITY_LOW = 0,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
> +	PANTHOR_GROUP_PRIORITY_MEDIUM,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
> +	PANTHOR_GROUP_PRIORITY_HIGH,
> +};
> +
> +/**
> + * struct drm_panthor_group_create - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_CREATE
> + */
> +struct drm_panthor_group_create {
> +	/** @queues: Array of drm_panthor_create_cs_queue elements. */
> +	struct drm_panthor_obj_array queues;
> +
> +	/**
> +	 * @max_compute_cores: Maximum number of cores that can be used by compute
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @compute_core_mask.
> +	 */
> +	__u8 max_compute_cores;
> +
> +	/**
> +	 * @max_fragment_cores: Maximum number of cores that can be used by fragment
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @fragment_core_mask.
> +	 */
> +	__u8 max_fragment_cores;
> +
> +	/**
> +	 * @max_tiler_cores: Maximum number of tilers that can be used by tiler jobs
> +	 * across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @tiler_core_mask.
> +	 */
> +	__u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (see drm_drm_panthor_cs_group_priority). */
> +	__u8 priority;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +
> +	/**
> +	 * @compute_core_mask: Mask encoding cores that can be used for compute jobs.
> +	 *
> +	 * This field must have at least @max_compute_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 compute_core_mask;
> +
> +	/**
> +	 * @fragment_core_mask: Mask encoding cores that can be used for fragment jobs.
> +	 *
> +	 * This field must have at least @max_fragment_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 fragment_core_mask;
> +
> +	/**
> +	 * @tiler_core_mask: Mask encoding cores that can be used for tiler jobs.
> +	 *
> +	 * This field must have at least @max_tiler_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::tiler_present.
> +	 */
> +	__u64 tiler_core_mask;
> +
> +	/**
> +	 * @vm_id: VM ID to bind this group to.
> +	 *
> +	 * All submission to queues bound to this group will use this VM.
> +	 */
> +	__u32 vm_id;
> +
> +	/**
> +	 * @group_handle: Returned group handle. Passed back when submitting jobs or
> +	 * destroying a group.
> +	 */
> +	__u32 group_handle;
> +};
> +
> +/**
> + * struct drm_panthor_group_destroy - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_DESTROY
> + */
> +struct drm_panthor_group_destroy {
> +	/** @group_handle: Group to destroy */
> +	__u32 group_handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_queue_submit - Job submission arguments.
> + *
> + * This is describing the userspace command stream to call from the kernel
> + * command stream ring-buffer. Queue submission is always part of a group
> + * submission, taking one or more jobs to submit to the underlying queues.
> + */
> +struct drm_panthor_queue_submit {
> +	/** @queue_index: Index of the queue inside a group. */
> +	__u32 queue_index;
> +
> +	/**
> +	 * @stream_size: Size of the command stream to execute.
> +	 *
> +	 * Must be 64-bit/8-byte aligned (the size of a CS instruction)
> +	 *
> +	 * Can be zero if stream_addr is zero too.
> +	 */
> +	__u32 stream_size;
> +
> +	/**
> +	 * @stream_addr: GPU address of the command stream to execute.
> +	 *
> +	 * Must be aligned on 64-byte.
> +	 *
> +	 * Can be zero is stream_size is zero too.
> +	 */
> +	__u64 stream_addr;
> +
> +	/**
> +	 * @latest_flush: FLUSH_ID read at the time the stream was built.
> +	 *
> +	 * This allows cache flush elimination for the automatic
> +	 * flush+invalidate(all) done at submission time, which is needed to
> +	 * ensure the GPU doesn't get garbage when reading the indirect command
> +	 * stream buffers. If you want the cache flush to happen
> +	 * unconditionally, pass a zero here.
> +	 */
> +	__u32 latest_flush;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @syncs: Array of sync operations. */
> +	struct drm_panthor_obj_array syncs;
> +};
> +
> +/**
> + * struct drm_panthor_group_submit - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
> + */
> +struct drm_panthor_group_submit {
> +	/** @group_handle: Handle of the group to queue jobs to. */
> +	__u32 group_handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @queue_submits: Array of drm_panthor_queue_submit objects. */
> +	struct drm_panthor_obj_array queue_submits;
> +};
> +
> +/**
> + * enum drm_panthor_group_state_flags - Group state flags
> + */
> +enum drm_panthor_group_state_flags {
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_TIMEDOUT: Group had unfinished jobs.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_TIMEDOUT = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_FATAL_FAULT: Group had fatal faults.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> +};
> +
> +/**
> + * struct drm_panthor_group_get_state - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_GET_STATE
> + *
> + * Used to query the state of a group and decide whether a new group should be created to
> + * replace it.
> + */
> +struct drm_panthor_group_get_state {
> +	/** @group_handle: Handle of the group to query state on */
> +	__u32 group_handle;
> +
> +	/**
> +	 * @state: Combination of DRM_PANTHOR_GROUP_STATE_* flags encoding the
> +	 * group state.
> +	 */
> +	__u32 state;
> +
> +	/** @fatal_queues: Bitmask of queues that faced fatal faults. */
> +	__u32 fatal_queues;
> +
> +	/** @pad: MBZ */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_create - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE
> + */
> +struct drm_panthor_tiler_heap_create {
> +	/** @vm_id: VM ID the tiler heap should be mapped to */
> +	__u32 vm_id;
> +
> +	/** @initial_chunk_count: Initial number of chunks to allocate. */
> +	__u32 initial_chunk_count;
> +
> +	/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
> +	__u32 chunk_size;
> +
> +	/** @max_chunks: Maximum number of chunks that can be allocated. */
> +	__u32 max_chunks;
> +
> +	/**
> +	 * @target_in_flight: Maximum number of in-flight render passes.
> +	 *
> +	 * If the heap has more than tiler jobs in-flight, the FW will wait for render
> +	 * passes to finish before queuing new tiler jobs.
> +	 */
> +	__u32 target_in_flight;
> +
> +	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
> +	__u32 handle;
> +
> +	/** @tiler_heap_ctx_gpu_va: Returned heap GPU virtual address returned */
> +	__u64 tiler_heap_ctx_gpu_va;
> +
> +	/**
> +	 * @first_heap_chunk_gpu_va: First heap chunk.
> +	 *
> +	 * The tiler heap is formed of heap chunks forming a single-link list. This
> +	 * is the first element in the list.
> +	 */
> +	__u64 first_heap_chunk_gpu_va;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
> + */
> +struct drm_panthor_tiler_heap_destroy {
> +	/** @handle: Handle of the tiler heap to destroy */
> +	__u32 handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif /* _PANTHOR_DRM_H_ */
> -- 
> 2.41.0
>

Boris Brezillon Sept. 1, 2023, 4:10 p.m. UTC | #3

On Wed,  9 Aug 2023 18:53:15 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> +/**
> + * DOC: MMIO regions exposed to userspace.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> + *
> + * File offset for all MMIO regions being exposed to userspace. Don't use
> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> + *
> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> + * GPU cache flushling through CS instructions, but the flush reduction
> + * mechanism requires a flush_id. This flush_id could be queried with an
> + * ioctl, but Arm provides a well-isolated register page containing only this
> + * read-only register, so let's expose this page through a static mmap offset
> + * and allow direct mapping of this MMIO region so we can avoid the
> + * user <-> kernel round-trip.
> + */
> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)

I'm playing with a 32-bit kernel/userspace, and this is problematic,
because vm_pgoff is limited to 32-bit there, meaning we can only map up
to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
userspace set the mmio range?

> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)

Steven Price Sept. 4, 2023, 7:42 a.m. UTC | #4

On 01/09/2023 17:10, Boris Brezillon wrote:
> On Wed,  9 Aug 2023 18:53:15 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
>> +/**
>> + * DOC: MMIO regions exposed to userspace.
>> + *
>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>> + *
>> + * File offset for all MMIO regions being exposed to userspace. Don't use
>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
>> + *
>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>> + *
>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
>> + * GPU cache flushling through CS instructions, but the flush reduction
>> + * mechanism requires a flush_id. This flush_id could be queried with an
>> + * ioctl, but Arm provides a well-isolated register page containing only this
>> + * read-only register, so let's expose this page through a static mmap offset
>> + * and allow direct mapping of this MMIO region so we can avoid the
>> + * user <-> kernel round-trip.
>> + */
>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
> 
> I'm playing with a 32-bit kernel/userspace, and this is problematic,
> because vm_pgoff is limited to 32-bit there, meaning we can only map up
> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> userspace set the mmio range?

Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
user space is likely to still be needed.

I can't really think of anything better than letting user space set the
MMIO range. Having an ioctl which returned a special fd just for MMIO
would be one option (which would preserve the full 44 bit GPU VA) but
seems somewhat overkill. Hiding the mmap within an ioctl would of course
be bad as it breaks tools like Valgrind.

Oh and please do make it a range - user space submission will be adding
to the MMIO range ;)

Steve

>> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)

Boris Brezillon Sept. 4, 2023, 8:26 a.m. UTC | #5

On Mon, 4 Sep 2023 08:42:08 +0100
Steven Price <steven.price@arm.com> wrote:

> On 01/09/2023 17:10, Boris Brezillon wrote:
> > On Wed,  9 Aug 2023 18:53:15 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> +/**
> >> + * DOC: MMIO regions exposed to userspace.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> >> + *
> >> + * File offset for all MMIO regions being exposed to userspace. Don't use
> >> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> >> + *
> >> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> >> + * GPU cache flushling through CS instructions, but the flush reduction
> >> + * mechanism requires a flush_id. This flush_id could be queried with an
> >> + * ioctl, but Arm provides a well-isolated register page containing only this
> >> + * read-only register, so let's expose this page through a static mmap offset
> >> + * and allow direct mapping of this MMIO region so we can avoid the
> >> + * user <-> kernel round-trip.
> >> + */
> >> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)  
> > 
> > I'm playing with a 32-bit kernel/userspace, and this is problematic,
> > because vm_pgoff is limited to 32-bit there, meaning we can only map up
> > to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> > userspace set the mmio range?  
> 
> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
> user space is likely to still be needed.

Well, I can tell you some people are using 32-bit kernels ;-).

> 
> I can't really think of anything better than letting user space set the
> MMIO range. Having an ioctl which returned a special fd just for MMIO
> would be one option (which would preserve the full 44 bit GPU VA) but
> seems somewhat overkill.

Yeah, I don't think I like the separate-fd approach. Just feels like it
goes against the DRM-way of doing things. And, with 32-bit userspace,
we'd be limited by the CPU VA range anyway. Of course it's orthogonal
to the max file offset, and just because we can't map all buffers at
once, doesn't mean we don't want to be able to address more than 4G of
memory. But with 43-bit left (I think I'd prefer if we enforce a log2
value for the mmio offset/size, meaning that the max MMIO range would be
1ull << 43), that means we're still able to address 8TB of memory. I
guess that's more than enough for 32-bit users...

> Hiding the mmap within an ioctl would of course
> be bad as it breaks tools like Valgrind.

Don't like this idea either.

> 
> Oh and please do make it a range - user space submission will be adding
> to the MMIO range ;)

Yeah, that was the plan (I keep usermode submission in the back of my
mind ;-)).

Boris Brezillon Sept. 4, 2023, 9:26 a.m. UTC | #6

On Mon, 4 Sep 2023 08:42:08 +0100
Steven Price <steven.price@arm.com> wrote:

> On 01/09/2023 17:10, Boris Brezillon wrote:
> > On Wed,  9 Aug 2023 18:53:15 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> +/**
> >> + * DOC: MMIO regions exposed to userspace.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> >> + *
> >> + * File offset for all MMIO regions being exposed to userspace. Don't use
> >> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> >> + *
> >> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> >> + * GPU cache flushling through CS instructions, but the flush reduction
> >> + * mechanism requires a flush_id. This flush_id could be queried with an
> >> + * ioctl, but Arm provides a well-isolated register page containing only this
> >> + * read-only register, so let's expose this page through a static mmap offset
> >> + * and allow direct mapping of this MMIO region so we can avoid the
> >> + * user <-> kernel round-trip.
> >> + */
> >> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)  
> > 
> > I'm playing with a 32-bit kernel/userspace, and this is problematic,
> > because vm_pgoff is limited to 32-bit there, meaning we can only map up
> > to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> > userspace set the mmio range?  
> 
> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
> user space is likely to still be needed.

Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
chance you could advise me on what to do here?

1. assume this limitation is here for a good reason, and limit the GPU
VA space to 32-bits on 32-bit kernels

or

2. update the interface to make iova an u64

Steven Price Sept. 4, 2023, 3:22 p.m. UTC | #7

On 04/09/2023 10:26, Boris Brezillon wrote:
> On Mon, 4 Sep 2023 08:42:08 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 01/09/2023 17:10, Boris Brezillon wrote:
>>> On Wed,  9 Aug 2023 18:53:15 +0200
>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>   
>>>> +/**
>>>> + * DOC: MMIO regions exposed to userspace.
>>>> + *
>>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>>>> + *
>>>> + * File offset for all MMIO regions being exposed to userspace. Don't use
>>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
>>>> + *
>>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>>>> + *
>>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
>>>> + * GPU cache flushling through CS instructions, but the flush reduction
>>>> + * mechanism requires a flush_id. This flush_id could be queried with an
>>>> + * ioctl, but Arm provides a well-isolated register page containing only this
>>>> + * read-only register, so let's expose this page through a static mmap offset
>>>> + * and allow direct mapping of this MMIO region so we can avoid the
>>>> + * user <-> kernel round-trip.
>>>> + */
>>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)  
>>>
>>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
>>> because vm_pgoff is limited to 32-bit there, meaning we can only map up
>>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
>>> userspace set the mmio range?  
>>
>> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
>> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
>> user space is likely to still be needed.
> 
> Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
> interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
> the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
> chance you could advise me on what to do here?
> 
> 1. assume this limitation is here for a good reason, and limit the GPU
> VA space to 32-bits on 32-bit kernels
> 
> or
> 
> 2. update the interface to make iova an u64

I'm not sure I can answer the question from a technical perspective,
hopefully Robin will be able to.

But why do we care about 32-bit kernels on a platform which is new
enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?

Given the other limitations present in a 32-bit kernel I'd be tempted to
say '1' just for simplicity. Especially since apparently we've lived
with this for panfrost which presumably has the same limitation (even
though all Bifrost/Midgard GPUs have at least 33 bits of VA space).

Steve

Robin Murphy Sept. 4, 2023, 4:06 p.m. UTC | #8

On 2023-08-09 17:53, Boris Brezillon wrote:
[...]
> +/**
> + * struct drm_panthor_vm_create - Arguments passed to DRM_PANTHOR_IOCTL_VM_CREATE
> + */
> +struct drm_panthor_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID. */
> +	__u32 id;
> +
> +	/**
> +	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
> +	 *
> +	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
> +	 *
> +	 * Kernel VA space is always placed at the top of the supported VA range.
> +	 */
> +	__u64 kernel_va_range;

Off the back of the "IOVA as unsigned long" concern, Boris and I 
reasoned through the 64-bit vs. 32-bit vs. compat cases on IRC, and it 
seems like this kernel_va_range argument is a source of much of the pain.

Rather than have userspace specify a quantity which it shouldn't care 
about and depend on assumptions of kernel behaviour to infer the 
quantity which *is* relevant (i.e. how large the usable range of the VM 
will actually be), I think it would be considerably more logical for 
userspace to simply request the size of usable VM it actually wants. 
Then it would be straightforward and consistent to define the default 
value in terms of the minimum of half the GPU VA size or TASK_SIZE (the 
latter being the largest *meaningful* value in all 3 cases), and it's 
still easy enough for the kernel to deduce for itself whether there's a 
reasonable amount of space left between the requested limit and 
ULONG_MAX for it to use. 32-bit kernels should then get at least 1GB to 
play with, for compat the kernel BOs can get well out of the way into 
the >32-bit range, and it's only really 64-bit where userspace is liable 
to see "kernel" VA space impinging on usable process VAs. Even then 
we're not sure that's a significant concern beyond OpenCL SVM.

Thanks,
Robin.

Boris Brezillon Sept. 4, 2023, 4:16 p.m. UTC | #9

On Mon, 4 Sep 2023 16:22:19 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/09/2023 10:26, Boris Brezillon wrote:
> > On Mon, 4 Sep 2023 08:42:08 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 01/09/2023 17:10, Boris Brezillon wrote:  
> >>> On Wed,  9 Aug 2023 18:53:15 +0200
> >>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>>     
> >>>> +/**
> >>>> + * DOC: MMIO regions exposed to userspace.
> >>>> + *
> >>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> >>>> + *
> >>>> + * File offset for all MMIO regions being exposed to userspace. Don't use
> >>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> >>>> + *
> >>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> >>>> + *
> >>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> >>>> + * GPU cache flushling through CS instructions, but the flush reduction
> >>>> + * mechanism requires a flush_id. This flush_id could be queried with an
> >>>> + * ioctl, but Arm provides a well-isolated register page containing only this
> >>>> + * read-only register, so let's expose this page through a static mmap offset
> >>>> + * and allow direct mapping of this MMIO region so we can avoid the
> >>>> + * user <-> kernel round-trip.
> >>>> + */
> >>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)    
> >>>
> >>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
> >>> because vm_pgoff is limited to 32-bit there, meaning we can only map up
> >>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> >>> userspace set the mmio range?    
> >>
> >> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
> >> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
> >> user space is likely to still be needed.  
> > 
> > Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
> > interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
> > the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
> > chance you could advise me on what to do here?
> > 
> > 1. assume this limitation is here for a good reason, and limit the GPU
> > VA space to 32-bits on 32-bit kernels
> > 
> > or
> > 
> > 2. update the interface to make iova an u64  
> 
> I'm not sure I can answer the question from a technical perspective,
> hopefully Robin will be able to.

Had a quick chat with Robin, and he's recommending going for #1 too.

> 
> But why do we care about 32-bit kernels on a platform which is new
> enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?

Apparently the memory you save by switching to a 32-bit kernel matters
to some people. To clarify, the CPU is aarch64, but they want to use it
in 32-bit mode.

> 
> Given the other limitations present in a 32-bit kernel I'd be tempted to
> say '1' just for simplicity. Especially since apparently we've lived
> with this for panfrost which presumably has the same limitation (even
> though all Bifrost/Midgard GPUs have at least 33 bits of VA space).

Well, Panfrost is simpler in that you don't have this kernel VA range,
and, IIRC, we are using the old format that naturally limits the GPU VA
space to 4G.

Robin Murphy Sept. 4, 2023, 4:25 p.m. UTC | #10

On 2023-09-04 17:16, Boris Brezillon wrote:
> On Mon, 4 Sep 2023 16:22:19 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 04/09/2023 10:26, Boris Brezillon wrote:
>>> On Mon, 4 Sep 2023 08:42:08 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>>    
>>>> On 01/09/2023 17:10, Boris Brezillon wrote:
>>>>> On Wed,  9 Aug 2023 18:53:15 +0200
>>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>>>      
>>>>>> +/**
>>>>>> + * DOC: MMIO regions exposed to userspace.
>>>>>> + *
>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>>>>>> + *
>>>>>> + * File offset for all MMIO regions being exposed to userspace. Don't use
>>>>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
>>>>>> + *
>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>>>>>> + *
>>>>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
>>>>>> + * GPU cache flushling through CS instructions, but the flush reduction
>>>>>> + * mechanism requires a flush_id. This flush_id could be queried with an
>>>>>> + * ioctl, but Arm provides a well-isolated register page containing only this
>>>>>> + * read-only register, so let's expose this page through a static mmap offset
>>>>>> + * and allow direct mapping of this MMIO region so we can avoid the
>>>>>> + * user <-> kernel round-trip.
>>>>>> + */
>>>>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
>>>>>
>>>>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
>>>>> because vm_pgoff is limited to 32-bit there, meaning we can only map up
>>>>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
>>>>> userspace set the mmio range?
>>>>
>>>> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
>>>> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
>>>> user space is likely to still be needed.
>>>
>>> Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
>>> interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
>>> the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
>>> chance you could advise me on what to do here?
>>>
>>> 1. assume this limitation is here for a good reason, and limit the GPU
>>> VA space to 32-bits on 32-bit kernels
>>>
>>> or
>>>
>>> 2. update the interface to make iova an u64
>>
>> I'm not sure I can answer the question from a technical perspective,
>> hopefully Robin will be able to.
> 
> Had a quick chat with Robin, and he's recommending going for #1 too.
> 
>>
>> But why do we care about 32-bit kernels on a platform which is new
>> enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?
> 
> Apparently the memory you save by switching to a 32-bit kernel matters
> to some people. To clarify, the CPU is aarch64, but they want to use it
> in 32-bit mode.
> 
>>
>> Given the other limitations present in a 32-bit kernel I'd be tempted to
>> say '1' just for simplicity. Especially since apparently we've lived
>> with this for panfrost which presumably has the same limitation (even
>> though all Bifrost/Midgard GPUs have at least 33 bits of VA space).
> 
> Well, Panfrost is simpler in that you don't have this kernel VA range,
> and, IIRC, we are using the old format that naturally limits the GPU VA
> space to 4G.

FWIW the legacy pagetable format itself should be fine going up to 
however many bits the GPU supports, however there were various ISA 
limitations around crossing 4GB boundaries, and the easiest way to avoid 
having to think about those was to just not use more than 4GB of VA at 
all (minus chunks at the ends for similar weird ISA reasons).

Cheers,
Robin.

Steven Price Sept. 6, 2023, 10:55 a.m. UTC | #11

On 04/09/2023 17:25, Robin Murphy wrote:
> On 2023-09-04 17:16, Boris Brezillon wrote:
>> On Mon, 4 Sep 2023 16:22:19 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> On 04/09/2023 10:26, Boris Brezillon wrote:
>>>> On Mon, 4 Sep 2023 08:42:08 +0100
>>>> Steven Price <steven.price@arm.com> wrote:
>>>>   
>>>>> On 01/09/2023 17:10, Boris Brezillon wrote:
>>>>>> On Wed,  9 Aug 2023 18:53:15 +0200
>>>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>>>>     
>>>>>>> +/**
>>>>>>> + * DOC: MMIO regions exposed to userspace.
>>>>>>> + *
>>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>>>>>>> + *
>>>>>>> + * File offset for all MMIO regions being exposed to userspace.
>>>>>>> Don't use
>>>>>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET
>>>>>>> values instead.
>>>>>>> + *
>>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>>>>>>> + *
>>>>>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace
>>>>>>> driver controls
>>>>>>> + * GPU cache flushling through CS instructions, but the flush
>>>>>>> reduction
>>>>>>> + * mechanism requires a flush_id. This flush_id could be queried
>>>>>>> with an
>>>>>>> + * ioctl, but Arm provides a well-isolated register page
>>>>>>> containing only this
>>>>>>> + * read-only register, so let's expose this page through a
>>>>>>> static mmap offset
>>>>>>> + * and allow direct mapping of this MMIO region so we can avoid the
>>>>>>> + * user <-> kernel round-trip.
>>>>>>> + */
>>>>>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET        (0x1ull << 56)
>>>>>>
>>>>>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
>>>>>> because vm_pgoff is limited to 32-bit there, meaning we can only
>>>>>> map up
>>>>>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
>>>>>> userspace set the mmio range?
>>>>>
>>>>> Hmm, I was rather hoping we could ignore 32 bit these days ;) But
>>>>> while
>>>>> I can't see why anyone would be running a 32 bit kernel, I guess 32
>>>>> bit
>>>>> user space is likely to still be needed.
>>>>
>>>> Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
>>>> interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
>>>> the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
>>>> chance you could advise me on what to do here?
>>>>
>>>> 1. assume this limitation is here for a good reason, and limit the GPU
>>>> VA space to 32-bits on 32-bit kernels
>>>>
>>>> or
>>>>
>>>> 2. update the interface to make iova an u64
>>>
>>> I'm not sure I can answer the question from a technical perspective,
>>> hopefully Robin will be able to.
>>
>> Had a quick chat with Robin, and he's recommending going for #1 too.
>>
>>>
>>> But why do we care about 32-bit kernels on a platform which is new
>>> enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?
>>
>> Apparently the memory you save by switching to a 32-bit kernel matters
>> to some people. To clarify, the CPU is aarch64, but they want to use it
>> in 32-bit mode.
>>
>>>
>>> Given the other limitations present in a 32-bit kernel I'd be tempted to
>>> say '1' just for simplicity. Especially since apparently we've lived
>>> with this for panfrost which presumably has the same limitation (even
>>> though all Bifrost/Midgard GPUs have at least 33 bits of VA space).
>>
>> Well, Panfrost is simpler in that you don't have this kernel VA range,
>> and, IIRC, we are using the old format that naturally limits the GPU VA
>> space to 4G.
> 
> FWIW the legacy pagetable format itself should be fine going up to
> however many bits the GPU supports, however there were various ISA
> limitations around crossing 4GB boundaries, and the easiest way to avoid
> having to think about those was to just not use more than 4GB of VA at
> all (minus chunks at the ends for similar weird ISA reasons).

Exactly right. The legacy pagetable format supports the full range of
VA. Indeed kbase used the legacy format for a long time.

However the ISA has special handling for addresses where bits 31:12 are
all 0 or all 1, so we have to avoid executable code landing on these
regions. kbase has a modified version of 'unmapped_area_topdown'[1] to
handle these additional restrictions.

Steve

[1] kbase_unmapped_area_topdown()
https://android.googlesource.com/kernel/google-modules/gpu/+/refs/tags/android-12.0.0_r0.42/mali_kbase/thirdparty/mali_kbase_mmap.c#97

Ketil Johnsen Sept. 6, 2023, 12:18 p.m. UTC | #12

On 8/9/23 18:53, Boris Brezillon wrote:

> +enum drm_panthor_sync_op_flags {
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
> +	 * object type.
> +	 */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
> +	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
> +	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,
> +};

We get an issue with --pedantic here:

warning: enumerator value for 'DRM_PANTHOR_SYNC_OP_SIGNAL' is not an 
integer constant expression [-Wpedantic]

Would be god to get rid of this, so user space can include this header 
without disabling pedantic. Either we can stop using the top most bit or 
a cast value like "(int)(1U << 31)"

> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,

Same issue for this member. Either not use the top most bit or cast 
value like "(int)(0xfU << 28)" avoids the pedantic warning.

--
Regards,
Ketil Johnsen

[v2,02/15] drm/panthor: Add uAPI

Commit Message

Comments

Patch