
[RFC] drm/pancsf: Add a new driver for Mali CSF-based GPUs

Message ID 20230201084832.1708866-1-boris.brezillon@collabora.com (mailing list archive)
State New, archived
Series [RFC] drm/pancsf: Add a new driver for Mali CSF-based GPUs

Commit Message

Boris Brezillon Feb. 1, 2023, 8:48 a.m. UTC
Mali v10 (the second Valhall iteration) and later GPUs replaced the Job
Manager block with a command stream based interface called CSF (for
Command Stream Frontend). This interface not only turns the job chain
based submission model into a command stream based one, but also
introduces FW-assisted scheduling of command stream queues. This is a
fundamental shift in both how userspace is supposed to submit jobs and
how the driver is architected. We initially tried to retrofit the CSF
model into panfrost, but this ended up introducing unneeded complexity
to the existing driver, which we all know is a potential source of
regressions.

So here comes a brand new driver for CSF-based hardware. This is a
preliminary version and some important features are missing (like
devfreq, PM support and a memory shrinker implementation, to name a
few). The goal of
this RFC is to gather some preliminary feedback on both the uAPI and some
basic building blocks, like the MMU/VM code, the tiler heap allocation
logic...

It's also here to give concrete code to refer to for the discussion
around scheduling and VM_BIND support that started on the Xe/nouveau
threads[1][2]. Right now, I'm still using a custom timesharing-based
scheduler, but I plan to give Daniel's suggestion a try (having one
drm_gpu_scheduler per drm_sched_entity, and replacing the tick-based
scheduler by some group slot manager with an LRU-based group eviction
mechanism). I also have a bunch of things I need to figure out regarding
the VM-based memory management code. The current design assumes explicit
syncs everywhere, but we don't use resv objects yet. I see other modern
drivers are adding BOOKKEEP fences to the VM resv object and using this
VM resv to synchronize with kernel operations on the VM, but we
currently don't do any of that. As Daniel pointed out it's likely to
become an issue when we throw the memory shrinker into the mix. And of
course, the plan is to transition to the drm_gpuva_manager infrastructure
being discussed here [2] before merging the driver. Kind of related to
this shrinker topic, I'm wondering if it wouldn't make sense to use
the TTM infra for our buffer management (AFAIU, we'd get LRU-based BO
eviction for free, without needing to expose an MADVISE(DONT_NEED) kind
of API), but I'm a bit worried about the extra complexity this would pull
in.
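
To make the resv part a bit more concrete, the rough idea (pure sketch,
none of this is in the series yet, and the helper name is made up)
would be to attach job fences to the VM resv as BOOKKEEP fences:

#include <linux/dma-resv.h>

static int pancsf_vm_add_bookkeep_fence(struct dma_resv *vm_resv,
					struct dma_fence *fence)
{
	int ret;

	/* Serialize against other users of the VM resv. */
	ret = dma_resv_lock(vm_resv, NULL);
	if (ret)
		return ret;

	/* Reserve a fence slot, then record the fence as BOOKKEEP so
	 * implicit-sync users don't wait on it, but kernel operations
	 * on the VM (and later the shrinker) can.
	 */
	ret = dma_resv_reserve_fences(vm_resv, 1);
	if (!ret)
		dma_resv_add_fence(vm_resv, fence, DMA_RESV_USAGE_BOOKKEEP);

	dma_resv_unlock(vm_resv);
	return ret;
}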

Note that DT bindings are currently undocumented. For those who really
care, they're based on the panfrost bindings, so I don't expect any
pain points on that front. I'll provide a proper doc once all other
aspects have been sorted out.

Regards,

Boris

[1]https://lore.kernel.org/dri-devel/20221222222127.34560-1-matthew.brost@intel.com/
[2]https://lore.kernel.org/lkml/Y8jOCE%2FPyNZ2Z6aX@DUT025-TGLU.fm.intel.com/

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Stone <daniels@collabora.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/gpu/drm/Kconfig                |    2 +
 drivers/gpu/drm/Makefile               |    1 +
 drivers/gpu/drm/pancsf/Kconfig         |   15 +
 drivers/gpu/drm/pancsf/Makefile        |   14 +
 drivers/gpu/drm/pancsf/pancsf_device.c |  391 ++++
 drivers/gpu/drm/pancsf/pancsf_device.h |  168 ++
 drivers/gpu/drm/pancsf/pancsf_drv.c    |  812 +++++++
 drivers/gpu/drm/pancsf/pancsf_gem.c    |  161 ++
 drivers/gpu/drm/pancsf/pancsf_gem.h    |   45 +
 drivers/gpu/drm/pancsf/pancsf_gpu.c    |  381 ++++
 drivers/gpu/drm/pancsf/pancsf_gpu.h    |   40 +
 drivers/gpu/drm/pancsf/pancsf_heap.c   |  337 +++
 drivers/gpu/drm/pancsf/pancsf_heap.h   |   30 +
 drivers/gpu/drm/pancsf/pancsf_mcu.c    |  891 ++++++++
 drivers/gpu/drm/pancsf/pancsf_mcu.h    |  313 +++
 drivers/gpu/drm/pancsf/pancsf_mmu.c    | 1345 +++++++++++
 drivers/gpu/drm/pancsf/pancsf_mmu.h    |   51 +
 drivers/gpu/drm/pancsf/pancsf_regs.h   |  225 ++
 drivers/gpu/drm/pancsf/pancsf_sched.c  | 2837 ++++++++++++++++++++++++
 drivers/gpu/drm/pancsf/pancsf_sched.h  |   68 +
 include/uapi/drm/pancsf_drm.h          |  414 ++++
 21 files changed, 8541 insertions(+)
 create mode 100644 drivers/gpu/drm/pancsf/Kconfig
 create mode 100644 drivers/gpu/drm/pancsf/Makefile
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_device.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_device.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_drv.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_gem.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_gem.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_gpu.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_gpu.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_heap.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_heap.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_mcu.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_mcu.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_mmu.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_mmu.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_regs.h
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_sched.c
 create mode 100644 drivers/gpu/drm/pancsf/pancsf_sched.h
 create mode 100644 include/uapi/drm/pancsf_drm.h

Comments

Steven Price Feb. 3, 2023, 3:41 p.m. UTC | #1
Hi Boris,

Thanks for the post - it's great to see the progress!

On 01/02/2023 08:48, Boris Brezillon wrote:
> Mali v10 (the second Valhall iteration) and later GPUs replaced the Job
> Manager block with a command stream based interface called CSF (for
> Command Stream Frontend). This interface not only turns the job chain
> based submission model into a command stream based one, but also
> introduces FW-assisted scheduling of command stream queues. This is a
> fundamental shift in both how userspace is supposed to submit jobs and
> how the driver is architected. We initially tried to retrofit the CSF
> model into panfrost, but this ended up introducing unneeded complexity
> to the existing driver, which we all know is a potential source of
> regressions.

While I agree there's some big differences which effectively mandate
splitting the driver I do think there are some parts which make a lot of
sense to share.

For example pancsf_regs.h and panfrost_regs.h are really quite similar
and I think could easily be combined. The clock/regulator code is pretty
much a direct copy/paste (just adding support for more clocks), etc.

What would be ideal is factoring out 'generic' parts from panfrost and
then being able to use them from pancsf.

I had a go at starting that:

https://gitlab.arm.com/linux-arm/linux-sp/-/tree/pancsf-refactor

(lightly tested for Panfrost, only build tested for pancsf).

That saves around 200 lines overall and avoids needing to maintain two
lots of clock/regulator code. There's definite scope for sharing (most)
register definitions between the drivers and quite possibly some of the
MMU/memory code (although there's diminishing returns there).

> So here comes a brand new driver for CSF-based hardware. This is a
> preliminary version and some important features are missing (like
> devfreq, PM support and a memory shrinker implementation, to name a
> few). The goal of
> this RFC is to gather some preliminary feedback on both the uAPI and some
> basic building blocks, like the MMU/VM code, the tiler heap allocation
> logic...

At the moment I don't have any CSF hardware available, so this review is
a pure code review. I'll try to organise some hardware and do some
testing, but it's probably going to take a while to arrive and get setup.

> It's also here to give concrete code to refer to for the discussion
> around scheduling and VM_BIND support that started on the Xe/nouveau
> threads[1][2]. Right now, I'm still using a custom timesharing-based
> scheduler, but I plan to give Daniel's suggestion a try (having one
> drm_gpu_scheduler per drm_sched_entity, and replacing the tick-based
> scheduler by some group slot manager with an LRU-based group eviction
> mechanism). I also have a bunch of things I need to figure out regarding
> the VM-based memory management code. The current design assumes explicit
> syncs everywhere, but we don't use resv objects yet. I see other modern
> drivers are adding BOOKKEEP fences to the VM resv object and using this
> VM resv to synchronize with kernel operations on the VM, but we
> currently don't do any of that. As Daniel pointed out it's likely to
> become an issue when we throw the memory shrinker into the mix. And of
> course, the plan is to transition to the drm_gpuva_manager infrastructure
> being discussed here [2] before merging the driver. Kind of related to
> this shrinker topic, I'm wondering if it wouldn't make sense to use
> the TTM infra for our buffer management (AFAIU, we'd get LRU-based BO
> eviction for free, without needing to expose an MADVISE(DONT_NEED) kind
> of API), but I'm a bit worried about the extra complexity this would pull
> in.

I'm afraid I haven't done much review of this yet - my knowledge of how
this is done in kbase is lacking as I was already leaving GPU around the
time this was being implemented... but since it's all about to change
perhaps that's for the best ;)

> Note that DT bindings are currently undocumented. For those who really
> care, they're based on the panfrost bindings, so I don't expect any
> pain points on that front. I'll provide a proper doc once all other
> aspects have been sorted out.
> 
> Regards,
> 
> Boris
> 
> [1]https://lore.kernel.org/dri-devel/20221222222127.34560-1-matthew.brost@intel.com/
> [2]https://lore.kernel.org/lkml/Y8jOCE%2FPyNZ2Z6aX@DUT025-TGLU.fm.intel.com/
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Daniel Stone <daniels@collabora.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> ---
>  drivers/gpu/drm/Kconfig                |    2 +
>  drivers/gpu/drm/Makefile               |    1 +
>  drivers/gpu/drm/pancsf/Kconfig         |   15 +
>  drivers/gpu/drm/pancsf/Makefile        |   14 +
>  drivers/gpu/drm/pancsf/pancsf_device.c |  391 ++++
>  drivers/gpu/drm/pancsf/pancsf_device.h |  168 ++
>  drivers/gpu/drm/pancsf/pancsf_drv.c    |  812 +++++++
>  drivers/gpu/drm/pancsf/pancsf_gem.c    |  161 ++
>  drivers/gpu/drm/pancsf/pancsf_gem.h    |   45 +
>  drivers/gpu/drm/pancsf/pancsf_gpu.c    |  381 ++++
>  drivers/gpu/drm/pancsf/pancsf_gpu.h    |   40 +
>  drivers/gpu/drm/pancsf/pancsf_heap.c   |  337 +++
>  drivers/gpu/drm/pancsf/pancsf_heap.h   |   30 +
>  drivers/gpu/drm/pancsf/pancsf_mcu.c    |  891 ++++++++
>  drivers/gpu/drm/pancsf/pancsf_mcu.h    |  313 +++
>  drivers/gpu/drm/pancsf/pancsf_mmu.c    | 1345 +++++++++++
>  drivers/gpu/drm/pancsf/pancsf_mmu.h    |   51 +
>  drivers/gpu/drm/pancsf/pancsf_regs.h   |  225 ++
>  drivers/gpu/drm/pancsf/pancsf_sched.c  | 2837 ++++++++++++++++++++++++
>  drivers/gpu/drm/pancsf/pancsf_sched.h  |   68 +
>  include/uapi/drm/pancsf_drm.h          |  414 ++++
>  21 files changed, 8541 insertions(+)
>  create mode 100644 drivers/gpu/drm/pancsf/Kconfig
>  create mode 100644 drivers/gpu/drm/pancsf/Makefile
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_device.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_device.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_drv.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_gem.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_gem.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_gpu.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_gpu.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_heap.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_heap.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_mcu.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_mcu.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_mmu.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_mmu.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_regs.h
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_sched.c
>  create mode 100644 drivers/gpu/drm/pancsf/pancsf_sched.h
>  create mode 100644 include/uapi/drm/pancsf_drm.h
> 

<snip>

> diff --git a/drivers/gpu/drm/pancsf/pancsf_drv.c b/drivers/gpu/drm/pancsf/pancsf_drv.c
> new file mode 100644
> index 000000000000..05c4df13ba90
> --- /dev/null
> +++ b/drivers/gpu/drm/pancsf/pancsf_drv.c
> @@ -0,0 +1,812 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> +/* Copyright 2019 Collabora ltd. */
> +
> +#include <linux/module.h>
> +#include <linux/of_platform.h>
> +#include <linux/pagemap.h>
> +#include <linux/pm_runtime.h>
> +#include <drm/pancsf_drm.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_ioctl.h>
> +#include <drm/drm_syncobj.h>
> +#include <drm/drm_utils.h>
> +#include <drm/drm_debugfs.h>
> +
> +#include "pancsf_sched.h"
> +#include "pancsf_device.h"
> +#include "pancsf_gem.h"
> +#include "pancsf_heap.h"
> +#include "pancsf_mcu.h"
> +#include "pancsf_mmu.h"
> +#include "pancsf_gpu.h"
> +
> +#define DRM_PANCSF_SYNC_OP_MIN_SIZE		24
> +#define DRM_PANCSF_QUEUE_SUBMIT_MIN_SIZE	32
> +#define DRM_PANCSF_QUEUE_CREATE_MIN_SIZE	8
> +#define DRM_PANCSF_VM_BIND_OP_MIN_SIZE		48

I'm not sure why these are #defines rather than using sizeof()?

> +static int pancsf_ioctl_dev_query(struct drm_device *ddev, void *data, struct drm_file *file)
> +{
> +	struct drm_pancsf_dev_query *args = data;
> +	struct pancsf_device *pfdev = ddev->dev_private;
> +	const void *src;
> +	size_t src_size;
> +
> +	switch (args->type) {
> +	case DRM_PANCSF_DEV_QUERY_GPU_INFO:
> +		src_size = sizeof(pfdev->gpu_info);
> +		src = &pfdev->gpu_info;
> +		break;
> +	case DRM_PANCSF_DEV_QUERY_CSIF_INFO:
> +		src_size = sizeof(pfdev->csif_info);
> +		src = &pfdev->csif_info;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (!args->pointer) {
> +		args->size = src_size;
> +		return 0;
> +	}
> +
> +	args->size = min_t(unsigned long, src_size, args->size);
> +	if (copy_to_user((void __user *)(uintptr_t)args->pointer, src, args->size))

NIT: use u64_to_user_ptr()
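
i.e.:

	if (copy_to_user(u64_to_user_ptr(args->pointer), src, args->size))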

> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +#define PANCSF_MAX_VMS_PER_FILE		32
> +#define PANCSF_VM_CREATE_FLAGS		0
> +
> +int pancsf_ioctl_vm_create(struct drm_device *ddev, void *data,
> +			   struct drm_file *file)
> +{
> +	struct pancsf_device *pfdev = ddev->dev_private;
> +	struct pancsf_file *pfile = file->driver_priv;
> +	struct drm_pancsf_vm_create *args = data;
> +	int ret;
> +
> +	if (args->flags & ~PANCSF_VM_CREATE_FLAGS)
> +		return -EINVAL;
> +
> +	ret = pancsf_vm_pool_create_vm(pfdev, pfile->vms);
> +	if (ret < 0)
> +		return ret;
> +
> +	args->id = ret;
> +	return 0;
> +}
> +
> +int pancsf_ioctl_vm_destroy(struct drm_device *ddev, void *data,
> +			    struct drm_file *file)
> +{
> +	struct pancsf_file *pfile = file->driver_priv;
> +	struct drm_pancsf_vm_destroy *args = data;
> +
> +	pancsf_vm_pool_destroy_vm(pfile->vms, args->id);
> +	return 0;
> +}
> +
> +#define PANCSF_BO_FLAGS		0
> +
> +static int pancsf_ioctl_bo_create(struct drm_device *ddev, void *data,
> +				  struct drm_file *file)
> +{
> +	struct pancsf_gem_object *bo;
> +	struct drm_pancsf_bo_create *args = data;
> +
> +	if (!args->size || args->pad ||
> +	    (args->flags & ~PANCSF_BO_FLAGS))
> +		return -EINVAL;
> +
> +	bo = pancsf_gem_create_with_handle(file, ddev, args->size, args->flags,
> +					   &args->handle);
> +	if (IS_ERR(bo))
> +		return PTR_ERR(bo);
> +
> +	return 0;
> +}
> +
> +#define PANCSF_VMA_MAP_FLAGS (PANCSF_VMA_MAP_READONLY | \
> +			      PANCSF_VMA_MAP_NOEXEC | \
> +			      PANCSF_VMA_MAP_UNCACHED | \
> +			      PANCSF_VMA_MAP_FRAG_SHADER | \
> +			      PANCSF_VMA_MAP_ON_FAULT | \
> +			      PANCSF_VMA_MAP_AUTO_VA)
> +
> +static int pancsf_ioctl_vm_map(struct drm_device *ddev, void *data,
> +			       struct drm_file *file)
> +{
> +	struct pancsf_file *pfile = file->driver_priv;
> +	struct drm_pancsf_vm_map *args = data;
> +	struct drm_gem_object *gem;
> +	struct pancsf_vm *vm;
> +	int ret;
> +
> +	if (args->flags & ~PANCSF_VMA_MAP_FLAGS)
> +		return -EINVAL;
> +
> +	gem = drm_gem_object_lookup(file, args->bo_handle);
> +	if (!gem)
> +		return -EINVAL;
> +
> +	vm = pancsf_vm_pool_get_vm(pfile->vms, args->vm_id);
> +	if (vm) {
> +		ret = pancsf_vm_map_bo_range(vm, to_pancsf_bo(gem), args->bo_offset,
> +					     args->size, &args->va, args->flags);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	pancsf_vm_put(vm);
> +	drm_gem_object_put(gem);
> +	return ret;
> +}
> +
> +#define PANCSF_VMA_UNMAP_FLAGS 0
> +
> +static int pancsf_ioctl_vm_unmap(struct drm_device *ddev, void *data,
> +				 struct drm_file *file)
> +{
> +	struct pancsf_file *pfile = file->driver_priv;
> +	struct drm_pancsf_vm_unmap *args = data;
> +	struct pancsf_vm *vm;
> +	int ret;
> +
> +	if (args->flags & ~PANCSF_VMA_UNMAP_FLAGS)
> +		return -EINVAL;
> +
> +	vm = pancsf_vm_pool_get_vm(pfile->vms, args->vm_id);
> +	if (vm)
> +		ret = pancsf_vm_unmap_range(vm, args->va, args->size);
> +	else
> +		ret = -EINVAL;
> +
> +	pancsf_vm_put(vm);
> +	return ret;
> +}
> +
> +static void *pancsf_get_obj_array(struct drm_pancsf_obj_array *in, u32 min_stride)
> +{
> +	u32 elem_size = min_t(u32, in->stride, min_stride);
> +	int ret = 0;
> +	void *out;
> +
> +	if (in->stride < min_stride)
> +		return ERR_PTR(-EINVAL);
> +
> +	out = kvmalloc_array(in->count, elem_size, GFP_KERNEL);
> +	if (!out)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (elem_size == in->stride) {
> +		if (copy_from_user(out, u64_to_user_ptr(in->array), elem_size * in->count))
> +			ret = -EFAULT;
> +	} else {
> +		void __user *in_ptr = u64_to_user_ptr(in->array);
> +		void *out_ptr = out;
> +		u32 i;
> +
> +		for (i = 0; i < in->count; i++) {
> +			if (copy_from_user(out_ptr, in_ptr, elem_size)) {
> +				ret = -EFAULT;
> +				break;
> +			}

Missing the increment of out_ptr and in_ptr.

> +		}
> +	}
> +
> +	if (ret) {
> +		kvfree(out);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return out;
> +}
> +
> +static int pancsf_add_job_deps(struct drm_file *file, struct pancsf_job *job,
> +			       struct drm_pancsf_sync_op *sync_ops, u32 sync_op_count)
> +{
> +	u32 i;
> +
> +	for (i = 0; i < sync_op_count; i++) {
> +		struct dma_fence *fence;
> +		int ret;
> +
> +		if (sync_ops[i].op_type != DRM_PANCSF_SYNC_OP_WAIT)
> +			continue;
> +
> +		switch (sync_ops[i].handle_type) {
> +		case DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ:
> +		case DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ:
> +			ret = drm_syncobj_find_fence(file, sync_ops[i].handle,
> +						     sync_ops[i].timeline_value,
> +						     0, &fence);
> +			if (ret)
> +				return ret;
> +
> +			ret = pancsf_add_job_dep(job, fence);
> +			if (ret) {
> +				dma_fence_put(fence);
> +				return ret;
> +			}
> +			break;
> +
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +struct pancsf_sync_signal {
> +	struct drm_syncobj *syncobj;
> +	struct dma_fence_chain *chain;
> +	u64 point;
> +};
> +
> +struct pancsf_sync_signal_array {
> +	struct pancsf_sync_signal *signals;
> +	u32 count;
> +};
> +
> +void
> +pancsf_free_sync_signal_array(struct pancsf_sync_signal_array *array)
> +{
> +	u32 i;
> +
> +	for (i = 0; i < array->count; i++) {
> +		drm_syncobj_put(array->signals[i].syncobj);
> +		dma_fence_chain_free(array->signals[i].chain);
> +	}
> +
> +	kvfree(array->signals);
> +	array->signals = NULL;
> +	array->count = 0;
> +}
> +
> +int
> +pancsf_collect_sync_signal_array(struct drm_file *file,
> +				 struct drm_pancsf_sync_op *sync_ops, u32 sync_op_count,
> +				 struct pancsf_sync_signal_array *array)
> +{
> +	u32 count = 0, i;
> +	int ret;
> +
> +	for (i = 0; i < sync_op_count; i++) {
> +		if (sync_ops[i].op_type == DRM_PANCSF_SYNC_OP_SIGNAL)
> +			count++;
> +	}
> +
> +	array->signals = kvmalloc_array(count, sizeof(*array->signals), GFP_KERNEL | __GFP_ZERO);
> +	if (!array->signals)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < sync_op_count; i++) {
> +		int ret;
> +
> +		if (sync_ops[i].op_type != DRM_PANCSF_SYNC_OP_SIGNAL)
> +			continue;
> +
> +		switch (sync_ops[i].handle_type) {
> +		case DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ:
> +			array->signals[array->count].chain = dma_fence_chain_alloc();
> +			if (!array->signals[array->count].chain) {
> +				ret = -ENOMEM;
> +				goto err;
> +			}
> +
> +			array->signals[array->count].point = sync_ops[i].timeline_value;
> +			fallthrough;
> +
> +		case DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ:
> +			array->signals[array->count].syncobj = drm_syncobj_find(file, sync_ops[i].handle);
> +			if (!array->signals[array->count].syncobj) {
> +				ret = -EINVAL;
> +				goto err;
> +			}
> +
> +			array->count++;

NIT: it's not obvious in this function that array->count is set to 0
initially. It looks like it's correct because
pancsf_ioctl_group_submit() does an alloc with __GFP_ZERO, but it would
be much clearer with a local variable and assigning array->count at the end.
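
Something along these lines (untested, just to show the shape; note the
error path then needs to record how many entries were actually grabbed
before calling the free helper):

	u32 sig_count = 0;

	/* ... same loop as above, but indexing with sig_count ... */
		case DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ:
			array->signals[sig_count].syncobj =
				drm_syncobj_find(file, sync_ops[i].handle);
			if (!array->signals[sig_count].syncobj) {
				ret = -EINVAL;
				goto err;
			}

			sig_count++;
			break;
	/* ... */

	array->count = sig_count;
	return 0;

err:
	array->count = sig_count;
	pancsf_free_sync_signal_array(array);
	return ret;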

> +			break;
> +
> +		default:
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +	}
> +
> +	return 0;
> +
> +err:
> +	pancsf_free_sync_signal_array(array);
> +	return ret;
> +}

<gigantic snip - sorry, I phased out somewhat reading through all the
code - hopefully when I get a platform and start playing with it
properly I can give some meaningful feedback. Nothing jumped out as
looking seriously wrong>

> diff --git a/include/uapi/drm/pancsf_drm.h b/include/uapi/drm/pancsf_drm.h
> new file mode 100644
> index 000000000000..d770d1d1037c
> --- /dev/null
> +++ b/include/uapi/drm/pancsf_drm.h
> @@ -0,0 +1,414 @@
> +/* SPDX-License-Identifier: MIT */
> +/* Copyright (C) 2023 Collabora ltd. */
> +#ifndef _PANCSF_DRM_H_
> +#define _PANCSF_DRM_H_
> +
> +#include "drm.h"
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/*
> + * Userspace driver controls GPU cache flushing through CS instructions, but
> + * the flush reduction mechanism requires a flush_id. This flush_id could be
> + * queried with an ioctl, but Arm provides a well-isolated register page
> + * containing only this read-only register, so let's expose this page through
> + * a static mmap offset and allow direct mapping of this MMIO region so we
> + * can avoid the user <-> kernel round-trip.
> + */
> +#define DRM_PANCSF_USER_MMIO_OFFSET		(0xffffull << 48)
> +#define DRM_PANCSF_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANCSF_USER_MMIO_OFFSET | 0)
> +
> +/* Place new ioctls at the end, don't re-order. */
> +enum drm_pancsf_ioctl_id {
> +	DRM_PANCSF_DEV_QUERY = 0,
> +	DRM_PANCSF_VM_CREATE,
> +	DRM_PANCSF_VM_DESTROY,
> +	DRM_PANCSF_BO_CREATE,
> +	DRM_PANCSF_BO_MMAP_OFFSET,
> +	DRM_PANCSF_VM_MAP,
> +	DRM_PANCSF_VM_UNMAP,
> +	DRM_PANCSF_GROUP_CREATE,
> +	DRM_PANCSF_GROUP_DESTROY,
> +	DRM_PANCSF_TILER_HEAP_CREATE,
> +	DRM_PANCSF_TILER_HEAP_DESTROY,
> +	DRM_PANCSF_GROUP_SUBMIT,
> +};
> +
> +#define DRM_IOCTL_PANCSF_DEV_QUERY		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_DEV_QUERY, struct drm_pancsf_dev_query)
> +#define DRM_IOCTL_PANCSF_VM_CREATE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_CREATE, struct drm_pancsf_vm_create)
> +#define DRM_IOCTL_PANCSF_VM_DESTROY		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_DESTROY, struct drm_pancsf_vm_destroy)
> +#define DRM_IOCTL_PANCSF_BO_CREATE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_BO_CREATE, struct drm_pancsf_bo_create)
> +#define DRM_IOCTL_PANCSF_BO_MMAP_OFFSET		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_BO_MMAP_OFFSET, struct drm_pancsf_bo_mmap_offset)
> +#define DRM_IOCTL_PANCSF_VM_MAP			DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_MAP, struct drm_pancsf_vm_map)
> +#define DRM_IOCTL_PANCSF_VM_UNMAP		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_UNMAP, struct drm_pancsf_vm_unmap)
> +#define DRM_IOCTL_PANCSF_GROUP_CREATE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_GROUP_CREATE, struct drm_pancsf_group_create)
> +#define DRM_IOCTL_PANCSF_GROUP_DESTROY		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_GROUP_DESTROY, struct drm_pancsf_group_destroy)
> +#define DRM_IOCTL_PANCSF_TILER_HEAP_CREATE	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_TILER_HEAP_CREATE, struct drm_pancsf_tiler_heap_create)
> +#define DRM_IOCTL_PANCSF_TILER_HEAP_DESTROY	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_TILER_HEAP_DESTROY, struct drm_pancsf_tiler_heap_destroy)
> +#define DRM_IOCTL_PANCSF_GROUP_SUBMIT		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_GROUP_SUBMIT, struct drm_pancsf_group_submit)
> +
> +/* Place new types at the end, don't re-order. */
> +enum drm_pancsf_dev_query_type {
> +	DRM_PANCSF_DEV_QUERY_GPU_INFO = 0,
> +	DRM_PANCSF_DEV_QUERY_CSIF_INFO,
> +};
> +
> +struct drm_pancsf_gpu_info {
> +#define DRM_PANCSF_ARCH_MAJOR(x)		((x) >> 28)
> +#define DRM_PANCSF_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> +#define DRM_PANCSF_ARCH_REV(x)			(((x) >> 20) & 0xf)
> +#define DRM_PANCSF_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> +#define DRM_PANCSF_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> +#define DRM_PANCSF_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> +#define DRM_PANCSF_VERSION_STATUS(x)		((x) & 0xf)
> +	__u32 gpu_id;
> +	__u32 gpu_rev;
> +#define DRM_PANCSF_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> +#define DRM_PANCSF_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> +#define DRM_PANCSF_CSHW_REV(x)			(((x) >> 16) & 0xf)
> +#define DRM_PANCSF_MCU_MAJOR(x)			(((x) >> 10) & 0x3f)
> +#define DRM_PANCSF_MCU_MINOR(x)			(((x) >> 4) & 0x3f)
> +#define DRM_PANCSF_MCU_REV(x)			((x) & 0xf)
> +	__u32 csf_id;
> +	__u32 l2_features;
> +	__u32 tiler_features;
> +	__u32 mem_features;
> +	__u32 mmu_features;
> +	__u32 thread_features;
> +	__u32 max_threads;
> +	__u32 thread_max_workgroup_size;
> +	__u32 thread_max_barrier_size;
> +	__u32 coherency_features;
> +	__u32 texture_features[4];
> +	__u32 as_present;
> +	__u32 core_group_count;
> +	__u64 shader_present;
> +	__u64 l2_present;
> +	__u64 tiler_present;
> +};
> +
> +struct drm_pancsf_csif_info {
> +	__u32 csg_slot_count;
> +	__u32 cs_slot_count;
> +	__u32 cs_reg_count;
> +	__u32 scoreboard_slot_count;
> +	__u32 unpreserved_cs_reg_count;
> +};
> +
> +struct drm_pancsf_dev_query {
> +	/** @type: the query type (see enum drm_pancsf_dev_query_type). */
> +	__u32 type;
> +
> +	/**
> +	 * @size: size of the type being queried.
> +	 *
> +	 * If pointer is NULL, size is updated by the driver to provide the
> +	 * output structure size. If pointer is not NULL, the driver will
> +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> +	 * and update the size accordingly. This allows us to extend query
> +	 * types without breaking userspace.
> +	 */
> +	__u32 size;
> +
> +	/**
> +	 * @pointer: user pointer to a query type struct.
> +	 *
> +	 * Pointer can be NULL, in which case, nothing is copied, but the
> +	 * actual structure size is returned. If not NULL, it must point to
> +	 * a location that's large enough to hold size bytes.
> +	 */
> +	__u64 pointer;
> +};

Genuine question: is there something wrong with the panfrost 'get_param'
ioctl where individual features are queried one-by-one, rather than
passing a big structure back to user space.

I ask because we've had issues in the past with trying to 'deprecate'
registers - if a new version of the hardware stops providing a
(meaningful) value for a register then it's hard to fix up the
structures. The get_param method means it's possible to return a failure
for just the register(s) that have disappeared.

There is obviously overhead iterating over all the registers that user
space cares about. Another option (used by kbase) is to return some form
of structured data so a missing property can be encoded.

> +
> +struct drm_pancsf_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID */
> +	__u32 id;
> +};
> +
> +struct drm_pancsf_vm_destroy {
> +	/** @id: ID of the VM to destroy */
> +	__u32 id;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +struct drm_pancsf_bo_create {
> +	/**
> +	 * @size: Requested size for the object
> +	 *
> +	 * The (page-aligned) allocated size for the object will be returned.
> +	 */
> +	__u64 size;
> +
> +	/**
> +	 * @flags: Flags, currently unused, MBZ.
> +	 */
> +	__u32 flags;
> +
> +	/**
> +	 * @vm_id: Attached VM, if any
> +	 *
> +	 * If a VM is specified, this BO must:
> +	 *
> +	 *  1. Only ever be bound to that VM.
> +	 *
> +	 *  2. Cannot be exported as a PRIME fd.
> +	 */
> +	__u32 vm_id;

Unless I'm mistaken this (vm_id) isn't used (or even checked) by the
current code.

> +
> +	/**
> +	 * @handle: Returned handle for the object.
> +	 *
> +	 * Object handles are nonzero.
> +	 */
> +	__u32 handle;
> +
> +	/* @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +struct drm_pancsf_bo_mmap_offset {
> +	/** @handle: Handle for the object being mapped. */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;

pancsf_ioctl_bo_mmap_offset is missing validation for this pad field.
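
i.e. the usual

	if (args->pad)
		return -EINVAL;

check. The same applies to the other pad/flags fields flagged below.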

> +
> +	/** @offset: The fake offset to use for subsequent mmap call */
> +	__u64 offset;
> +};
> +
> +#define PANCSF_VMA_MAP_READONLY		0x1
> +#define PANCSF_VMA_MAP_NOEXEC		0x2
> +#define PANCSF_VMA_MAP_UNCACHED		0x4
> +#define PANCSF_VMA_MAP_FRAG_SHADER	0x8
> +#define PANCSF_VMA_MAP_ON_FAULT		0x10
> +#define PANCSF_VMA_MAP_AUTO_VA		0x20
> +
> +struct drm_pancsf_vm_map {
> +	/** @vm_id: VM to map BO range to */
> +	__u32 vm_id;
> +
> +	/** @flags: Combination of PANCSF_VMA_MAP_ flags */
> +	__u32 flags;
> +
> +	/** @pad: padding field, MBZ. */
> +	__u32 pad;

Missing validation in pancsf_ioctl_vm_map.

> +
> +	/** @bo_handle: Buffer object to map. */
> +	__u32 bo_handle;
> +
> +	/** @bo_offset: Buffer object offset. */
> +	__u64 bo_offset;
> +
> +	/**
> +	 * @va: Virtual address to map the BO to. Mapping address returned here if
> +	 *	PANCSF_VMA_MAP_ON_FAULT is set.
> +	 */
> +	__u64 va;
> +
> +	/** @size: Size to map. */
> +	__u64 size;
> +};
> +
> +struct drm_pancsf_vm_unmap {
> +	/** @vm_id: VM to unmap the BO range from. */
> +	__u32 vm_id;
> +
> +	/** @flags: MBZ. */
> +	__u32 flags;

Validation missing

> +
> +	/** @va: Virtual address to unmap. */
> +	__u64 va;
> +
> +	/** @size: Size to unmap. */
> +	__u64 size;
> +};
> +
> +enum drm_pancsf_sync_op_type {
> +	DRM_PANCSF_SYNC_OP_WAIT = 0,
> +	DRM_PANCSF_SYNC_OP_SIGNAL,
> +};
> +
> +enum drm_pancsf_sync_handle_type {
> +	DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ = 0,
> +	DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ,
> +};
> +
> +struct drm_pancsf_sync_op {
> +	/** @op_type: Sync operation type. */
> +	__u32 op_type;
> +
> +	/** @handle_type: Sync handle type. */
> +	__u32 handle_type;
> +
> +	/** @handle: Sync handle. */
> +	__u32 handle;
> +
> +	/** @flags: MBZ. */
> +	__u32 flags;

I don't see any proper validation rejecting invalid
op_type/handle_type/flags values.

> +
> +	/** @timeline_value: MBZ if handle_type != DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ. */
> +	__u64 timeline_value;
> +};
> +
> +struct drm_pancsf_obj_array {
> +	/** @stride: Stride of object struct. Used for versioning. */
> +	__u32 stride;
> +
> +	/** @count: Number of objects in the array. */
> +	__u32 count;
> +
> +	/** @array: User pointer to an array of objects. */
> +	__u64 array;
> +};
> +
> +#define DRM_PANCSF_OBJ_ARRAY(cnt, ptr) \
> +	{ .stride = sizeof(ptr[0]), .count = cnt, .array = (__u64)(uintptr_t)ptr }
> +
> +struct drm_pancsf_queue_submit {
> +	/** @queue_index: Index of the queue inside a group. */
> +	__u32 queue_index;
> +
> +	/** @stream_size: Size of the command stream to execute. */
> +	__u32 stream_size;
> +
> +	/** @stream_addr: GPU address of the command stream to execute. */
> +	__u64 stream_addr;
> +
> +	/** @syncs: Array of sync operations. */
> +	struct drm_pancsf_obj_array syncs;
> +};
> +
> +struct drm_pancsf_group_submit {
> +	/** @group_handle: Handle of the group to queue jobs to. */
> +	__u32 group_handle;
> +
> +	/** @syncs: Array of queue submit operations. */
> +	struct drm_pancsf_obj_array queue_submits;
> +};
> +
> +struct drm_pancsf_queue_create {
> +	/**
> +	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
> +	 *	      15 being the highest priority.
> +	 */
> +	__u8 priority;
> +
> +	/** @pad: Padding fields, MBZ. */
> +	__u8 pad[3];
> +
> +	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */

(Must be PAGE_SIZE aligned and at most 64k)

> +	__u32 ringbuf_size;
> +};
> +
> +enum drm_pancsf_group_priority {
> +	PANCSF_GROUP_PRIORITY_LOW = 0,
> +	PANCSF_GROUP_PRIORITY_MEDIUM,
> +	PANCSF_GROUP_PRIORITY_HIGH,
> +};
> +
> +struct drm_pancsf_group_create {
> +	/** @queues: Array of drm_pancsf_create_cs_queue elements. */

NIT: s/drm_pancsf_create_cs_queue/drm_pancsf_queue_create/

> +	struct drm_pancsf_obj_array queues;
> +
> +	/**
> +	 * @max_compute_cores: Maximum number of cores that can be
> +	 *		       used by compute jobs across CS queues
> +	 *		       bound to this group.
> +	 */
> +	__u8 max_compute_cores;
> +
> +	/**
> +	 * @max_fragment_cores: Maximum number of cores that can be
> +	 *			used by fragment jobs across CS queues
> +	 *			bound to this group.
> +	 */
> +	__u8 max_fragment_cores;
> +
> +	/**
> +	 * @max_tiler_cores: Maximum number of tilers that can be
> +	 *		     used by tiler jobs across CS queues
> +	 *		     bound to this group.
> +	 */
> +	__u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (see enum drm_pancsf_group_priority). */
> +	__u8 priority;
> +
> +	/** @compute_core_mask: Mask encoding cores that can be used for compute jobs. */
> +	__u64 compute_core_mask;
> +
> +	/** @fragment_core_mask: Mask encoding cores that can be used for fragment jobs. */
> +	__u64 fragment_core_mask;
> +
> +	/** @tiler_core_mask: Mask encoding cores that can be used for tiler jobs. */
> +	__u64 tiler_core_mask;
> +
> +	/**
> +	 * @vm_id: VM ID to bind this group to. All submission to queues bound to this
> +	 *	   group will use this VM.
> +	 */
> +	__u32 vm_id;
> +
> +	/*
> +	 * @group_handle: Returned group handle. Passed back when submitting jobs or
> +	 *		  destroying a group.
> +	 */
> +	__u32 group_handle;
> +};
> +
> +struct drm_pancsf_group_destroy {
> +	/** @group_handle: Group to destroy */
> +	__u32 group_handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;

No validation for padding

> +};
> +
> +struct drm_pancsf_tiler_heap_create {
> +	/** @vm_id: VM ID the tiler heap should be mapped to */
> +	__u32 vm_id;
> +
> +	/** @initial_chunk_count: Initial number of chunks to allocate. */
> +	__u32 initial_chunk_count;
> +
> +	/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
> +	__u32 chunk_size;
> +
> +	/* @max_chunks: Maximum number of chunks that can be allocated. */
> +	__u32 max_chunks;
> +
> +	/** @target_in_flight: Maximum number of in-flight render passes.
> +	 * If exceeded the FW will wait for render passes to finish before
> +	 * queuing new tiler jobs.
> +	 */
> +	__u32 target_in_flight;
> +
> +	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
> +	__u32 handle;
> +
> +	/** @tiler_heap_ctx_gpu_va: Returned tiler heap context GPU virtual address. */
> +	__u64 tiler_heap_ctx_gpu_va;
> +	__u64 first_heap_chunk_gpu_va;
> +};
> +
> +struct drm_pancsf_tiler_heap_destroy {
> +	/** @handle: Handle of the tiler heap to destroy */
> +	__u32 handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;

Validation

> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif /* _PANCSF_DRM_H_ */


Thanks once again for this. I hope to be able to give it a spin soon!

Steve
Alyssa Rosenzweig Feb. 3, 2023, 4:29 p.m. UTC | #2
> > Mali v10 (the second Valhall iteration) and later GPUs replaced the Job
> > Manager block with a command stream based interface called CSF (for
> > Command Stream Frontend). This interface not only turns the job chain
> > based submission model into a command stream based one, but also
> > introduces FW-assisted scheduling of command stream queues. This is a
> > fundamental shift in both how userspace is supposed to submit jobs and
> > how the driver is architected. We initially tried to retrofit the CSF
> > model into panfrost, but this ended up introducing unneeded complexity
> > to the existing driver, which we all know is a potential source of
> > regressions.
> 
> While I agree there's some big differences which effectively mandate
> splitting the driver I do think there are some parts which make a lot of
> sense to share.
> 
> For example pancsf_regs.h and panfrost_regs.h are really quite similar
> and I think could easily be combined. The clock/regulator code is pretty
> much a direct copy/paste (just adding support for more clocks), etc.
> 
> What would be ideal is factoring out 'generic' parts from panfrost and
> then being able to use them from pancsf.
> 
> I had a go at starting that:
> 
> https://gitlab.arm.com/linux-arm/linux-sp/-/tree/pancsf-refactor
> 
> (lightly tested for Panfrost, only build tested for pancsf).
> 
> That saves around 200 lines overall and avoids needing to maintain two
> lots of clock/regulator code. There's definite scope for sharing (most)
> register definitions between the drivers and quite possibly some of the
> MMU/memory code (although there's diminishing returns there).

200 lines saved in a 5kloc+ driver doesn't seem worth much, especially
against the added testing combinatorics, TBH. The main reason I can see
to unify is if we want VM_BIND (and related goodies) on JM hardware too.
That's only really for Vulkan and I really don't see the case for Vulkan
on anything older than Valhall at this point. So it comes down to
whether we want to start Vulkan at v9 or skip to v10. The separate
panfrost/pancsf drivers approach strongly favours the latter.
Steven Price Feb. 3, 2023, 4:43 p.m. UTC | #3
On 03/02/2023 16:29, Alyssa Rosenzweig wrote:
>>> Mali v10 (the second Valhall iteration) and later GPUs replaced the Job
>>> Manager block with a command stream based interface called CSF (for
>>> Command Stream Frontend). This interface not only turns the job chain
>>> based submission model into a command stream based one, but also
>>> introduces FW-assisted scheduling of command stream queues. This is a
>>> fundamental shift in both how userspace is supposed to submit jobs and
>>> how the driver is architected. We initially tried to retrofit the CSF
>>> model into panfrost, but this ended up introducing unneeded complexity
>>> to the existing driver, which we all know is a potential source of
>>> regressions.
>>
>> While I agree there's some big differences which effectively mandate
>> splitting the driver I do think there are some parts which make a lot of
>> sense to share.
>>
>> For example pancsf_regs.h and panfrost_regs.h are really quite similar
>> and I think could easily be combined. The clock/regulator code is pretty
>> much a direct copy/paste (just adding support for more clocks), etc.
>>
>> What would be ideal is factoring out 'generic' parts from panfrost and
>> then being able to use them from pancsf.
>>
>> I had a go at starting that:
>>
>> https://gitlab.arm.com/linux-arm/linux-sp/-/tree/pancsf-refactor
>>
>> (lightly tested for Panfrost, only build tested for pancsf).
>>
>> That saves around 200 lines overall and avoids needing to maintain two
>> lots of clock/regulator code. There's definite scope for sharing (most)
>> register definitions between the drivers and quite possibly some of the
>> MMU/memory code (although there's diminishing returns there).
> 
> 200 lines saved in a 5kloc+ driver doesn't seem worth much, especially
> against the added testing combinatorics, TBH. The main reason I can see
> to unify is if we want VM_BIND (and related goodies) on JM hardware too.
> That's only really for Vulkan and I really don't see the case for Vulkan
> on anything older than Valhall at this point. So it comes down to
> whether we want to start Vulkan at v9 or skip to v10. The separate
> panfrost/pancsf drivers approach strongly favours the latter.

While I agree 200 lines isn't much in the grand scheme what I really
don't want is to have to maintain two (almost) identical versions of the
same code. I agree with the concept entirely of having a separate .ko
and not trying to keep it all "one driver". Just, for the bits where
it's clearly copy/pasted from the existing Panfrost, let's move that code
into a common file and build it into both drivers.

Ultimately the 200 line saving was just a couple of hours this morning -
indeed I was 'reviewing' by comparing against Panfrost and thinking "if
it works in Panfrost it must be correct" - the review would be even
easier if it wasn't new code ;)

And as far as I'm aware the changes I'm proposing don't make any
difference to testing - I'm not sure I understand that statement.

The MMU/memory code I'm undecided on. There's clearly copied code there
but quite a few differences. If we can unify and get extra goodies for
Panfrost then it's worth doing, if the unification is going to be hard
or risk regressions then perhaps not - especially if Mesa isn't going to
get the features (which depends whether anyone working on Mesa wants to
work on Vulkan for Bifrost).

Anyway, just to be clear I don't want to stand in the way of getting
this merged. If necessary the refactor can be done on top afterwards
(indeed that's what I've got in my repo).

Thanks,

Steve
Boris Brezillon Feb. 3, 2023, 5:29 p.m. UTC | #4
Hi Steven,

On Fri, 3 Feb 2023 15:41:38 +0000
Steven Price <steven.price@arm.com> wrote:

> Hi Boris,
> 
> Thanks for the post - it's great to see the progress!

Thanks for chiming in!

> 
> On 01/02/2023 08:48, Boris Brezillon wrote:
> > Mali v10 (the second Valhall iteration) and later GPUs replaced the Job
> > Manager block with a command stream based interface called CSF (for
> > Command Stream Frontend). This interface not only turns the job chain
> > based submission model into a command stream based one, but also
> > introduces FW-assisted scheduling of command stream queues. This is a
> > fundamental shift in both how userspace is supposed to submit jobs and
> > how the driver is architected. We initially tried to retrofit the CSF
> > model into panfrost, but this ended up introducing unneeded complexity
> > to the existing driver, which we all know is a potential source of
> > regressions.
> 
> While I agree there's some big differences which effectively mandate
> splitting the driver I do think there are some parts which make a lot of
> sense to share.
> 
> For example pancsf_regs.h and panfrost_regs.h are really quite similar
> and I think could easily be combined.

For registers, I'm not so sure. I mean, yes, most of them are identical,
but some disappeared, while others were kept but with a different
layout (see GPU_CMD), or bits re-purposed for different meaning
(MMU_INT_STAT where BUS_FAULT just became OPERATION_COMPLETE). This
makes the whole thing very confusing, so I'd rather keep those definitions
separate for my own sanity (spent a bit of time trying to understand
why my GPU command was doing nothing or why I was receiving BUS_FAULT
interrupts, before realizing I was referring to the old layout) :-).

> The clock/regulator code is pretty
> much a direct copy/paste (just adding support for more clocks), etc.

Clock and regulators, maybe, but there's not much to be shared here. I
mean, Linux already provides quite a few helpers making the
clk/regulator retrieval/enabling/disabling pretty straightforward. Not
to mention that, by keeping them separate, we don't really need to deal
with old Mali HW quirks, and we can focus on new HW bugs instead :-).
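
With the devm helpers, the whole dance pretty much boils down to
something like this (field names made up):

	pfdev->clk = devm_clk_get_enabled(dev, NULL);
	if (IS_ERR(pfdev->clk))
		return PTR_ERR(pfdev->clk);

	ret = devm_regulator_get_enable(dev, "mali");
	if (ret)
		return ret;

modulo the exact clock/regulator names and the extra clocks, so there's
not that much left to factor out.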

> 
> What would be ideal is factoring out 'generic' parts from panfrost and
> then being able to use them from pancsf.

I've been refactoring this pancsf driver a few times already, and I must
admit I'd prefer to keep things separate, at least for the initial
submission. If we see things that should be shared, then we can do that
in a follow-up series, but I think it's a bit premature to do it now.

> 
> I had a go at starting that:
> 
> https://gitlab.arm.com/linux-arm/linux-sp/-/tree/pancsf-refactor
> 
> (lightly tested for Panfrost, only build tested for pancsf).

Thanks, I'll have a look.

> 
> That saves around 200 lines overall and avoids needing to maintain two
> lots of clock/regulator code. There's definite scope for sharing (most)
> register definitions between the drivers and quite possibly some of the
> MMU/memory code (although there's diminishing returns there).

Yeah, actually the MMU code is likely to diverge even more if we want
to support VM_BIND (it requires pre-allocating pages for the page-table
updates, so the map/unmap operations can't fail in the run_job path),
so I'm not sure it's a good idea to share that bit, at least not until
we have a clearer idea of how we want things done.

> 
> > So here comes a brand new driver for CSF-based hardware. This is a
> > preliminary version and some important features are missing (like
> > devfreq, PM support and a memory shrinker implementation, to name a
> > few). The goal of
> > this RFC is to gather some preliminary feedback on both the uAPI and some
> > basic building blocks, like the MMU/VM code, the tiler heap allocation
> > logic...  
> 
> At the moment I don't have any CSF hardware available, so this review is
> a pure code review.

That's still very useful.

> I'll try to organise some hardware and do some
> testing, but it's probably going to take a while to arrive and get setup.

I'm actively working on the driver, and fixing things as I go, so let
me know when you're ready to test and I'll point you to the latest
version.

> > +#define DRM_PANCSF_SYNC_OP_MIN_SIZE		24
> > +#define DRM_PANCSF_QUEUE_SUBMIT_MIN_SIZE	32
> > +#define DRM_PANCSF_QUEUE_CREATE_MIN_SIZE	8
> > +#define DRM_PANCSF_VM_BIND_OP_MIN_SIZE		48  
> 
> I'm not sure why these are #defines rather than using sizeof()?

Yeah, I don't really like that. Those min sizes are here to deal with
potential new versions of the various objects passed as arrays
referenced from the main ioctl struct, so we can't really use
sizeof(struct drm_pancsf_xxx), because the size will change when those
structs are extended.

I'm considering using some macro trickery based on
offsetof(struct drm_pancsf_xxx, last_mandatory_field) +
sizeof(drm_pancsf_xxx.last_mandatory_field) to get
the minimum struct size instead of those static defines. On the other
hand, this should never change, so it's only a pain point while the
initial structs are under discussion.
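
Something like this, for instance (offsetofend() from <linux/stddef.h>
already encodes the offsetof() + sizeof() part):

#define DRM_PANCSF_QUEUE_CREATE_MIN_SIZE \
	offsetofend(struct drm_pancsf_queue_create, ringbuf_size)

#define DRM_PANCSF_SYNC_OP_MIN_SIZE \
	offsetofend(struct drm_pancsf_sync_op, timeline_value)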

> 
> > +static int pancsf_ioctl_dev_query(struct drm_device *ddev, void *data, struct drm_file *file)
> > +{
> > +	struct drm_pancsf_dev_query *args = data;
> > +	struct pancsf_device *pfdev = ddev->dev_private;
> > +	const void *src;
> > +	size_t src_size;
> > +
> > +	switch (args->type) {
> > +	case DRM_PANCSF_DEV_QUERY_GPU_INFO:
> > +		src_size = sizeof(pfdev->gpu_info);
> > +		src = &pfdev->gpu_info;
> > +		break;
> > +	case DRM_PANCSF_DEV_QUERY_CSIF_INFO:
> > +		src_size = sizeof(pfdev->csif_info);
> > +		src = &pfdev->csif_info;
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!args->pointer) {
> > +		args->size = src_size;
> > +		return 0;
> > +	}
> > +
> > +	args->size = min_t(unsigned long, src_size, args->size);
> > +	if (copy_to_user((void __user *)(uintptr_t)args->pointer, src, args->size))  
> 
> NIT: use u64_to_user_ptr()

Sure, will fix that.

> 
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> > +

> > +static void *pancsf_get_obj_array(struct drm_pancsf_obj_array *in, u32 min_stride)
> > +{
> > +	u32 elem_size = min_t(u32, in->stride, min_stride);
> > +	int ret = 0;
> > +	void *out;
> > +
> > +	if (in->stride < min_stride)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	out = kvmalloc_array(in->count, elem_size, GFP_KERNEL);
> > +	if (!out)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	if (elem_size == in->stride) {
> > +		if (copy_from_user(out, u64_to_user_ptr(in->array), elem_size * in->count))
> > +			ret = -EFAULT;
> > +	} else {
> > +		void __user *in_ptr = u64_to_user_ptr(in->array);
> > +		void *out_ptr = out;
> > +		u32 i;
> > +
> > +		for (i = 0; i < in->count; i++) {
> > +			if (copy_from_user(out_ptr, in_ptr, elem_size)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}  
> 
> Missing the increment of out_ptr and in_ptr.

Actually, it's just one of the problem here. There's a confusion
between min_stride and actual object size, which I didn't notice
because they match right now, but as soon as we extend the base
structs it will break. The following version should fix that:

static void *pancsf_get_obj_array(struct drm_pancsf_obj_array *in, u32 min_stride, u32 obj_size)
{
        u32 cpy_elem_size = min_t(u32, in->stride, obj_size);
        int ret = 0;
        void *out;

        if (in->stride < min_stride)
                return ERR_PTR(-EINVAL);

        out = kvmalloc_array(in->count, obj_size, GFP_KERNEL | __GFP_ZERO);
        if (!out)
                return ERR_PTR(-ENOMEM);

        if (obj_size == in->stride) {
                if (copy_from_user(out, u64_to_user_ptr(in->array), obj_size * in->count))
                        ret = -EFAULT;
        } else {
                void __user *in_ptr = u64_to_user_ptr(in->array);
                void *out_ptr = out;
                u32 i;

                for (i = 0; i < in->count; i++) {
                        if (copy_from_user(out_ptr, in_ptr, cpy_elem_size)) {
                                ret = -EFAULT;
                                break;
                        }

                        out_ptr += obj_size;
                        in_ptr += in->stride;
                }
        }

        if (ret) {
                kvfree(out);
                return ERR_PTR(ret);
        }

        return out;
}

> 
> > +		}
> > +	}
> > +
> > +	if (ret) {
> > +		kvfree(out);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	return out;
> > +}



> > +int
> > +pancsf_collect_sync_signal_array(struct drm_file *file,
> > +				 struct drm_pancsf_sync_op *sync_ops, u32 sync_op_count,
> > +				 struct pancsf_sync_signal_array *array)
> > +{
> > +	u32 count = 0, i;
> > +	int ret;
> > +
> > +	for (i = 0; i < sync_op_count; i++) {
> > +		if (sync_ops[i].op_type == DRM_PANCSF_SYNC_OP_SIGNAL)
> > +			count++;
> > +	}
> > +
> > +	array->signals = kvmalloc_array(count, sizeof(*array->signals), GFP_KERNEL | __GFP_ZERO);
> > +	if (!array->signals)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < sync_op_count; i++) {
> > +		int ret;
> > +
> > +		if (sync_ops[i].op_type != DRM_PANCSF_SYNC_OP_SIGNAL)
> > +			continue;
> > +
> > +		switch (sync_ops[i].handle_type) {
> > +		case DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ:
> > +			array->signals[array->count].chain = dma_fence_chain_alloc();
> > +			if (!array->signals[array->count].chain) {
> > +				ret = -ENOMEM;
> > +				goto err;
> > +			}
> > +
> > +			array->signals[array->count].point = sync_ops[i].timeline_value;
> > +			fallthrough;
> > +
> > +		case DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ:
> > +			array->signals[array->count].syncobj = drm_syncobj_find(file, sync_ops[i].handle);
> > +			if (!array->signals[array->count].syncobj) {
> > +				ret = -EINVAL;
> > +				goto err;
> > +			}
> > +
> > +			array->count++;  
> 
> NIT: it's not obvious in this function that array->count is set to 0
> initially. It looks like it's correct because
> pancsf_ioctl_group_submit() does an alloc with __GFP_ZERO, but it would
> be much clearer with a local variable and assigning array->count at the end.

Definitely.

> 
> > +			break;
> > +
> > +		default:
> > +			ret = -EINVAL;
> > +			goto err;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +
> > +err:
> > +	pancsf_free_sync_signal_array(array);
> > +	return ret;
> > +}  

> > +
> > +struct drm_pancsf_gpu_info {
> > +#define DRM_PANCSF_ARCH_MAJOR(x)		((x) >> 28)
> > +#define DRM_PANCSF_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> > +#define DRM_PANCSF_ARCH_REV(x)			(((x) >> 20) & 0xf)
> > +#define DRM_PANCSF_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> > +#define DRM_PANCSF_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> > +#define DRM_PANCSF_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> > +#define DRM_PANCSF_VERSION_STATUS(x)		((x) & 0xf)
> > +	__u32 gpu_id;
> > +	__u32 gpu_rev;
> > +#define DRM_PANCSF_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> > +#define DRM_PANCSF_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> > +#define DRM_PANCSF_CSHW_REV(x)			(((x) >> 16) & 0xf)
> > +#define DRM_PANCSF_MCU_MAJOR(x)			(((x) >> 10) & 0x3f)
> > +#define DRM_PANCSF_MCU_MINOR(x)			(((x) >> 4) & 0x3f)
> > +#define DRM_PANCSF_MCU_REV(x)			((x) & 0xf)
> > +	__u32 csf_id;
> > +	__u32 l2_features;
> > +	__u32 tiler_features;
> > +	__u32 mem_features;
> > +	__u32 mmu_features;
> > +	__u32 thread_features;
> > +	__u32 max_threads;
> > +	__u32 thread_max_workgroup_size;
> > +	__u32 thread_max_barrier_size;
> > +	__u32 coherency_features;
> > +	__u32 texture_features[4];
> > +	__u32 as_present;
> > +	__u32 core_group_count;
> > +	__u64 shader_present;
> > +	__u64 l2_present;
> > +	__u64 tiler_present;
> > +};
> > +
> > +struct drm_pancsf_csif_info {
> > +	__u32 csg_slot_count;
> > +	__u32 cs_slot_count;
> > +	__u32 cs_reg_count;
> > +	__u32 scoreboard_slot_count;
> > +	__u32 unpreserved_cs_reg_count;
> > +};
> > +
> > +struct drm_pancsf_dev_query {
> > +	/** @type: the query type (see enum drm_pancsf_dev_query_type). */
> > +	__u32 type;
> > +
> > +	/**
> > +	 * @size: size of the type being queried.
> > +	 *
> > +	 * If pointer is NULL, size is updated by the driver to provide the
> > +	 * output structure size. If pointer is not NULL, the driver will
> > +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> > +	 * and update the size accordingly. This allows us to extend query
> > +	 * types without breaking userspace.
> > +	 */
> > +	__u32 size;
> > +
> > +	/**
> > +	 * @pointer: user pointer to a query type struct.
> > +	 *
> > +	 * Pointer can be NULL, in which case, nothing is copied, but the
> > +	 * actual structure size is returned. If not NULL, it must point to
> > +	 * a location that's large enough to hold size bytes.
> > +	 */
> > +	__u64 pointer;
> > +};  
> 
> Genuine question: is there something wrong with the panfrost 'get_param'
> ioctl where individual features are queried one-by-one, rather than
> passing a big structure back to user space.

Well, I've just seen the Xe driver exposing things this way, and I thought
it was a good idea, but I don't have a strong opinion here, and if others
think it's preferable to stick to GET_PARAM, I'm fine with that too.
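
FWIW, the expected usage on the userspace side is the usual two-step
query (sketch using libdrm's drmIoctl(), not actual Mesa code):

	struct drm_pancsf_dev_query q = {
		.type = DRM_PANCSF_DEV_QUERY_GPU_INFO,
	};
	struct drm_pancsf_gpu_info *info;

	/* First call: pointer is NULL, the kernel just reports the size. */
	if (drmIoctl(fd, DRM_IOCTL_PANCSF_DEV_QUERY, &q))
		return -errno;

	info = calloc(1, q.size);
	if (!info)
		return -ENOMEM;

	/* Second call: the kernel copies min(q.size, actual size) bytes. */
	q.pointer = (uintptr_t)info;
	if (drmIoctl(fd, DRM_IOCTL_PANCSF_DEV_QUERY, &q))
		return -errno;

so old userspace keeps working when the structs grow, and new userspace
can detect it's running on an older kernel from the size it gets back.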

> 
> I ask because we've had issues in the past with trying to 'deprecate'
> registers - if a new version of the hardware stops providing a
> (meaningful) value for a register then it's hard to fix up the
> structures. The get_param method means it's possible to return a failure
> for just the register(s) that have disappeared.

There's still the option to fork the struct and add a new type if
things differ too much, or just extend the existing info struct, with
new flags to invalidate previous fields and new fields for new stuff.

> 
> There is obviously overhead iterating over all the registers that user
> space cares about. Another option (used by kbase) is to return some form
> of structured data so a missing property can be encoded.

I'll have a look at how kbase does that. Thanks for the pointer.

> > +struct drm_pancsf_bo_create {
> > +	/**
> > +	 * @size: Requested size for the object
> > +	 *
> > +	 * The (page-aligned) allocated size for the object will be returned.
> > +	 */
> > +	__u64 size;
> > +
> > +	/**
> > +	 * @flags: Flags, currently unused, MBZ.
> > +	 */
> > +	__u32 flags;
> > +
> > +	/**
> > +	 * @vm_id: Attached VM, if any
> > +	 *
> > +	 * If a VM is specified, this BO must:
> > +	 *
> > +	 *  1. Only ever be bound to that VM.
> > +	 *
> > +	 *  2. Cannot be exported as a PRIME fd.
> > +	 */
> > +	__u32 vm_id;  
> 
> Unless I'm mistaken this (vm_id) isn't used (or even checked) by the
> current code.

Yeah, it will be used once we hook up the VM resv object and make all
private BOs point to this resv (instead of their own). Didn't really
see a need for that yet, since we don't have a shrinker reclaiming BOs
behind our back, and the new uAPI is all about letting the userspace driver
deal with object lifetime and passing explicit syncs. I might have missed
a use case where this is required though...

> 
> > +
> > +	/**
> > +	 * @handle: Returned handle for the object.
> > +	 *
> > +	 * Object handles are nonzero.
> > +	 */
> > +	__u32 handle;
> > +
> > +	/* @pad: MBZ. */
> > +	__u32 pad;
> > +};
> > +
> > +struct drm_pancsf_bo_mmap_offset {
> > +	/** @handle: Handle for the object being mapped. */
> > +	__u32 handle;
> > +
> > +	/** @pad: MBZ. */
> > +	__u32 pad;  
> 
> pancsf_ioctl_bo_mmap_offset is missing validation for this pad field.

I suspect most of them do, because I've added them progressively
along with other fields to keep things 64-bit aligned. I'll do a pass
and add missing checks before submitting v2.

> > +
> > +	/** @timeline_value: MBZ if handle_type != DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ. */
> > +	__u64 timeline_value;
> > +};
> > +
> > +struct drm_pancsf_obj_array {
> > +	/** @stride: Stride of object struct. Used for versioning. */
> > +	__u32 stride;
> > +
> > +	/** @count: Number of objects in the array. */
> > +	__u32 count;
> > +
> > +	/** @array: User pointer to an array of objects. */
> > +	__u64 array;
> > +};
> > +
> > +#define DRM_PANCSF_OBJ_ARRAY(cnt, ptr) \
> > +	{ .stride = sizeof(ptr[0]), .count = cnt, .array = (__u64)(uintptr_t)ptr }
> > +
> > +struct drm_pancsf_queue_submit {
> > +	/** @queue_index: Index of the queue inside a group. */
> > +	__u32 queue_index;
> > +
> > +	/** @stream_size: Size of the command stream to execute. */
> > +	__u32 stream_size;
> > +
> > +	/** @stream_addr: GPU address of the command stream to execute. */
> > +	__u64 stream_addr;
> > +
> > +	/** @syncs: Array of sync operations. */
> > +	struct drm_pancsf_obj_array syncs;
> > +};
> > +
> > +struct drm_pancsf_group_submit {
> > +	/** @group_handle: Handle of the group to queue jobs to. */
> > +	__u32 group_handle;
> > +
> > +	/** @syncs: Array of queue submit operations. */
> > +	struct drm_pancsf_obj_array queue_submits;
> > +};
> > +
> > +struct drm_pancsf_queue_create {
> > +	/**
> > +	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
> > +	 *	      15 being the highest priority.
> > +	 */
> > +	__u8 priority;
> > +
> > +	/** @pad: Padding fields, MBZ. */
> > +	__u8 pad[3];
> > +
> > +	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */  
> 
> (Must be PAGE_SIZE aligned and at most 64k)

I'll add it to the doc. Thanks 

> > +
> > +#if defined(__cplusplus)
> > +}
> > +#endif
> > +
> > +#endif /* _PANCSF_DRM_H_ */  
> 
> 
> Thanks once again for this. I hope to be able to give it a spin soon!

And thanks for your preliminary review. That's really appreciated.

Boris
Alyssa Rosenzweig Feb. 3, 2023, 5:58 p.m. UTC | #5
> > > +struct drm_pancsf_gpu_info {
> > > +#define DRM_PANCSF_ARCH_MAJOR(x)		((x) >> 28)
> > > +#define DRM_PANCSF_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> > > +#define DRM_PANCSF_ARCH_REV(x)			(((x) >> 20) & 0xf)
> > > +#define DRM_PANCSF_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> > > +#define DRM_PANCSF_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> > > +#define DRM_PANCSF_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> > > +#define DRM_PANCSF_VERSION_STATUS(x)		((x) & 0xf)
> > > +	__u32 gpu_id;
> > > +	__u32 gpu_rev;
> > > +#define DRM_PANCSF_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> > > +#define DRM_PANCSF_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> > > +#define DRM_PANCSF_CSHW_REV(x)			(((x) >> 16) & 0xf)
> > > +#define DRM_PANCSF_MCU_MAJOR(x)			(((x) >> 10) & 0x3f)
> > > +#define DRM_PANCSF_MCU_MINOR(x)			(((x) >> 4) & 0x3f)
> > > +#define DRM_PANCSF_MCU_REV(x)			((x) & 0xf)
> > > +	__u32 csf_id;
> > > +	__u32 l2_features;
> > > +	__u32 tiler_features;
> > > +	__u32 mem_features;
> > > +	__u32 mmu_features;
> > > +	__u32 thread_features;
> > > +	__u32 max_threads;
> > > +	__u32 thread_max_workgroup_size;
> > > +	__u32 thread_max_barrier_size;
> > > +	__u32 coherency_features;
> > > +	__u32 texture_features[4];
> > > +	__u32 as_present;
> > > +	__u32 core_group_count;
> > > +	__u64 shader_present;
> > > +	__u64 l2_present;
> > > +	__u64 tiler_present;
> > > +};
> > > +
> > > +struct drm_pancsf_csif_info {
> > > +	__u32 csg_slot_count;
> > > +	__u32 cs_slot_count;
> > > +	__u32 cs_reg_count;
> > > +	__u32 scoreboard_slot_count;
> > > +	__u32 unpreserved_cs_reg_count;
> > > +};
> > > +
> > > +struct drm_pancsf_dev_query {
> > > +	/** @type: the query type (see enum drm_pancsf_dev_query_type). */
> > > +	__u32 type;
> > > +
> > > +	/**
> > > +	 * @size: size of the type being queried.
> > > +	 *
> > > +	 * If pointer is NULL, size is updated by the driver to provide the
> > > +	 * output structure size. If pointer is not NULL, the driver will
> > > +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> > > +	 * and update the size accordingly. This allows us to extend query
> > > +	 * types without breaking userspace.
> > > +	 */
> > > +	__u32 size;
> > > +
> > > +	/**
> > > +	 * @pointer: user pointer to a query type struct.
> > > +	 *
> > > +	 * Pointer can be NULL, in which case, nothing is copied, but the
> > > +	 * actual structure size is returned. If not NULL, it must point to
> > > +	 * a location that's large enough to hold size bytes.
> > > +	 */
> > > +	__u64 pointer;
> > > +};  
> > 
> > Genuine question: is there something wrong with the panfrost 'get_param'
> > ioctl where individual features are queried one-by-one, rather than
> > passing a big structure back to user space.
> 
> Well, I've just seen the Xe driver exposing things this way, and I thought
> it was a good idea, but I don't have a strong opinion here, and if others
> think it's preferable to stick to GET_PARAM, I'm fine with that too.

I vastly prefer the info struct, GET_PARAM isn't a great interface when
there are large numbers of properties to query... Actually I just
suggested to Lina that she adopt this approach for Asahi instead of the
current GET_PARAM ioctl we have (downstream for now).

It isn't a *big* deal but GET_PARAM doesn't really seem better on any
axes.

> > I ask because we've had issues in the past with trying to 'deprecate'
> > registers - if a new version of the hardware stops providing a
> > (meaningful) value for a register then it's hard to fix up the
> > structures.

I'm not sure this is a big deal. If the register no longer exists
(meaningfully), zero it out in the info structure and trust userspace to
interpret meaningfully based on the GPU. If registers are getting
dropped between revisions, that's obviously not great. But this should
only change at major architecture boundaries; I don't see the added
value of doing the interpretation in kernel instead of userspace. I say
this with my userspace hat on, of course ;-)

> > There is obviously overhead iterating over all the register that user
> > space cares about. Another option (used by kbase) is to return some form
> > of structured data so a missing property can be encoded.
> 
> I'll have a look at how kbase does that. Thanks for the pointer.

I'd be fine with the kbase approach but I don't really see the added
value over what Boris proposed in the RFC, tbh.
Steven Price Feb. 6, 2023, 9:37 a.m. UTC | #6
On 03/02/2023 17:58, Alyssa Rosenzweig wrote:
>>>> +struct drm_pancsf_gpu_info {
>>>> +#define DRM_PANCSF_ARCH_MAJOR(x)		((x) >> 28)
>>>> +#define DRM_PANCSF_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
>>>> +#define DRM_PANCSF_ARCH_REV(x)			(((x) >> 20) & 0xf)
>>>> +#define DRM_PANCSF_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
>>>> +#define DRM_PANCSF_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
>>>> +#define DRM_PANCSF_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
>>>> +#define DRM_PANCSF_VERSION_STATUS(x)		((x) & 0xf)
>>>> +	__u32 gpu_id;
>>>> +	__u32 gpu_rev;
>>>> +#define DRM_PANCSF_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
>>>> +#define DRM_PANCSF_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
>>>> +#define DRM_PANCSF_CSHW_REV(x)			(((x) >> 16) & 0xf)
>>>> +#define DRM_PANCSF_MCU_MAJOR(x)			(((x) >> 10) & 0x3f)
>>>> +#define DRM_PANCSF_MCU_MINOR(x)			(((x) >> 4) & 0x3f)
>>>> +#define DRM_PANCSF_MCU_REV(x)			((x) & 0xf)
>>>> +	__u32 csf_id;
>>>> +	__u32 l2_features;
>>>> +	__u32 tiler_features;
>>>> +	__u32 mem_features;
>>>> +	__u32 mmu_features;
>>>> +	__u32 thread_features;
>>>> +	__u32 max_threads;
>>>> +	__u32 thread_max_workgroup_size;
>>>> +	__u32 thread_max_barrier_size;
>>>> +	__u32 coherency_features;
>>>> +	__u32 texture_features[4];
>>>> +	__u32 as_present;
>>>> +	__u32 core_group_count;
>>>> +	__u64 shader_present;
>>>> +	__u64 l2_present;
>>>> +	__u64 tiler_present;
>>>> +};
>>>> +
>>>> +struct drm_pancsf_csif_info {
>>>> +	__u32 csg_slot_count;
>>>> +	__u32 cs_slot_count;
>>>> +	__u32 cs_reg_count;
>>>> +	__u32 scoreboard_slot_count;
>>>> +	__u32 unpreserved_cs_reg_count;
>>>> +};
>>>> +
>>>> +struct drm_pancsf_dev_query {
>>>> +	/** @type: the query type (see enum drm_pancsf_dev_query_type). */
>>>> +	__u32 type;
>>>> +
>>>> +	/**
>>>> +	 * @size: size of the type being queried.
>>>> +	 *
>>>> +	 * If pointer is NULL, size is updated by the driver to provide the
>>>> +	 * output structure size. If pointer is not NULL, the driver will
>>>> +	 * only copy min(size, actual_structure_size) bytes to the pointer,
>>>> +	 * and update the size accordingly. This allows us to extend query
>>>> +	 * types without breaking userspace.
>>>> +	 */
>>>> +	__u32 size;
>>>> +
>>>> +	/**
>>>> +	 * @pointer: user pointer to a query type struct.
>>>> +	 *
>>>> +	 * Pointer can be NULL, in which case, nothing is copied, but the
>>>> +	 * actual structure size is returned. If not NULL, it must point to
>>>> +	 * a location that's large enough to hold size bytes.
>>>> +	 */
>>>> +	__u64 pointer;
>>>> +};  
>>>
>>> Genuine question: is there something wrong with the panfrost 'get_param'
>>> ioctl where individual features are queried one-by-one, rather than
>>> passing a big structure back to user space.
>>
>> Well, I've just seen the Xe driver exposing things this way, and I thought
>> it was a good idea, but I don't have a strong opinion here, and if others
>> think it's preferable to stick to GET_PARAM, I'm fine with that too.
> 
> I vastly prefer the info struct, GET_PARAM isn't a great interface when
> there are large numbers of properties to query... Actually I just
> suggested to Lina that she adopt this approach for Asahi instead of the
> current GET_PARAM ioctl we have (downstream for now).

Ok, good to know there is some preference here - like I said this was a
genuine question: I'm not trying to say this is wrong.

> It isn't a *big* deal but GET_PARAM doesn't really seem better on any
> axes.
> 
>>> I ask because we've had issues in the past with trying to 'deprecate'
>>> registers - if a new version of the hardware stops providing a
>>> (meaningful) value for a register then it's hard to fix up the
>>> structures.
> 
> I'm not sure this is a big deal. If the register no longer exists
> (meaningfully), zero it out in the info structure and trust userspace to
> interpret meaningfully based on the GPU. If registers are getting
> dropped between revisions, that's obviously not great. But this should
> only change at major architecture boundaries; I don't see the added
> value of doing the interpretation in kernel instead of userspace. I say
> this with my userspace hat on, of course ;-)

Just some background:

In the early days of the Midgard DDK driver there was a structure much
like the one proposed here in which the kernel dumped the various
feature registers and passed it as a large struct to user space. User
space then ended up using that struct directly in various bits of code
all over the place.

We then ended up with the problem that it was easy to add new properties
to that struct (including derived and hideously badly defined ones like
"gpu_available_memory_size") but basically impossible to remove anything
since the struct definition was shared between user and kernel. The
kernel couldn't drop anything because old user space might need it, and
user space picked up the definition from the kernel so these problematic
members were always there to tempt user space coders.

By packing the data into a structured blob (see the sketch after this
list) it provided the ability to:

 * Provide a logical separation between user and kernel - each could
have their own structure and it was definitively not uABI.

 * The kernel could provide backwards compat values for properties we
wanted to kill and user space could simply ignore them. They wouldn't be
around to tempt future user space coders.

 * New properties could be added without breaking old user space, and
without forever bloating the structure if we wanted to back out the
change (at worst you just burn an ID value for the structured blob).

 * (In theory) it's possible for user space to identify that a property
isn't present (e.g. running on an old kernel) rather than trying to
populate every field with a dummy value. As far as I remember this never
actually happened... user space just made up values for what was missing ;)
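
To make that concrete - and this is purely illustrative, not the actual
kbase encoding, all names below are made up - I mean something like a
stream of (id, size, value) records, where a missing property is simply
a record that never gets emitted and unknown ids are skipped:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct prop_record {
	uint32_t id;	/* property identifier */
	uint32_t size;	/* size of the value bytes that follow */
	/* 'size' value bytes follow, padded to a 4-byte boundary */
};

static bool find_prop(const uint8_t *blob, size_t blob_size, uint32_t id,
		      const uint8_t **value, uint32_t *size)
{
	size_t off = 0;

	while (off + sizeof(struct prop_record) <= blob_size) {
		struct prop_record rec;

		memcpy(&rec, blob + off, sizeof(rec));
		off += sizeof(rec);
		if (rec.size > blob_size - off)
			return false;	/* malformed blob */

		if (rec.id == id) {
			*value = blob + off;
			*size = rec.size;
			return true;
		}

		off += ((size_t)rec.size + 3) & ~(size_t)3;
	}

	return false;	/* property absent (e.g. old kernel), caller picks a fallback */
}

That way dropping a property is just a matter of no longer emitting its
record, without any uABI struct layout to maintain forever.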

>>> There is obviously overhead iterating over all the register that user
>>> space cares about. Another option (used by kbase) is to return some form
>>> of structured data so a missing property can be encoded.
>>
>> I'll have a look at how kbase does that. Thanks for the pointer.
> 
> I'd be fine with the kbase approach but I don't really see the added
> value over what Boris proposed in the RFC, tbh.

My main concern is making that structure uABI. Which is because 6 years
ago I had to clean up the mess that we had in the DDK ;)

Although I'll admit that most of the problems we had with the DDK were
user space developers wanting information that the GPU driver shouldn't
have been providing (maximum amount of memory available, clock speeds
etc) or derived properties that user space could have calculated itself
(e.g. decoded GPU ID). I guess this is also one of the problems with
developing a driver in parallel with the hardware - things get added to
try the idea out, and not always reverted if it turns out badly (or even
cleaned up if it's a good idea).

Panfrost/PanCSF thankfully have much cleaner sets of properties exposed.

Steve
diff mbox series

Patch

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 315cbdf61979..29c18d9f2980 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -344,6 +344,8 @@  source "drivers/gpu/drm/vboxvideo/Kconfig"
 
 source "drivers/gpu/drm/lima/Kconfig"
 
+source "drivers/gpu/drm/pancsf/Kconfig"
+
 source "drivers/gpu/drm/panfrost/Kconfig"
 
 source "drivers/gpu/drm/aspeed/Kconfig"
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index cc637343d87b..582d2e1fe274 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -188,6 +188,7 @@  obj-$(CONFIG_DRM_TVE200) += tve200/
 obj-$(CONFIG_DRM_XEN) += xen/
 obj-$(CONFIG_DRM_VBOXVIDEO) += vboxvideo/
 obj-$(CONFIG_DRM_LIMA)  += lima/
+obj-$(CONFIG_DRM_PANCSF) += pancsf/
 obj-$(CONFIG_DRM_PANFROST) += panfrost/
 obj-$(CONFIG_DRM_ASPEED_GFX) += aspeed/
 obj-$(CONFIG_DRM_MCDE) += mcde/
diff --git a/drivers/gpu/drm/pancsf/Kconfig b/drivers/gpu/drm/pancsf/Kconfig
new file mode 100644
index 000000000000..b3937eb240fc
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/Kconfig
@@ -0,0 +1,15 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+config DRM_PANCSF
+	tristate "PanCSF (DRM support for ARM Mali CSF-based GPUs)"
+	depends on DRM
+	depends on ARM || ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)
+	depends on MMU
+	select DRM_SCHED
+	select IOMMU_SUPPORT
+	select IOMMU_IO_PGTABLE_LPAE
+	select DRM_GEM_SHMEM_HELPER
+	select PM_DEVFREQ
+	select DEVFREQ_GOV_SIMPLE_ONDEMAND
+	help
+	  DRM driver for ARM Mali CSF-based GPUs.
diff --git a/drivers/gpu/drm/pancsf/Makefile b/drivers/gpu/drm/pancsf/Makefile
new file mode 100644
index 000000000000..94a4f2d34331
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/Makefile
@@ -0,0 +1,14 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+pancsf-y := \
+	pancsf_device.o \
+	pancsf_drv.o \
+	pancsf_gem.o \
+	pancsf_gpu.o \
+	pancsf_heap.o \
+	pancsf_mcu.o \
+	pancsf_mmu.o \
+	pancsf_sched.o
+
+obj-$(CONFIG_DRM_PANCSF) += pancsf.o
diff --git a/drivers/gpu/drm/pancsf/pancsf_device.c b/drivers/gpu/drm/pancsf/pancsf_device.c
new file mode 100644
index 000000000000..685e816ae19f
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_device.c
@@ -0,0 +1,391 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/clk.h>
+#include <linux/reset.h>
+#include <linux/platform_device.h>
+#include <linux/pm_domain.h>
+#include <linux/regulator/consumer.h>
+
+#include "pancsf_sched.h"
+#include "pancsf_device.h"
+#include "pancsf_gpu.h"
+#include "pancsf_mmu.h"
+
+static int pancsf_reset_init(struct pancsf_device *pfdev)
+{
+	pfdev->rstc = devm_reset_control_array_get_optional_exclusive(pfdev->dev);
+	if (IS_ERR(pfdev->rstc)) {
+		dev_err(pfdev->dev, "get reset failed %ld\n", PTR_ERR(pfdev->rstc));
+		return PTR_ERR(pfdev->rstc);
+	}
+
+	return reset_control_deassert(pfdev->rstc);
+}
+
+static void pancsf_reset_fini(struct pancsf_device *pfdev)
+{
+	reset_control_assert(pfdev->rstc);
+}
+
+static int pancsf_clk_init(struct pancsf_device *pfdev)
+{
+	int err, i;
+	unsigned long rate;
+
+	pfdev->clock = devm_clk_get(pfdev->dev, NULL);
+	if (IS_ERR(pfdev->clock)) {
+		dev_err(pfdev->dev, "get clock failed %ld\n", PTR_ERR(pfdev->clock));
+		return PTR_ERR(pfdev->clock);
+	}
+
+	rate = clk_get_rate(pfdev->clock);
+	dev_info(pfdev->dev, "clock rate = %lu\n", rate);
+
+	err = clk_prepare_enable(pfdev->clock);
+	if (err)
+		return err;
+
+	pfdev->bus_clock = devm_clk_get_optional(pfdev->dev, "bus");
+	if (IS_ERR(pfdev->bus_clock)) {
+		dev_err(pfdev->dev, "get bus_clock failed %ld\n",
+			PTR_ERR(pfdev->bus_clock));
+		err = PTR_ERR(pfdev->bus_clock);
+		goto disable_main_clock;
+	}
+
+	if (pfdev->bus_clock) {
+		rate = clk_get_rate(pfdev->bus_clock);
+		dev_info(pfdev->dev, "bus_clock rate = %lu\n", rate);
+
+		err = clk_prepare_enable(pfdev->bus_clock);
+		if (err)
+			goto disable_main_clock;
+	}
+
+	if (pfdev->comp->num_clks) {
+		pfdev->platform_clocks = devm_kcalloc(pfdev->dev, pfdev->comp->num_clks,
+						      sizeof(*pfdev->platform_clocks),
+						      GFP_KERNEL);
+		if (!pfdev->platform_clocks) {
+			err = -ENOMEM;
+			goto disable_bus_clock;
+		}
+
+		for (i = 0; i < pfdev->comp->num_clks; i++)
+			pfdev->platform_clocks[i].id = pfdev->comp->clk_names[i];
+
+		err = devm_clk_bulk_get(pfdev->dev,
+					pfdev->comp->num_clks,
+					pfdev->platform_clocks);
+		if (err < 0) {
+			dev_err(pfdev->dev, "failed to get platform clocks: %d\n", err);
+			goto disable_bus_clock;
+		}
+
+		err = clk_bulk_prepare_enable(pfdev->comp->num_clks,
+					      pfdev->platform_clocks);
+		if (err < 0) {
+			dev_err(pfdev->dev, "failed to enable platform clocks: %d\n", err);
+			goto disable_bus_clock;
+		}
+	}
+
+	return 0;
+
+disable_bus_clock:
+	clk_disable_unprepare(pfdev->bus_clock);
+
+disable_main_clock:
+	clk_disable_unprepare(pfdev->clock);
+	return err;
+}
+
+static void pancsf_clk_fini(struct pancsf_device *pfdev)
+{
+	if (pfdev->platform_clocks) {
+		clk_bulk_disable_unprepare(pfdev->comp->num_clks,
+					   pfdev->platform_clocks);
+	}
+
+	clk_disable_unprepare(pfdev->bus_clock);
+	clk_disable_unprepare(pfdev->clock);
+}
+
+static int pancsf_regulator_init(struct pancsf_device *pfdev)
+{
+	int ret, i;
+
+	pfdev->regulators = devm_kcalloc(pfdev->dev, pfdev->comp->num_supplies,
+					 sizeof(*pfdev->regulators),
+					 GFP_KERNEL);
+	if (!pfdev->regulators)
+		return -ENOMEM;
+
+	for (i = 0; i < pfdev->comp->num_supplies; i++)
+		pfdev->regulators[i].supply = pfdev->comp->supply_names[i];
+
+	ret = devm_regulator_bulk_get(pfdev->dev,
+				      pfdev->comp->num_supplies,
+				      pfdev->regulators);
+	if (ret < 0) {
+		if (ret != -EPROBE_DEFER)
+			dev_err(pfdev->dev, "failed to get regulators: %d\n",
+				ret);
+		return ret;
+	}
+
+	ret = regulator_bulk_enable(pfdev->comp->num_supplies,
+				    pfdev->regulators);
+	if (ret < 0) {
+		dev_err(pfdev->dev, "failed to enable regulators: %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void pancsf_regulator_fini(struct pancsf_device *pfdev)
+{
+	if (!pfdev->regulators)
+		return;
+
+	regulator_bulk_disable(pfdev->comp->num_supplies, pfdev->regulators);
+}
+
+static void pancsf_pm_domain_fini(struct pancsf_device *pfdev)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(pfdev->pm_domain_devs); i++) {
+		if (!pfdev->pm_domain_devs[i])
+			break;
+
+		if (pfdev->pm_domain_links[i])
+			device_link_del(pfdev->pm_domain_links[i]);
+
+		dev_pm_domain_detach(pfdev->pm_domain_devs[i], true);
+	}
+}
+
+static int pancsf_pm_domain_init(struct pancsf_device *pfdev)
+{
+	int err;
+	int i, num_domains;
+
+	num_domains = of_count_phandle_with_args(pfdev->dev->of_node,
+						 "power-domains",
+						 "#power-domain-cells");
+
+	/*
+	 * A single domain is handled by the core, and if only a single power
+	 * domain is required, the property is optional.
+	 */
+	if (num_domains < 2 && pfdev->comp->num_pm_domains < 2)
+		return 0;
+
+	if (num_domains != pfdev->comp->num_pm_domains) {
+		dev_err(pfdev->dev,
+			"Incorrect number of power domains: %d provided, %d needed\n",
+			num_domains, pfdev->comp->num_pm_domains);
+		return -EINVAL;
+	}
+
+	if (WARN(num_domains > ARRAY_SIZE(pfdev->pm_domain_devs),
+		 "Too many supplies in compatible structure.\n"))
+		return -EINVAL;
+
+	for (i = 0; i < num_domains; i++) {
+		pfdev->pm_domain_devs[i] =
+			dev_pm_domain_attach_by_name(pfdev->dev,
+						     pfdev->comp->pm_domain_names[i]);
+		if (IS_ERR_OR_NULL(pfdev->pm_domain_devs[i])) {
+			err = PTR_ERR(pfdev->pm_domain_devs[i]) ? : -ENODATA;
+			pfdev->pm_domain_devs[i] = NULL;
+			dev_err(pfdev->dev,
+				"failed to get pm-domain %s(%d): %d\n",
+				pfdev->comp->pm_domain_names[i], i, err);
+			goto err;
+		}
+
+		pfdev->pm_domain_links[i] = device_link_add(pfdev->dev,
+							    pfdev->pm_domain_devs[i],
+							    DL_FLAG_PM_RUNTIME |
+							    DL_FLAG_STATELESS |
+							    DL_FLAG_RPM_ACTIVE);
+		if (!pfdev->pm_domain_links[i]) {
+			dev_err(pfdev->pm_domain_devs[i],
+				"adding device link failed!\n");
+			err = -ENODEV;
+			goto err;
+		}
+	}
+
+	return 0;
+
+err:
+	pancsf_pm_domain_fini(pfdev);
+	return err;
+}
+
+int pancsf_device_init(struct pancsf_device *pfdev)
+{
+	struct resource *res;
+	int err;
+
+	err = pancsf_clk_init(pfdev);
+	if (err) {
+		dev_err(pfdev->dev, "clk init failed %d\n", err);
+		return err;
+	}
+
+	err = pancsf_regulator_init(pfdev);
+	if (err)
+		goto err_clk_fini;
+
+	err = pancsf_reset_init(pfdev);
+	if (err) {
+		dev_err(pfdev->dev, "reset init failed %d\n", err);
+		goto err_regulator_fini;
+	}
+
+	err = pancsf_pm_domain_init(pfdev);
+	if (err)
+		goto err_reset_fini;
+
+	pfdev->iomem = devm_platform_get_and_ioremap_resource(pfdev->pdev, 0, &res);
+	if (IS_ERR(pfdev->iomem)) {
+		err = PTR_ERR(pfdev->iomem);
+		goto err_pm_domain_fini;
+	}
+
+	pfdev->phys_addr = res->start;
+
+	err = pancsf_gpu_init(pfdev);
+	if (err)
+		goto err_pm_domain_fini;
+
+	err = pancsf_mmu_init(pfdev);
+	if (err)
+		goto err_gpu_fini;
+
+	err = pancsf_mcu_init(pfdev);
+	if (err)
+		goto err_mmu_fini;
+
+	err = pancsf_sched_init(pfdev);
+	if (err)
+		goto err_mcu_fini;
+
+	return 0;
+
+err_mcu_fini:
+	pancsf_mcu_fini(pfdev);
+err_mmu_fini:
+	pancsf_mmu_fini(pfdev);
+err_gpu_fini:
+	pancsf_gpu_fini(pfdev);
+err_pm_domain_fini:
+	pancsf_pm_domain_fini(pfdev);
+err_reset_fini:
+	pancsf_reset_fini(pfdev);
+err_regulator_fini:
+	pancsf_regulator_fini(pfdev);
+err_clk_fini:
+	pancsf_clk_fini(pfdev);
+	return err;
+}
+
+void pancsf_device_fini(struct pancsf_device *pfdev)
+{
+	pancsf_sched_fini(pfdev);
+	pancsf_mcu_fini(pfdev);
+	pancsf_mmu_fini(pfdev);
+	pancsf_gpu_fini(pfdev);
+	pancsf_pm_domain_fini(pfdev);
+	pancsf_reset_fini(pfdev);
+	pancsf_regulator_fini(pfdev);
+	pancsf_clk_fini(pfdev);
+}
+
+#define PANCSF_EXCEPTION(id) \
+	[DRM_PANCSF_EXCEPTION_ ## id] = { \
+		.name = #id, \
+	}
+
+struct pancsf_exception_info {
+	const char *name;
+};
+
+static const struct pancsf_exception_info pancsf_exception_infos[] = {
+	PANCSF_EXCEPTION(OK),
+	PANCSF_EXCEPTION(TERMINATED),
+	PANCSF_EXCEPTION(KABOOM),
+	PANCSF_EXCEPTION(EUREKA),
+	PANCSF_EXCEPTION(ACTIVE),
+	PANCSF_EXCEPTION(CS_RES_TERM),
+	PANCSF_EXCEPTION(CS_CONFIG_FAULT),
+	PANCSF_EXCEPTION(CS_ENDPOINT_FAULT),
+	PANCSF_EXCEPTION(CS_BUS_FAULT),
+	PANCSF_EXCEPTION(CS_INSTR_INVALID),
+	PANCSF_EXCEPTION(CS_CALL_STACK_OVERFLOW),
+	PANCSF_EXCEPTION(CS_INHERIT_FAULT),
+	PANCSF_EXCEPTION(INSTR_INVALID_PC),
+	PANCSF_EXCEPTION(INSTR_INVALID_ENC),
+	PANCSF_EXCEPTION(INSTR_BARRIER_FAULT),
+	PANCSF_EXCEPTION(DATA_INVALID_FAULT),
+	PANCSF_EXCEPTION(TILE_RANGE_FAULT),
+	PANCSF_EXCEPTION(ADDR_RANGE_FAULT),
+	PANCSF_EXCEPTION(IMPRECISE_FAULT),
+	PANCSF_EXCEPTION(OOM),
+	PANCSF_EXCEPTION(CSF_FW_INTERNAL_ERROR),
+	PANCSF_EXCEPTION(CSF_RES_EVICTION_TIMEOUT),
+	PANCSF_EXCEPTION(GPU_BUS_FAULT),
+	PANCSF_EXCEPTION(GPU_SHAREABILITY_FAULT),
+	PANCSF_EXCEPTION(SYS_SHAREABILITY_FAULT),
+	PANCSF_EXCEPTION(GPU_CACHEABILITY_FAULT),
+	PANCSF_EXCEPTION(TRANSLATION_FAULT_0),
+	PANCSF_EXCEPTION(TRANSLATION_FAULT_1),
+	PANCSF_EXCEPTION(TRANSLATION_FAULT_2),
+	PANCSF_EXCEPTION(TRANSLATION_FAULT_3),
+	PANCSF_EXCEPTION(TRANSLATION_FAULT_4),
+	PANCSF_EXCEPTION(PERM_FAULT_0),
+	PANCSF_EXCEPTION(PERM_FAULT_1),
+	PANCSF_EXCEPTION(PERM_FAULT_2),
+	PANCSF_EXCEPTION(PERM_FAULT_3),
+	PANCSF_EXCEPTION(ACCESS_FLAG_1),
+	PANCSF_EXCEPTION(ACCESS_FLAG_2),
+	PANCSF_EXCEPTION(ACCESS_FLAG_3),
+	PANCSF_EXCEPTION(ADDR_SIZE_FAULT_IN),
+	PANCSF_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
+	PANCSF_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
+	PANCSF_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
+	PANCSF_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
+	PANCSF_EXCEPTION(MEM_ATTR_FAULT_0),
+	PANCSF_EXCEPTION(MEM_ATTR_FAULT_1),
+	PANCSF_EXCEPTION(MEM_ATTR_FAULT_2),
+	PANCSF_EXCEPTION(MEM_ATTR_FAULT_3),
+};
+
+const char *pancsf_exception_name(u32 exception_code)
+{
+	if (WARN_ON(exception_code >= ARRAY_SIZE(pancsf_exception_infos) ||
+		    !pancsf_exception_infos[exception_code].name))
+		return "Unknown exception type";
+
+	return pancsf_exception_infos[exception_code].name;
+}
+
+#ifdef CONFIG_PM
+int pancsf_device_resume(struct device *dev)
+{
+	return 0;
+}
+
+int pancsf_device_suspend(struct device *dev)
+{
+	/* FIXME: PM support */
+	return -EBUSY;
+}
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_device.h b/drivers/gpu/drm/pancsf/pancsf_device.h
new file mode 100644
index 000000000000..634bdaaf3a81
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_device.h
@@ -0,0 +1,168 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANCSF_DEVICE_H__
+#define __PANCSF_DEVICE_H__
+
+#include <linux/atomic.h>
+#include <linux/io-pgtable.h>
+#include <linux/regulator/consumer.h>
+#include <linux/spinlock.h>
+#include <drm/drm_device.h>
+#include <drm/drm_mm.h>
+#include <drm/gpu_scheduler.h>
+#include <drm/pancsf_drm.h>
+
+struct pancsf_csf;
+struct pancsf_csf_ctx;
+struct pancsf_device;
+struct pancsf_fw_iface;
+struct pancsf_gpu;
+struct pancsf_group_pool;
+struct pancsf_heap_pool;
+struct pancsf_job;
+struct pancsf_mmu;
+struct pancsf_mcu;
+struct pancsf_perfcnt;
+struct pancsf_vm;
+struct pancsf_vm_pool;
+struct pancsf_vm_bind_queue_pool;
+
+#define MAX_PM_DOMAINS 3
+
+/*
+ * Features that cannot be automatically detected and need matching using the
+ * compatible string, typically SoC-specific.
+ */
+struct pancsf_compatible {
+	/* Supplies count and names. */
+	int num_supplies;
+	const char * const *supply_names;
+	/*
+	 * Number of power domains required, note that values 0 and 1 are
+	 * handled identically, as only values > 1 need special handling.
+	 */
+	int num_pm_domains;
+	/* Only required if num_pm_domains > 1. */
+	const char * const *pm_domain_names;
+
+	/* Clocks count and names. */
+	int num_clks;
+	const char * const *clk_names;
+
+	/* Vendor implementation quirks callback */
+	void (*vendor_quirk)(struct pancsf_device *pfdev);
+};
+
+struct pancsf_device {
+	struct device *dev;
+	struct drm_device *ddev;
+	struct platform_device *pdev;
+
+	phys_addr_t phys_addr;
+	void __iomem *iomem;
+	struct clk *clock;
+	struct clk *bus_clock;
+	struct clk_bulk_data *platform_clocks;
+	struct regulator_bulk_data *regulators;
+	struct reset_control *rstc;
+	/* pm_domains for devices with more than one. */
+	struct device *pm_domain_devs[MAX_PM_DOMAINS];
+	struct device_link *pm_domain_links[MAX_PM_DOMAINS];
+	bool coherent;
+
+	struct drm_pancsf_gpu_info gpu_info;
+	struct drm_pancsf_csif_info csif_info;
+
+	const struct pancsf_compatible *comp;
+
+	struct pancsf_gpu *gpu;
+	struct pancsf_mcu *mcu;
+	struct pancsf_mmu *mmu;
+	struct pancsf_fw_iface *iface;
+	struct pancsf_scheduler *scheduler;
+};
+
+struct pancsf_file {
+	struct pancsf_device *pfdev;
+	struct pancsf_vm_pool *vms;
+	struct pancsf_vm_bind_queue_pool *vm_bind_queues;
+
+	struct mutex heaps_lock;
+	struct pancsf_heap_pool *heaps;
+	struct pancsf_group_pool *groups;
+};
+
+static inline struct pancsf_device *to_pancsf_device(struct drm_device *ddev)
+{
+	return ddev->dev_private;
+}
+
+int pancsf_device_init(struct pancsf_device *pfdev);
+void pancsf_device_fini(struct pancsf_device *pfdev);
+
+int pancsf_device_resume(struct device *dev);
+int pancsf_device_suspend(struct device *dev);
+
+enum drm_pancsf_exception_type {
+	DRM_PANCSF_EXCEPTION_OK = 0x00,
+	DRM_PANCSF_EXCEPTION_TERMINATED = 0x04,
+	DRM_PANCSF_EXCEPTION_KABOOM = 0x05,
+	DRM_PANCSF_EXCEPTION_EUREKA = 0x06,
+	DRM_PANCSF_EXCEPTION_ACTIVE = 0x08,
+	DRM_PANCSF_EXCEPTION_CS_RES_TERM = 0x0f,
+	DRM_PANCSF_EXCEPTION_MAX_NON_FAULT = 0x3f,
+	DRM_PANCSF_EXCEPTION_CS_CONFIG_FAULT = 0x40,
+	DRM_PANCSF_EXCEPTION_CS_ENDPOINT_FAULT = 0x44,
+	DRM_PANCSF_EXCEPTION_CS_BUS_FAULT = 0x48,
+	DRM_PANCSF_EXCEPTION_CS_INSTR_INVALID = 0x49,
+	DRM_PANCSF_EXCEPTION_CS_CALL_STACK_OVERFLOW = 0x4a,
+	DRM_PANCSF_EXCEPTION_CS_INHERIT_FAULT = 0x4b,
+	DRM_PANCSF_EXCEPTION_INSTR_INVALID_PC = 0x50,
+	DRM_PANCSF_EXCEPTION_INSTR_INVALID_ENC = 0x51,
+	DRM_PANCSF_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
+	DRM_PANCSF_EXCEPTION_DATA_INVALID_FAULT = 0x58,
+	DRM_PANCSF_EXCEPTION_TILE_RANGE_FAULT = 0x59,
+	DRM_PANCSF_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
+	DRM_PANCSF_EXCEPTION_IMPRECISE_FAULT = 0x5b,
+	DRM_PANCSF_EXCEPTION_OOM = 0x60,
+	DRM_PANCSF_EXCEPTION_CSF_FW_INTERNAL_ERROR = 0x68,
+	DRM_PANCSF_EXCEPTION_CSF_RES_EVICTION_TIMEOUT = 0x69,
+	DRM_PANCSF_EXCEPTION_GPU_BUS_FAULT = 0x80,
+	DRM_PANCSF_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
+	DRM_PANCSF_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
+	DRM_PANCSF_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
+	DRM_PANCSF_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
+	DRM_PANCSF_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
+	DRM_PANCSF_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
+	DRM_PANCSF_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
+	DRM_PANCSF_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
+	DRM_PANCSF_EXCEPTION_PERM_FAULT_0 = 0xc8,
+	DRM_PANCSF_EXCEPTION_PERM_FAULT_1 = 0xc9,
+	DRM_PANCSF_EXCEPTION_PERM_FAULT_2 = 0xca,
+	DRM_PANCSF_EXCEPTION_PERM_FAULT_3 = 0xcb,
+	DRM_PANCSF_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
+	DRM_PANCSF_EXCEPTION_ACCESS_FLAG_2 = 0xda,
+	DRM_PANCSF_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
+	DRM_PANCSF_EXCEPTION_ADDR_SIZE_FAULT_IN = 0xe0,
+	DRM_PANCSF_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
+	DRM_PANCSF_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
+	DRM_PANCSF_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
+	DRM_PANCSF_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
+	DRM_PANCSF_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
+	DRM_PANCSF_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
+	DRM_PANCSF_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
+	DRM_PANCSF_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
+};
+
+static inline bool
+pancsf_exception_is_fault(u32 exception_code)
+{
+	return exception_code > DRM_PANCSF_EXCEPTION_MAX_NON_FAULT;
+}
+
+const char *pancsf_exception_name(u32 exception_code);
+
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_drv.c b/drivers/gpu/drm/pancsf/pancsf_drv.c
new file mode 100644
index 000000000000..05c4df13ba90
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_drv.c
@@ -0,0 +1,812 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
+/* Copyright 2019 Collabora ltd. */
+
+#include <linux/module.h>
+#include <linux/of_platform.h>
+#include <linux/pagemap.h>
+#include <linux/pm_runtime.h>
+#include <drm/pancsf_drm.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_ioctl.h>
+#include <drm/drm_syncobj.h>
+#include <drm/drm_utils.h>
+#include <drm/drm_debugfs.h>
+
+#include "pancsf_sched.h"
+#include "pancsf_device.h"
+#include "pancsf_gem.h"
+#include "pancsf_heap.h"
+#include "pancsf_mcu.h"
+#include "pancsf_mmu.h"
+#include "pancsf_gpu.h"
+
+#define DRM_PANCSF_SYNC_OP_MIN_SIZE		24
+#define DRM_PANCSF_QUEUE_SUBMIT_MIN_SIZE	32
+#define DRM_PANCSF_QUEUE_CREATE_MIN_SIZE	8
+#define DRM_PANCSF_VM_BIND_OP_MIN_SIZE		48
+
+static int pancsf_ioctl_dev_query(struct drm_device *ddev, void *data, struct drm_file *file)
+{
+	struct drm_pancsf_dev_query *args = data;
+	struct pancsf_device *pfdev = ddev->dev_private;
+	const void *src;
+	size_t src_size;
+
+	switch (args->type) {
+	case DRM_PANCSF_DEV_QUERY_GPU_INFO:
+		src_size = sizeof(pfdev->gpu_info);
+		src = &pfdev->gpu_info;
+		break;
+	case DRM_PANCSF_DEV_QUERY_CSIF_INFO:
+		src_size = sizeof(pfdev->csif_info);
+		src = &pfdev->csif_info;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (!args->pointer) {
+		args->size = src_size;
+		return 0;
+	}
+
+	args->size = min_t(unsigned long, src_size, args->size);
+	if (copy_to_user((void __user *)(uintptr_t)args->pointer, src, args->size))
+		return -EFAULT;
+
+	return 0;
+}
+
+#define PANCSF_MAX_VMS_PER_FILE		32
+#define PANCSF_VM_CREATE_FLAGS		0
+
+int pancsf_ioctl_vm_create(struct drm_device *ddev, void *data,
+			   struct drm_file *file)
+{
+	struct pancsf_device *pfdev = ddev->dev_private;
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_vm_create *args = data;
+	int ret;
+
+	if (args->flags & ~PANCSF_VM_CREATE_FLAGS)
+		return -EINVAL;
+
+	ret = pancsf_vm_pool_create_vm(pfdev, pfile->vms);
+	if (ret < 0)
+		return ret;
+
+	args->id = ret;
+	return 0;
+}
+
+int pancsf_ioctl_vm_destroy(struct drm_device *ddev, void *data,
+			    struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_vm_destroy *args = data;
+
+	pancsf_vm_pool_destroy_vm(pfile->vms, args->id);
+	return 0;
+}
+
+#define PANCSF_BO_FLAGS		0
+
+static int pancsf_ioctl_bo_create(struct drm_device *ddev, void *data,
+				  struct drm_file *file)
+{
+	struct pancsf_gem_object *bo;
+	struct drm_pancsf_bo_create *args = data;
+
+	if (!args->size || args->pad ||
+	    (args->flags & ~PANCSF_BO_FLAGS))
+		return -EINVAL;
+
+	bo = pancsf_gem_create_with_handle(file, ddev, args->size, args->flags,
+					   &args->handle);
+	if (IS_ERR(bo))
+		return PTR_ERR(bo);
+
+	return 0;
+}
+
+#define PANCSF_VMA_MAP_FLAGS (PANCSF_VMA_MAP_READONLY | \
+			      PANCSF_VMA_MAP_NOEXEC | \
+			      PANCSF_VMA_MAP_UNCACHED | \
+			      PANCSF_VMA_MAP_FRAG_SHADER | \
+			      PANCSF_VMA_MAP_ON_FAULT | \
+			      PANCSF_VMA_MAP_AUTO_VA)
+
+static int pancsf_ioctl_vm_map(struct drm_device *ddev, void *data,
+			       struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_vm_map *args = data;
+	struct drm_gem_object *gem;
+	struct pancsf_vm *vm;
+	int ret;
+
+	if (args->flags & ~PANCSF_VMA_MAP_FLAGS)
+		return -EINVAL;
+
+	gem = drm_gem_object_lookup(file, args->bo_handle);
+	if (!gem)
+		return -EINVAL;
+
+	vm = pancsf_vm_pool_get_vm(pfile->vms, args->vm_id);
+	if (vm) {
+		ret = pancsf_vm_map_bo_range(vm, to_pancsf_bo(gem), args->bo_offset,
+					     args->size, &args->va, args->flags);
+	} else {
+		ret = -EINVAL;
+	}
+
+	pancsf_vm_put(vm);
+	drm_gem_object_put(gem);
+	return ret;
+}
+
+#define PANCSF_VMA_UNMAP_FLAGS 0
+
+static int pancsf_ioctl_vm_unmap(struct drm_device *ddev, void *data,
+				 struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_vm_unmap *args = data;
+	struct pancsf_vm *vm;
+	int ret;
+
+	if (args->flags & ~PANCSF_VMA_UNMAP_FLAGS)
+		return -EINVAL;
+
+	vm = pancsf_vm_pool_get_vm(pfile->vms, args->vm_id);
+	if (vm)
+		ret = pancsf_vm_unmap_range(vm, args->va, args->size);
+	else
+		ret = -EINVAL;
+
+	pancsf_vm_put(vm);
+	return ret;
+}
+
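+/*
+ * Copy in a userspace array of objects whose per-element stride may be
+ * larger than the kernel's current struct size (to allow future uAPI
+ * extensions): only the first min_stride bytes of each element are used,
+ * and strides smaller than min_stride are rejected.
+ */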
+static void *pancsf_get_obj_array(struct drm_pancsf_obj_array *in, u32 min_stride)
+{
+	u32 elem_size = min_t(u32, in->stride, min_stride);
+	int ret = 0;
+	void *out;
+
+	if (in->stride < min_stride)
+		return ERR_PTR(-EINVAL);
+
+	out = kvmalloc_array(in->count, elem_size, GFP_KERNEL);
+	if (!out)
+		return ERR_PTR(-ENOMEM);
+
+	if (elem_size == in->stride) {
+		if (copy_from_user(out, u64_to_user_ptr(in->array), elem_size * in->count))
+			ret = -EFAULT;
+	} else {
+		void __user *in_ptr = u64_to_user_ptr(in->array);
+		void *out_ptr = out;
+		u32 i;
+
+		for (i = 0; i < in->count; i++) {
+			if (copy_from_user(out_ptr, in_ptr, elem_size)) {
+				ret = -EFAULT;
+				break;
+			}
+
+			in_ptr += in->stride;
+			out_ptr += elem_size;
+		}
+	}
+
+	if (ret) {
+		kvfree(out);
+		return ERR_PTR(ret);
+	}
+
+	return out;
+}
+
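+/*
+ * Turn all WAIT sync operations into job dependencies by looking up the
+ * corresponding (timeline) syncobj fences. SIGNAL operations are collected
+ * separately in pancsf_collect_sync_signal_array().
+ */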
+static int pancsf_add_job_deps(struct drm_file *file, struct pancsf_job *job,
+			       struct drm_pancsf_sync_op *sync_ops, u32 sync_op_count)
+{
+	u32 i;
+
+	for (i = 0; i < sync_op_count; i++) {
+		struct dma_fence *fence;
+		int ret;
+
+		if (sync_ops[i].op_type != DRM_PANCSF_SYNC_OP_WAIT)
+			continue;
+
+		switch (sync_ops[i].handle_type) {
+		case DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ:
+		case DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ:
+			ret = drm_syncobj_find_fence(file, sync_ops[i].handle,
+						     sync_ops[i].timeline_value,
+						     0, &fence);
+			if (ret)
+				return ret;
+
+			ret = pancsf_add_job_dep(job, fence);
+			if (ret) {
+				dma_fence_put(fence);
+				return ret;
+			}
+			break;
+
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+struct pancsf_sync_signal {
+	struct drm_syncobj *syncobj;
+	struct dma_fence_chain *chain;
+	u64 point;
+};
+
+struct pancsf_sync_signal_array {
+	struct pancsf_sync_signal *signals;
+	u32 count;
+};
+
+void
+pancsf_free_sync_signal_array(struct pancsf_sync_signal_array *array)
+{
+	u32 i;
+
+	for (i = 0; i < array->count; i++) {
+		drm_syncobj_put(array->signals[i].syncobj);
+		dma_fence_chain_free(array->signals[i].chain);
+	}
+
+	kvfree(array->signals);
+	array->signals = NULL;
+	array->count = 0;
+}
+
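+/*
+ * Collect all SIGNAL sync operations into @array: syncobj handles are
+ * resolved now, and dma_fence_chain nodes for timeline syncobjs are
+ * pre-allocated so the later signaling path doesn't have to allocate.
+ */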
+int
+pancsf_collect_sync_signal_array(struct drm_file *file,
+				 struct drm_pancsf_sync_op *sync_ops, u32 sync_op_count,
+				 struct pancsf_sync_signal_array *array)
+{
+	u32 count = 0, i;
+	int ret;
+
+	for (i = 0; i < sync_op_count; i++) {
+		if (sync_ops[i].op_type == DRM_PANCSF_SYNC_OP_SIGNAL)
+			count++;
+	}
+
+	array->signals = kvmalloc_array(count, sizeof(*array->signals), GFP_KERNEL | __GFP_ZERO);
+	if (!array->signals)
+		return -ENOMEM;
+
+	for (i = 0; i < sync_op_count; i++) {
+		if (sync_ops[i].op_type != DRM_PANCSF_SYNC_OP_SIGNAL)
+			continue;
+
+		switch (sync_ops[i].handle_type) {
+		case DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ:
+			array->signals[array->count].chain = dma_fence_chain_alloc();
+			if (!array->signals[array->count].chain) {
+				ret = -ENOMEM;
+				goto err;
+			}
+
+			array->signals[array->count].point = sync_ops[i].timeline_value;
+			fallthrough;
+
+		case DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ:
+			array->signals[array->count].syncobj = drm_syncobj_find(file, sync_ops[i].handle);
+			if (!array->signals[array->count].syncobj) {
+				ret = -EINVAL;
+				goto err;
+			}
+
+			array->count++;
+			break;
+
+		default:
+			ret = -EINVAL;
+			goto err;
+		}
+	}
+
+	return 0;
+
+err:
+	pancsf_free_sync_signal_array(array);
+	return ret;
+}
+
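+/*
+ * Attach the job done fence to every collected SIGNAL operation: timeline
+ * syncobjs get the fence added as a new chain point, binary syncobjs get
+ * their fence replaced.
+ */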
+static void pancsf_attach_done_fence(struct drm_file *file, struct dma_fence *done_fence,
+				     struct pancsf_sync_signal_array *signal_array)
+{
+	u32 i;
+
+	for (i = 0; i < signal_array->count; i++) {
+		if (signal_array->signals[i].chain) {
+			drm_syncobj_add_point(signal_array->signals[i].syncobj,
+					      signal_array->signals[i].chain,
+					      done_fence,
+					      signal_array->signals[i].point);
+			signal_array->signals[i].chain = NULL;
+		} else {
+			drm_syncobj_replace_fence(signal_array->signals[i].syncobj, done_fence);
+		}
+	}
+}
+
+static int pancsf_ioctl_bo_mmap_offset(struct drm_device *ddev, void *data,
+				       struct drm_file *file)
+{
+	struct drm_pancsf_bo_mmap_offset *args = data;
+
+	return drm_gem_dumb_map_offset(file, ddev, args->handle, &args->offset);
+}
+
+static int pancsf_ioctl_group_submit(struct drm_device *ddev, void *data,
+				     struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_group_submit *args = data;
+	struct drm_pancsf_queue_submit *queue_submits;
+	struct pancsf_job **jobs = NULL;
+	struct drm_pancsf_sync_op *sync_ops = NULL;
+	struct pancsf_sync_signal_array *sync_signal_arrays;
+	int ret = 0;
+	u32 i;
+
+	queue_submits = pancsf_get_obj_array(&args->queue_submits,
+					     DRM_PANCSF_QUEUE_SUBMIT_MIN_SIZE);
+	if (IS_ERR(queue_submits))
+		return PTR_ERR(queue_submits);
+
+	jobs = kvmalloc_array(args->queue_submits.count, sizeof(*jobs), GFP_KERNEL | __GFP_ZERO);
+	sync_signal_arrays = kvmalloc_array(args->queue_submits.count, sizeof(*sync_signal_arrays),
+					    GFP_KERNEL | __GFP_ZERO);
+	if (!jobs || !sync_signal_arrays) {
+		ret = -ENOMEM;
+		goto out_free_tmp_objs;
+	}
+
+	for (i = 0; i < args->queue_submits.count; i++) {
+		struct drm_pancsf_queue_submit *qsubmit = &queue_submits[i];
+		struct pancsf_call_info cs_call = {
+			.start = qsubmit->stream_addr,
+			.size = qsubmit->stream_size,
+		};
+
+		jobs[i] = pancsf_create_job(pfile, args->group_handle,
+					    qsubmit->queue_index, &cs_call);
+		if (IS_ERR(jobs[i])) {
+			ret = PTR_ERR(jobs[i]);
+			jobs[i] = NULL;
+			goto out_free_tmp_objs;
+		}
+
+		sync_ops = pancsf_get_obj_array(&qsubmit->syncs, DRM_PANCSF_SYNC_OP_MIN_SIZE);
+		if (IS_ERR(sync_ops)) {
+			ret = PTR_ERR(sync_ops);
+			sync_ops = NULL;
+			goto out_free_tmp_objs;
+		}
+
+		ret = pancsf_add_job_deps(file, jobs[i], sync_ops, qsubmit->syncs.count);
+		if (ret)
+			goto out_free_tmp_objs;
+
+		ret = pancsf_collect_sync_signal_array(file, sync_ops, qsubmit->syncs.count,
+						       &sync_signal_arrays[i]);
+		if (ret)
+			goto out_free_tmp_objs;
+
+		kvfree(sync_ops);
+		sync_ops = NULL;
+	}
+
+	for (i = 0; i < args->queue_submits.count; i++) {
+		pancsf_attach_done_fence(file, pancsf_get_job_done_fence(jobs[i]),
+					 &sync_signal_arrays[i]);
+	}
+
+	for (i = 0; i < args->queue_submits.count; i++) {
+		struct dma_fence *done_fence = pancsf_get_job_done_fence(jobs[i]);
+
+		if (!ret) {
+			ret = pancsf_push_job(jobs[i]);
+			if (ret) {
+				dma_fence_set_error(done_fence, ret);
+				dma_fence_signal(done_fence);
+			}
+		} else  {
+			dma_fence_set_error(done_fence, -ECANCELED);
+			dma_fence_signal(done_fence);
+		}
+	}
+
+out_free_tmp_objs:
+	if (sync_signal_arrays) {
+		for (i = 0; i < args->queue_submits.count; i++)
+			pancsf_free_sync_signal_array(&sync_signal_arrays[i]);
+		kvfree(sync_signal_arrays);
+	}
+
+	if (jobs) {
+		for (i = 0; i < args->queue_submits.count; i++)
+			pancsf_put_job(jobs[i]);
+
+		kvfree(jobs);
+	}
+
+	kvfree(queue_submits);
+	kvfree(sync_ops);
+
+	return ret;
+}
+
+static int pancsf_ioctl_group_destroy(struct drm_device *ddev, void *data,
+				      struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_group_destroy *args = data;
+
+	pancsf_destroy_group(pfile, args->group_handle);
+	return 0;
+}
+
+static int pancsf_ioctl_group_create(struct drm_device *ddev, void *data,
+				     struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_group_create *args = data;
+	struct drm_pancsf_queue_create *queue_args;
+	int ret;
+
+	if (!args->queues.count)
+		return -EINVAL;
+
+	queue_args = pancsf_get_obj_array(&args->queues, DRM_PANCSF_QUEUE_CREATE_MIN_SIZE);
+	if (IS_ERR(queue_args))
+		return PTR_ERR(queue_args);
+
+	ret = pancsf_create_group(pfile, args, queue_args);
+	if (ret >= 0) {
+		args->group_handle = ret;
+		ret = 0;
+	}
+
+	kvfree(queue_args);
+	return ret;
+}
+
+static int pancsf_ioctl_tiler_heap_create(struct drm_device *ddev, void *data,
+					  struct drm_file *file)
+{
+	struct pancsf_device *pfdev = ddev->dev_private;
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_tiler_heap_create *args = data;
+	struct pancsf_heap_pool *pool;
+	struct pancsf_vm *vm;
+	int ret;
+
+	vm = pancsf_vm_pool_get_vm(pfile->vms, args->vm_id);
+	if (!vm)
+		return -EINVAL;
+
+	mutex_lock(&pfile->heaps_lock);
+	if (IS_ERR_OR_NULL(pfile->heaps))
+		pfile->heaps = pancsf_heap_pool_create(pfdev, vm);
+	pool = pfile->heaps;
+	mutex_unlock(&pfile->heaps_lock);
+
+	if (IS_ERR(pool)) {
+		ret = PTR_ERR(pool);
+		goto out_vm_put;
+	}
+
+	ret = pancsf_heap_create(pool,
+				 args->initial_chunk_count,
+				 args->chunk_size,
+				 args->max_chunks,
+				 args->target_in_flight,
+				 &args->tiler_heap_ctx_gpu_va,
+				 &args->first_heap_chunk_gpu_va);
+	if (ret < 0)
+		goto out_vm_put;
+
+	args->handle = ret;
+	ret = 0;
+
+out_vm_put:
+	pancsf_vm_put(vm);
+	return ret;
+}
+
+static int pancsf_ioctl_tiler_heap_destroy(struct drm_device *ddev, void *data,
+					   struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+	struct drm_pancsf_tiler_heap_destroy *args = data;
+	struct pancsf_heap_pool *pool;
+
+	mutex_lock(&pfile->heaps_lock);
+	pool = pfile->heaps;
+	mutex_unlock(&pfile->heaps_lock);
+
+	if (IS_ERR_OR_NULL(pool))
+		return -EINVAL;
+
+	return pancsf_heap_destroy(pool, args->handle);
+}
+
+static int
+pancsf_open(struct drm_device *ddev, struct drm_file *file)
+{
+	int ret;
+	struct pancsf_device *pfdev = ddev->dev_private;
+	struct pancsf_file *pfile;
+
+	pfile = kzalloc(sizeof(*pfile), GFP_KERNEL);
+	if (!pfile)
+		return -ENOMEM;
+
+	/* Heap pool is created on-demand, we just init the lock to serialize
+	 * the pool creation/destruction here.
+	 */
+	mutex_init(&pfile->heaps_lock);
+
+	pfile->pfdev = pfdev;
+
+	ret = pancsf_vm_pool_create(pfile);
+	if (ret)
+		goto err_destroy_heaps_lock;
+
+	ret = pancsf_group_pool_create(pfile);
+	if (ret)
+		goto err_destroy_vm_pool;
+
+	file->driver_priv = pfile;
+	return 0;
+
+err_destroy_vm_pool:
+	pancsf_vm_pool_destroy(pfile);
+
+err_destroy_heaps_lock:
+	mutex_destroy(&pfile->heaps_lock);
+	kfree(pfile);
+	return ret;
+}
+
+static void
+pancsf_postclose(struct drm_device *ddev, struct drm_file *file)
+{
+	struct pancsf_file *pfile = file->driver_priv;
+
+	pancsf_group_pool_destroy(pfile);
+
+	mutex_lock(&pfile->heaps_lock);
+	pancsf_heap_pool_destroy(pfile->heaps);
+	mutex_unlock(&pfile->heaps_lock);
+	mutex_destroy(&pfile->heaps_lock);
+
+	pancsf_vm_pool_destroy(pfile);
+
+	kfree(pfile);
+}
+
+static const struct drm_ioctl_desc pancsf_drm_driver_ioctls[] = {
+#define PANCSF_IOCTL(n, func, flags) \
+	DRM_IOCTL_DEF_DRV(PANCSF_##n, pancsf_ioctl_##func, flags)
+
+	PANCSF_IOCTL(DEV_QUERY,		dev_query,	DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(VM_CREATE,		vm_create,	DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(VM_DESTROY,	vm_destroy,	DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(BO_CREATE,		bo_create,	DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(BO_MMAP_OFFSET,	bo_mmap_offset,	DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(VM_MAP,		vm_map,		DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(VM_UNMAP,		vm_unmap,	DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(GROUP_CREATE,	group_create, DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(GROUP_DESTROY,	group_destroy, DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(TILER_HEAP_CREATE,	tiler_heap_create, DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
+	PANCSF_IOCTL(GROUP_SUBMIT,	group_submit,	DRM_RENDER_ALLOW),
+};
+
+static int pancsf_mmap_io(struct file *filp, struct vm_area_struct *vma)
+{
+	struct drm_file *priv = filp->private_data;
+	struct pancsf_file *pfile = priv->driver_priv;
+	struct pancsf_device *pfdev = pfile->pfdev;
+	phys_addr_t phys_offset;
+	size_t size;
+
+	switch (vma->vm_pgoff << PAGE_SHIFT) {
+	case DRM_PANCSF_USER_FLUSH_ID_MMIO_OFFSET:
+		size = PAGE_SIZE;
+		phys_offset = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (vma->vm_end - vma->vm_start != size)
+		return -EINVAL;
+
+	vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE |
+			 VM_DONTDUMP | VM_PFNMAP;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	return io_remap_pfn_range(vma,
+				  vma->vm_start,
+				  (pfdev->phys_addr + phys_offset) >> PAGE_SHIFT,
+				  size,
+				  vma->vm_page_prot);
+}
+
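+/*
+ * mmap offsets at or above DRM_PANCSF_USER_MMIO_OFFSET are reserved for
+ * driver-defined MMIO regions (currently only the flush ID page); anything
+ * below that is a regular GEM object mapping.
+ */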
+static int pancsf_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	if (vma->vm_pgoff >= (DRM_PANCSF_USER_MMIO_OFFSET >> PAGE_SHIFT))
+		return pancsf_mmap_io(filp, vma);
+
+	return drm_gem_mmap(filp, vma);
+}
+
+static const struct file_operations pancsf_drm_driver_fops = {
+	.open = drm_open,
+	.release = drm_release,
+	.unlocked_ioctl = drm_ioctl,
+	.compat_ioctl = drm_compat_ioctl,
+	.poll = drm_poll,
+	.read = drm_read,
+	.llseek = noop_llseek,
+	.mmap = pancsf_mmap,
+};
+
+/*
+ * PanCSF driver version:
+ * - 1.0 - initial interface
+ */
+static const struct drm_driver pancsf_drm_driver = {
+	.driver_features	= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
+				  DRIVER_SYNCOBJ_TIMELINE,
+	.open			= pancsf_open,
+	.postclose		= pancsf_postclose,
+	.ioctls			= pancsf_drm_driver_ioctls,
+	.num_ioctls		= ARRAY_SIZE(pancsf_drm_driver_ioctls),
+	.fops			= &pancsf_drm_driver_fops,
+	.name			= "pancsf",
+	.desc			= "pancsf DRM",
+	.date			= "20230120",
+	.major			= 1,
+	.minor			= 0,
+
+	.gem_create_object	= pancsf_gem_create_object,
+	.prime_handle_to_fd	= drm_gem_prime_handle_to_fd,
+	.prime_fd_to_handle	= drm_gem_prime_fd_to_handle,
+	.gem_prime_import_sg_table = pancsf_gem_prime_import_sg_table,
+	.gem_prime_mmap		= drm_gem_prime_mmap,
+};
+
+static int pancsf_probe(struct platform_device *pdev)
+{
+	struct pancsf_device *pfdev;
+	struct drm_device *ddev;
+	int err;
+
+	pfdev = devm_kzalloc(&pdev->dev, sizeof(*pfdev), GFP_KERNEL);
+	if (!pfdev)
+		return -ENOMEM;
+
+	pfdev->pdev = pdev;
+	pfdev->dev = &pdev->dev;
+
+	platform_set_drvdata(pdev, pfdev);
+
+	pfdev->comp = of_device_get_match_data(&pdev->dev);
+	if (!pfdev->comp)
+		return -ENODEV;
+
+	pfdev->coherent = device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT;
+
+	/* Allocate and initialize the DRM device. */
+	ddev = drm_dev_alloc(&pancsf_drm_driver, &pdev->dev);
+	if (IS_ERR(ddev))
+		return PTR_ERR(ddev);
+
+	ddev->dev_private = pfdev;
+	pfdev->ddev = ddev;
+
+	err = pancsf_device_init(pfdev);
+	if (err) {
+		if (err != -EPROBE_DEFER)
+			dev_err(&pdev->dev, "Fatal error during GPU init\n");
+		goto err_out0;
+	}
+
+	pm_runtime_set_active(pfdev->dev);
+	pm_runtime_mark_last_busy(pfdev->dev);
+	pm_runtime_enable(pfdev->dev);
+	pm_runtime_set_autosuspend_delay(pfdev->dev, 50); /* ~3 frames */
+	pm_runtime_use_autosuspend(pfdev->dev);
+
+	/* Register the DRM device with the core. */
+	err = drm_dev_register(ddev, 0);
+	if (err < 0)
+		goto err_out1;
+
+	return 0;
+
+err_out1:
+	pm_runtime_disable(pfdev->dev);
+	pancsf_device_fini(pfdev);
+	pm_runtime_set_suspended(pfdev->dev);
+err_out0:
+	drm_dev_put(ddev);
+	return err;
+}
+
+static int pancsf_remove(struct platform_device *pdev)
+{
+	struct pancsf_device *pfdev = platform_get_drvdata(pdev);
+	struct drm_device *ddev = pfdev->ddev;
+
+	drm_dev_unregister(ddev);
+
+	pm_runtime_get_sync(pfdev->dev);
+	pm_runtime_disable(pfdev->dev);
+	pancsf_device_fini(pfdev);
+	pm_runtime_set_suspended(pfdev->dev);
+
+	drm_dev_put(ddev);
+	return 0;
+}
+
+/*
+ * The OPP core wants the supply names to be NULL terminated, but we need the
+ * correct num_supplies value for regulator core. Hence, we NULL terminate here
+ * and then initialize num_supplies with ARRAY_SIZE - 1.
+ */
+static const char * const rockchip_rk3588_supplies[] = { "mali", "sram", NULL };
+static const char * const rockchip_rk3588_clks[] = { "coregroup", "stacks" };
+static const struct pancsf_compatible rockchip_rk3588_data = {
+	.num_supplies = ARRAY_SIZE(rockchip_rk3588_supplies) - 1,
+	.supply_names = rockchip_rk3588_supplies,
+	.num_pm_domains = 1,
+	.pm_domain_names = NULL,
+	.num_clks = ARRAY_SIZE(rockchip_rk3588_clks),
+	.clk_names = rockchip_rk3588_clks,
+};
+
+static const struct of_device_id dt_match[] = {
+	{ .compatible = "rockchip,rk3588-mali", .data = &rockchip_rk3588_data },
+	{}
+};
+MODULE_DEVICE_TABLE(of, dt_match);
+
+static const struct dev_pm_ops pancsf_pm_ops = {
+	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
+	SET_RUNTIME_PM_OPS(pancsf_device_suspend, pancsf_device_resume, NULL)
+};
+
+static struct platform_driver pancsf_driver = {
+	.probe		= pancsf_probe,
+	.remove		= pancsf_remove,
+	.driver		= {
+		.name	= "pancsf",
+		.pm	= &pancsf_pm_ops,
+		.of_match_table = dt_match,
+	},
+};
+module_platform_driver(pancsf_driver);
+
+MODULE_AUTHOR("Panfrost Project Developers");
+MODULE_DESCRIPTION("Panfrost CSF DRM Driver");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/gpu/drm/pancsf/pancsf_gem.c b/drivers/gpu/drm/pancsf/pancsf_gem.c
new file mode 100644
index 000000000000..96f738d7894d
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_gem.c
@@ -0,0 +1,161 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-mapping.h>
+
+#include <drm/pancsf_drm.h>
+#include "pancsf_device.h"
+#include "pancsf_gem.h"
+#include "pancsf_mmu.h"
+
+/* Called by the DRM core on the last userspace/kernel unreference of
+ * the BO.
+ */
+static void pancsf_gem_free_object(struct drm_gem_object *obj)
+{
+	struct pancsf_gem_object *bo = to_pancsf_bo(obj);
+
+	drm_gem_free_mmap_offset(&bo->base.base);
+	drm_gem_shmem_free(&bo->base);
+}
+
+void pancsf_gem_unmap_and_put(struct pancsf_vm *vm, struct pancsf_gem_object *bo,
+			      u64 gpu_va, void *cpu_va)
+{
+	if (cpu_va) {
+		struct iosys_map map = IOSYS_MAP_INIT_VADDR(cpu_va);
+
+		drm_gem_shmem_vunmap(&bo->base, &map);
+	}
+
+	WARN_ON(pancsf_vm_unmap_range(vm, gpu_va, bo->base.base.size));
+	drm_gem_object_put(&bo->base.base);
+}
+
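+/*
+ * Allocate a shmem BO, map it in @vm (the resulting GPU VA is returned in
+ * @gpu_va) and, if @cpu_va is non-NULL, also vmap it so the caller gets a
+ * kernel CPU pointer to the same memory.
+ */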
+struct pancsf_gem_object *
+pancsf_gem_create_and_map(struct pancsf_device *pfdev, struct pancsf_vm *vm,
+			  size_t size, u32 bo_flags, u32 vm_map_flags,
+			  u64 *gpu_va, void **cpu_va)
+{
+	struct drm_gem_shmem_object *obj;
+	struct pancsf_gem_object *bo;
+	int ret;
+
+	obj = drm_gem_shmem_create(pfdev->ddev, size);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	bo = to_pancsf_bo(&obj->base);
+
+	ret = pancsf_vm_map_bo_range(vm, bo, 0, obj->base.size, gpu_va, vm_map_flags);
+	if (ret) {
+		drm_gem_object_put(&obj->base);
+		return ERR_PTR(ret);
+	}
+
+	if (cpu_va) {
+		struct iosys_map map;
+		int ret;
+
+		ret = drm_gem_shmem_vmap(obj, &map);
+		if (ret) {
+			pancsf_vm_unmap_range(vm, *gpu_va, obj->base.size);
+			drm_gem_object_put(&obj->base);
+			return ERR_PTR(ret);
+		}
+
+		*cpu_va = map.vaddr;
+	}
+
+	return bo;
+}
+
+static int pancsf_gem_pin(struct drm_gem_object *obj)
+{
+	struct pancsf_gem_object *bo = to_pancsf_bo(obj);
+
+	return drm_gem_shmem_pin(&bo->base);
+}
+
+static const struct drm_gem_object_funcs pancsf_gem_funcs = {
+	.free = pancsf_gem_free_object,
+	.print_info = drm_gem_shmem_object_print_info,
+	.pin = pancsf_gem_pin,
+	.unpin = drm_gem_shmem_object_unpin,
+	.get_sg_table = drm_gem_shmem_object_get_sg_table,
+	.vmap = drm_gem_shmem_object_vmap,
+	.vunmap = drm_gem_shmem_object_vunmap,
+	.mmap = drm_gem_shmem_object_mmap,
+	.vm_ops = &drm_gem_shmem_vm_ops,
+};
+
+/**
+ * pancsf_gem_create_object - Implementation of driver->gem_create_object.
+ * @dev: DRM device
+ * @size: Size in bytes of the memory the object will reference
+ *
+ * This lets the GEM helpers allocate object structs for us, and keep
+ * our BO stats correct.
+ */
+struct drm_gem_object *pancsf_gem_create_object(struct drm_device *ddev, size_t size)
+{
+	struct pancsf_device *pfdev = ddev->dev_private;
+	struct pancsf_gem_object *obj;
+
+	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	obj->base.base.funcs = &pancsf_gem_funcs;
+	obj->base.map_wc = !pfdev->coherent;
+
+	return &obj->base.base;
+}
+
+struct pancsf_gem_object *
+pancsf_gem_create_with_handle(struct drm_file *file,
+			      struct drm_device *ddev, size_t size,
+			      u32 flags, u32 *handle)
+{
+	int ret;
+	struct drm_gem_shmem_object *shmem;
+	struct pancsf_gem_object *bo;
+
+	shmem = drm_gem_shmem_create(ddev, size);
+	if (IS_ERR(shmem))
+		return ERR_CAST(shmem);
+
+	bo = to_pancsf_bo(&shmem->base);
+
+	/*
+	 * Allocate a handle in the file's handle table; this ID is what
+	 * userspace will use to refer to the object.
+	 */
+	ret = drm_gem_handle_create(file, &shmem->base, handle);
+	/* drop reference from allocate - handle holds it now. */
+	drm_gem_object_put(&shmem->base);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return bo;
+}
+
+struct drm_gem_object *
+pancsf_gem_prime_import_sg_table(struct drm_device *ddev,
+				 struct dma_buf_attachment *attach,
+				 struct sg_table *sgt)
+{
+	return drm_gem_shmem_prime_import_sg_table(ddev, attach, sgt);
+}
diff --git a/drivers/gpu/drm/pancsf/pancsf_gem.h b/drivers/gpu/drm/pancsf/pancsf_gem.h
new file mode 100644
index 000000000000..399b1336c1ea
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_gem.h
@@ -0,0 +1,45 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANCSF_GEM_H__
+#define __PANCSF_GEM_H__
+
+#include <drm/drm_gem_shmem_helper.h>
+#include <drm/drm_mm.h>
+
+#include <linux/rwsem.h>
+
+struct pancsf_vm;
+
+struct pancsf_gem_object {
+	struct drm_gem_shmem_object base;
+};
+
+static inline
+struct pancsf_gem_object *to_pancsf_bo(struct drm_gem_object *obj)
+{
+	return container_of(to_drm_gem_shmem_obj(obj), struct pancsf_gem_object, base);
+}
+
+struct drm_gem_object *pancsf_gem_create_object(struct drm_device *ddev, size_t size);
+
+struct drm_gem_object *
+pancsf_gem_prime_import_sg_table(struct drm_device *ddev,
+				 struct dma_buf_attachment *attach,
+				 struct sg_table *sgt);
+
+struct pancsf_gem_object *
+pancsf_gem_create_with_handle(struct drm_file *file,
+			      struct drm_device *ddev, size_t size,
+			      u32 flags,
+			      u32 *handle);
+
+void pancsf_gem_unmap_and_put(struct pancsf_vm *vm, struct pancsf_gem_object *bo,
+			      u64 gpu_va, void *cpu_va);
+struct pancsf_gem_object *
+pancsf_gem_create_and_map(struct pancsf_device *pfdev, struct pancsf_vm *vm,
+			  size_t size, u32 bo_flags, u32 vm_map_flags,
+			  u64 *gpu_va, void **cpu_va);
+
+#endif /* __PANCSF_GEM_H__ */
diff --git a/drivers/gpu/drm/pancsf/pancsf_gpu.c b/drivers/gpu/drm/pancsf/pancsf_gpu.c
new file mode 100644
index 000000000000..63d9318db726
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_gpu.c
@@ -0,0 +1,381 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
+/* Copyright 2019 Collabora ltd. */
+
+#include <linux/bitfield.h>
+#include <linux/bitmap.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
+
+#include "pancsf_device.h"
+#include "pancsf_gpu.h"
+#include "pancsf_regs.h"
+
+#define MAX_HW_REVS 6
+
+struct pancsf_gpu {
+	int irq;
+	spinlock_t reqs_lock;
+	u32 pending_reqs;
+	wait_queue_head_t reqs_acked;
+};
+
+struct pancsf_model {
+	const char *name;
+	u32 id;
+};
+
+#define GPU_MODEL(_name, _id, ...) \
+{\
+	.name = __stringify(_name),				\
+	.id = _id,						\
+}
+
+#define GPU_MODEL_ID_MASK		0xf00f0000
+
+static const struct pancsf_model gpu_models[] = {
+	GPU_MODEL(g610, 0xa0070000),
+	{},
+};
+
+static void pancsf_gpu_init_info(struct pancsf_device *pfdev)
+{
+	const struct pancsf_model *model;
+	u32 major, minor, status;
+	unsigned int i;
+
+	pfdev->gpu_info.gpu_id = gpu_read(pfdev, GPU_ID);
+	pfdev->gpu_info.csf_id = gpu_read(pfdev, GPU_CSF_ID);
+	pfdev->gpu_info.gpu_rev = gpu_read(pfdev, GPU_REVID);
+	pfdev->gpu_info.l2_features = gpu_read(pfdev, GPU_L2_FEATURES);
+	pfdev->gpu_info.tiler_features = gpu_read(pfdev, GPU_TILER_FEATURES);
+	pfdev->gpu_info.mem_features = gpu_read(pfdev, GPU_MEM_FEATURES);
+	pfdev->gpu_info.mmu_features = gpu_read(pfdev, GPU_MMU_FEATURES);
+	pfdev->gpu_info.thread_features = gpu_read(pfdev, GPU_THREAD_FEATURES);
+	pfdev->gpu_info.max_threads = gpu_read(pfdev, GPU_THREAD_MAX_THREADS);
+	pfdev->gpu_info.thread_max_workgroup_size = gpu_read(pfdev, GPU_THREAD_MAX_WORKGROUP_SIZE);
+	pfdev->gpu_info.thread_max_barrier_size = gpu_read(pfdev, GPU_THREAD_MAX_BARRIER_SIZE);
+	pfdev->gpu_info.coherency_features = gpu_read(pfdev, GPU_COHERENCY_FEATURES);
+	for (i = 0; i < 4; i++)
+		pfdev->gpu_info.texture_features[i] = gpu_read(pfdev, GPU_TEXTURE_FEATURES(i));
+
+	pfdev->gpu_info.as_present = gpu_read(pfdev, GPU_AS_PRESENT);
+
+	pfdev->gpu_info.shader_present = gpu_read(pfdev, GPU_SHADER_PRESENT_LO);
+	pfdev->gpu_info.shader_present |= (u64)gpu_read(pfdev, GPU_SHADER_PRESENT_HI) << 32;
+
+	pfdev->gpu_info.tiler_present = gpu_read(pfdev, GPU_TILER_PRESENT_LO);
+	pfdev->gpu_info.tiler_present |= (u64)gpu_read(pfdev, GPU_TILER_PRESENT_HI) << 32;
+
+	pfdev->gpu_info.l2_present = gpu_read(pfdev, GPU_L2_PRESENT_LO);
+	pfdev->gpu_info.l2_present |= (u64)gpu_read(pfdev, GPU_L2_PRESENT_HI) << 32;
+	pfdev->gpu_info.core_group_count = hweight64(pfdev->gpu_info.l2_present);
+
+	major = (pfdev->gpu_info.gpu_id >> 12) & 0xf;
+	minor = (pfdev->gpu_info.gpu_id >> 4) & 0xff;
+	status = pfdev->gpu_info.gpu_id & 0xf;
+
+	for (model = gpu_models; model->name; model++) {
+		if (model->id == (pfdev->gpu_info.gpu_id & GPU_MODEL_ID_MASK))
+			break;
+	}
+
+	dev_info(pfdev->dev, "mali-%s id 0x%x major 0x%x minor 0x%x status 0x%x",
+		 model->name ?: "unknown", pfdev->gpu_info.gpu_id >> 16,
+		 major, minor, status);
+
+	dev_info(pfdev->dev, "Features: L2:0x%08x Tiler:0x%08x Mem:0x%0x MMU:0x%08x AS:0x%x",
+		 pfdev->gpu_info.l2_features,
+		 pfdev->gpu_info.tiler_features,
+		 pfdev->gpu_info.mem_features,
+		 pfdev->gpu_info.mmu_features,
+		 pfdev->gpu_info.as_present);
+
+	dev_info(pfdev->dev, "shader_present=0x%0llx l2_present=0x%0llx tiler_present=0x%0llx",
+		 pfdev->gpu_info.shader_present, pfdev->gpu_info.l2_present,
+		 pfdev->gpu_info.tiler_present);
+}
+
+static irqreturn_t pancsf_gpu_irq_handler(int irq, void *data)
+{
+	struct pancsf_device *pfdev = data;
+	u32 state = gpu_read(pfdev, GPU_INT_STAT);
+
+	if (!state)
+		return IRQ_NONE;
+
+	if (state & (GPU_IRQ_FAULT | GPU_IRQ_PROTM_FAULT)) {
+		u32 fault_status = gpu_read(pfdev, GPU_FAULT_STATUS);
+		u64 address = ((u64)gpu_read(pfdev, GPU_FAULT_ADDR_HI) << 32) |
+			      gpu_read(pfdev, GPU_FAULT_ADDR_LO);
+
+		dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
+			 fault_status, pancsf_exception_name(fault_status & 0xFF),
+			 address);
+	}
+
+	spin_lock(&pfdev->gpu->reqs_lock);
+	if (state & pfdev->gpu->pending_reqs) {
+		pfdev->gpu->pending_reqs &= ~state;
+		wake_up_all(&pfdev->gpu->reqs_acked);
+	}
+	spin_unlock(&pfdev->gpu->reqs_lock);
+
+	gpu_write(pfdev, GPU_INT_CLEAR, state);
+	return IRQ_HANDLED;
+}
+
+void pancsf_gpu_fini(struct pancsf_device *pfdev)
+{
+	unsigned long flags;
+
+	gpu_write(pfdev, GPU_INT_MASK, 0);
+
+	if (pfdev->gpu->irq > 0)
+		synchronize_irq(pfdev->gpu->irq);
+
+	spin_lock_irqsave(&pfdev->gpu->reqs_lock, flags);
+	pfdev->gpu->pending_reqs = 0;
+	wake_up_all(&pfdev->gpu->reqs_acked);
+	spin_unlock_irqrestore(&pfdev->gpu->reqs_lock, flags);
+}
+
+int pancsf_gpu_init(struct pancsf_device *pfdev)
+{
+	struct pancsf_gpu *gpu;
+	u32 pa_bits;
+	int ret, irq;
+
+	gpu = devm_kzalloc(pfdev->dev, sizeof(*gpu), GFP_KERNEL);
+	if (!gpu)
+		return -ENOMEM;
+
+	spin_lock_init(&gpu->reqs_lock);
+	init_waitqueue_head(&gpu->reqs_acked);
+	pfdev->gpu = gpu;
+	pancsf_gpu_init_info(pfdev);
+
+	dma_set_max_seg_size(pfdev->dev, UINT_MAX);
+	pa_bits = GPU_MMU_FEATURES_PA_BITS(pfdev->gpu_info.mmu_features);
+	ret = dma_set_mask_and_coherent(pfdev->dev, DMA_BIT_MASK(pa_bits));
+	if (ret)
+		return ret;
+
+	gpu_write(pfdev, GPU_INT_CLEAR, ~0);
+	gpu_write(pfdev, GPU_INT_MASK,
+		  GPU_IRQ_FAULT |
+		  GPU_IRQ_PROTM_FAULT |
+		  GPU_IRQ_RESET_COMPLETED |
+		  GPU_IRQ_MCU_STATUS_CHANGED |
+		  GPU_IRQ_CLEAN_CACHES_COMPLETED);
+
+	irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "gpu");
+	gpu->irq = irq;
+	if (irq <= 0)
+		return -ENODEV;
+
+	ret = devm_request_irq(pfdev->dev, irq,
+			       pancsf_gpu_irq_handler,
+			       IRQF_SHARED, KBUILD_MODNAME "-gpu",
+			       pfdev);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
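+/* Power blocks expose a 64-bit core mask split across a _LO/_HI register
+ * pair. Wait for any pending power transition on the selected cores, issue
+ * the power-off request, then wait for the transition to settle.
+ */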
+int pancsf_gpu_block_power_off(struct pancsf_device *pfdev,
+			       const char *blk_name,
+			       u32 pwroff_reg, u32 pwrtrans_reg,
+			       u64 mask, u32 timeout_us)
+{
+	u32 val, i;
+	int ret;
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(pfdev->iomem + pwrtrans_reg + (i * 4),
+						 val, !(mask32 & val),
+						 100, timeout_us);
+		if (ret) {
+			dev_err(pfdev->dev, "timeout waiting on %s:%llx power transition",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	if (mask & GENMASK(31, 0))
+		gpu_write(pfdev, pwroff_reg, mask);
+
+	if (mask >> 32)
+		gpu_write(pfdev, pwroff_reg + 4, mask >> 32);
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(pfdev->iomem + pwrtrans_reg + (i * 4),
+						 val, !(mask32 & val),
+						 100, timeout_us);
+		if (ret) {
+			dev_err(pfdev->dev, "timeout waiting on %s:%llx power transition",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+int pancsf_gpu_block_power_on(struct pancsf_device *pfdev,
+			      const char *blk_name,
+			      u32 pwron_reg, u32 pwrtrans_reg,
+			      u32 rdy_reg, u64 mask, u32 timeout_us)
+{
+	u32 val, i;
+	int ret;
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(pfdev->iomem + pwrtrans_reg + (i * 4),
+						 val, !(mask32 & val),
+						 100, timeout_us);
+		if (ret) {
+			dev_err(pfdev->dev, "timeout waiting on %s:%llx power transition",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	if (mask & GENMASK(31, 0))
+		gpu_write(pfdev, pwron_reg, mask);
+
+	if (mask >> 32)
+		gpu_write(pfdev, pwron_reg + 4, mask >> 32);
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(pfdev->iomem + rdy_reg + (i * 4),
+						 val, (mask32 & val) == mask32,
+						 100, timeout_us);
+		if (ret) {
+			dev_err(pfdev->dev, "timeout waiting on %s:%llx readyness",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+int pancsf_gpu_l2_power_on(struct pancsf_device *pfdev)
+{
+	u64 core_mask = U64_MAX;
+
+	if (pfdev->gpu_info.l2_present != 1) {
+		/*
+		 * Only support one core group now.
+		 * ~(l2_present - 1) unsets all bits in l2_present except
+		 * the bottom bit. (l2_present - 2) has all the bits in
+		 * the first core group set. AND them together to generate
+		 * a mask of cores in the first core group.
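+		 *
+		 * For example, if l2_present were 0x5, ~(0x5 - 1) = ~0x4 and
+		 * (0x5 - 2) = 0x3, so core_mask = 0x3: only the cores below
+		 * the second L2 bit are kept.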
+		 */
+		core_mask = ~(pfdev->gpu_info.l2_present - 1) &
+			     (pfdev->gpu_info.l2_present - 2);
+		dev_info_once(pfdev->dev, "using only 1st core group (%lu cores from %lu)\n",
+			      hweight64(core_mask),
+			      hweight64(pfdev->gpu_info.shader_present));
+	}
+
+	return pancsf_gpu_power_on(pfdev, L2,
+				   pfdev->gpu_info.l2_present & core_mask,
+				   20000);
+}
+
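+/* GPU_CMD requests are tracked in gpu->pending_reqs: the IRQ handler clears
+ * the bit and wakes waiters, so a request that is still pending with no
+ * matching IRQ status after the wait is treated as a real timeout.
+ */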
+int pancsf_gpu_flush_caches(struct pancsf_device *pfdev,
+			    u32 l2, u32 lsc, u32 other)
+{
+	bool timedout = false;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pfdev->gpu->reqs_lock, flags);
+	if (!WARN_ON(pfdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
+		pfdev->gpu->pending_reqs |= GPU_IRQ_CLEAN_CACHES_COMPLETED;
+		gpu_write(pfdev, GPU_CMD, GPU_FLUSH_CACHES(l2, lsc, other));
+	}
+	spin_unlock_irqrestore(&pfdev->gpu->reqs_lock, flags);
+
+	if (!wait_event_timeout(pfdev->gpu->reqs_acked,
+				!(pfdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED),
+				msecs_to_jiffies(100))) {
+		spin_lock_irqsave(&pfdev->gpu->reqs_lock, flags);
+		if ((pfdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED) != 0 &&
+		    !(gpu_read(pfdev, GPU_INT_STAT) & GPU_IRQ_CLEAN_CACHES_COMPLETED))
+			timedout = true;
+		spin_unlock_irqrestore(&pfdev->gpu->reqs_lock, flags);
+	}
+
+	if (timedout) {
+		dev_err(pfdev->dev, "Flush caches timeout");
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+int pancsf_gpu_soft_reset(struct pancsf_device *pfdev)
+{
+	bool timedout = false;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pfdev->gpu->reqs_lock, flags);
+	if (!WARN_ON(pfdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED)) {
+		pfdev->gpu->pending_reqs |= GPU_IRQ_RESET_COMPLETED;
+		gpu_write(pfdev, GPU_CMD, GPU_SOFT_RESET);
+	}
+	spin_unlock_irqrestore(&pfdev->gpu->reqs_lock, flags);
+
+	if (!wait_event_timeout(pfdev->gpu->reqs_acked,
+				!(pfdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED),
+				msecs_to_jiffies(100))) {
+		spin_lock_irqsave(&pfdev->gpu->reqs_lock, flags);
+		if ((pfdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED) != 0 &&
+		    !(gpu_read(pfdev, GPU_INT_STAT) & GPU_IRQ_RESET_COMPLETED))
+			timedout = true;
+		spin_unlock_irqrestore(&pfdev->gpu->reqs_lock, flags);
+	}
+
+	gpu_write(pfdev, GPU_INT_MASK,
+		  GPU_IRQ_FAULT |
+		  GPU_IRQ_PROTM_FAULT |
+		  GPU_IRQ_RESET_COMPLETED |
+		  GPU_IRQ_MCU_STATUS_CHANGED |
+		  GPU_IRQ_CLEAN_CACHES_COMPLETED);
+
+	if (timedout) {
+		dev_err(pfdev->dev, "Soft reset timeout");
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/pancsf/pancsf_gpu.h b/drivers/gpu/drm/pancsf/pancsf_gpu.h
new file mode 100644
index 000000000000..1ee39b8b10b0
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_gpu.h
@@ -0,0 +1,40 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Collabora ltd. */
+
+#ifndef __PANCSF_GPU_H__
+#define __PANCSF_GPU_H__
+
+struct pancsf_device;
+
+int pancsf_gpu_init(struct pancsf_device *pfdev);
+void pancsf_gpu_fini(struct pancsf_device *pfdev);
+
+int pancsf_gpu_block_power_on(struct pancsf_device *pfdev,
+			      const char *blk_name,
+			      u32 pwron_reg, u32 pwrtrans_reg,
+			      u32 rdy_reg, u64 mask, u32 timeout_us);
+int pancsf_gpu_block_power_off(struct pancsf_device *pfdev,
+			       const char *blk_name,
+			       u32 pwroff_reg, u32 pwrtrans_reg,
+			       u64 mask, u32 timeout_us);
+
+#define pancsf_gpu_power_on(pfdev, type, mask, timeout_us) \
+	pancsf_gpu_block_power_on(pfdev, #type, \
+				    type ## _PWRON_LO, \
+				    type ## _PWRTRANS_LO, \
+				    type ## _READY_LO, \
+				    mask, timeout_us)
+
+#define pancsf_gpu_power_off(pfdev, type, mask, timeout_us) \
+	pancsf_gpu_block_power_off(pfdev, #type, \
+				     type ## _PWROFF_LO, \
+				     type ## _PWRTRANS_LO, \
+				     mask, timeout_us)
+
+int pancsf_gpu_l2_power_on(struct pancsf_device *pfdev);
+int pancsf_gpu_flush_caches(struct pancsf_device *pfdev,
+			    u32 l2, u32 lsc, u32 other);
+int pancsf_gpu_soft_reset(struct pancsf_device *pfdev);
+
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_heap.c b/drivers/gpu/drm/pancsf/pancsf_heap.c
new file mode 100644
index 000000000000..fb28c4cabe08
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_heap.c
@@ -0,0 +1,337 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/iosys-map.h>
+#include <linux/rwsem.h>
+
+#include <drm/pancsf_drm.h>
+
+#include "pancsf_device.h"
+#include "pancsf_gem.h"
+#include "pancsf_heap.h"
+#include "pancsf_mmu.h"
+
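+/*
+ * GPU/FW-visible layouts: the per-heap context descriptor handed to the
+ * firmware and the header located at the start of each tiler heap chunk.
+ * Unknown/unused fields are kept zero-initialized.
+ */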
+struct pancsf_heap_gpu_ctx {
+	u64 first_heap_chunk;
+	u32 unused1[2];
+	u32 vt_started_count;
+	u32 vt_completed_count;
+	u32 unused2;
+	u32 frag_completed_count;
+};
+
+struct pancsf_heap_chunk_header {
+	u64 next;
+	u32 unknown[14];
+};
+
+struct pancsf_heap_chunk {
+	struct list_head node;
+	struct pancsf_gem_object *bo;
+	u64 gpu_va;
+};
+
+struct pancsf_heap {
+	struct list_head chunks;
+	u32 chunk_size;
+	u32 max_chunks;
+	u32 target_in_flight;
+	u32 chunk_count;
+};
+
+#define MAX_HEAPS_PER_POOL    128
+
+struct pancsf_heap_pool {
+	struct pancsf_device *pfdev;
+	struct pancsf_vm *vm;
+	struct rw_semaphore lock;
+	struct xarray xa;
+	struct pancsf_gem_object *bo;
+	struct pancsf_heap_gpu_ctx *gpu_contexts;
+	u64 gpu_va;
+};
+
+static void pancsf_free_heap_chunk(struct pancsf_vm *vm,
+				   struct pancsf_heap_chunk *chunk)
+{
+	if (!chunk)
+		return;
+
+	list_del(&chunk->node);
+	pancsf_gem_unmap_and_put(vm, chunk->bo, chunk->gpu_va, NULL);
+	kfree(chunk);
+}
+
+static int pancsf_alloc_heap_chunk(struct pancsf_device *pfdev,
+				   struct pancsf_vm *vm,
+				   struct pancsf_heap *heap,
+				   bool link_with_prev)
+{
+	struct iosys_map map = IOSYS_MAP_INIT_VADDR(NULL);
+	struct pancsf_heap_chunk *chunk;
+	struct pancsf_heap_chunk_header *hdr;
+	int ret;
+
+	chunk = kmalloc(sizeof(*chunk), GFP_KERNEL);
+	if (!chunk)
+		return -ENOMEM;
+
+	chunk->bo = pancsf_gem_create_and_map(pfdev, vm, heap->chunk_size, 0,
+					      PANCSF_VMA_MAP_NOEXEC |
+					      PANCSF_VMA_MAP_AUTO_VA,
+					      &chunk->gpu_va,
+					      (void **)&hdr);
+	if (IS_ERR(chunk->bo)) {
+		ret = PTR_ERR(chunk->bo);
+		goto err_free_chunk;
+	}
+
+	memset(hdr, 0, sizeof(*hdr));
+
+	map.vaddr = hdr;
+	drm_gem_shmem_vunmap(&chunk->bo->base, &map);
+
+	if (link_with_prev && !list_empty(&heap->chunks)) {
+		struct pancsf_heap_chunk *prev_chunk;
+
+		prev_chunk = list_first_entry(&heap->chunks,
+					      struct pancsf_heap_chunk,
+					      node);
+
+		ret = drm_gem_shmem_vmap(&prev_chunk->bo->base, &map);
+		if (ret)
+			goto err_put_bo;
+
+		hdr = map.vaddr;
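+		/* The next field packs the page-aligned chunk GPU VA in its
+		 * upper bits and the chunk size, in 4k units, in bits 11:0.
+		 */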
+		hdr->next = (chunk->gpu_va & GENMASK_ULL(63, 12)) |
+			    (heap->chunk_size >> 12);
+
+		drm_gem_shmem_vunmap(&prev_chunk->bo->base, &map);
+	}
+
+	list_add(&chunk->node, &heap->chunks);
+	heap->chunk_count++;
+
+	return 0;
+
+err_put_bo:
+	drm_gem_object_put(&chunk->bo->base.base);
+err_free_chunk:
+	kfree(chunk);
+
+	return ret;
+}
+
+static void pancsf_free_heap_chunks(struct pancsf_vm *vm,
+				    struct pancsf_heap *heap)
+{
+	struct pancsf_heap_chunk *chunk, *tmp;
+
+	list_for_each_entry_safe(chunk, tmp, &heap->chunks, node)
+		pancsf_free_heap_chunk(vm, chunk);
+
+	heap->chunk_count = 0;
+}
+
+static int pancsf_alloc_heap_chunks(struct pancsf_device *pfdev,
+				    struct pancsf_vm *vm,
+				    struct pancsf_heap *heap,
+				    u32 chunk_count)
+{
+	int ret;
+	u32 i;
+
+	for (i = 0; i < chunk_count; i++) {
+		ret = pancsf_alloc_heap_chunk(pfdev,
+					      vm,
+					      heap, true);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int
+pancsf_heap_destroy_locked(struct pancsf_heap_pool *pool, u32 handle)
+{
+	struct pancsf_heap *heap = NULL;
+
+	heap = xa_erase(&pool->xa, handle);
+	if (!heap)
+		return -EINVAL;
+
+	pancsf_free_heap_chunks(pool->vm, heap);
+	kfree(heap);
+	return 0;
+}
+
+int pancsf_heap_destroy(struct pancsf_heap_pool *pool, u32 handle)
+{
+	int ret;
+
+	down_write(&pool->lock);
+	ret = pancsf_heap_destroy_locked(pool, handle);
+	up_write(&pool->lock);
+
+	return ret;
+}
+
+int pancsf_heap_create(struct pancsf_heap_pool *pool,
+		       u32 initial_chunk_count,
+		       u32 chunk_size,
+		       u32 max_chunks,
+		       u32 target_in_flight,
+		       u64 *heap_ctx_gpu_va,
+		       u64 *first_chunk_gpu_va)
+{
+	struct pancsf_heap *heap;
+	struct pancsf_heap_gpu_ctx *gpu_ctx;
+	struct pancsf_heap_chunk *first_chunk;
+	int ret = 0;
+	u32 id;
+
+	if (initial_chunk_count == 0)
+		return -EINVAL;
+
+	if (hweight32(chunk_size) != 1 ||
+	    chunk_size < SZ_256K || chunk_size > SZ_2M)
+		return -EINVAL;
+
+	heap = kzalloc(sizeof(*heap), GFP_KERNEL);
+	if (!heap)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&heap->chunks);
+	heap->chunk_size = chunk_size;
+	heap->max_chunks = max_chunks;
+	heap->target_in_flight = target_in_flight;
+
+	down_write(&pool->lock);
+	ret = xa_alloc(&pool->xa, &id, heap, XA_LIMIT(1, MAX_HEAPS_PER_POOL), GFP_KERNEL);
+	if (ret) {
+		kfree(heap);
+		goto out_unlock;
+	}
+
+	gpu_ctx = &pool->gpu_contexts[id];
+	memset(gpu_ctx, 0, sizeof(*gpu_ctx));
+
+	ret = pancsf_alloc_heap_chunks(pool->pfdev, pool->vm, heap,
+				       initial_chunk_count);
+	if (ret) {
+		pancsf_heap_destroy_locked(pool, id);
+		goto out_unlock;
+	}
+
+	*heap_ctx_gpu_va = pool->gpu_va + (sizeof(*pool->gpu_contexts) * id);
+
+	first_chunk = list_first_entry(&heap->chunks,
+				       struct pancsf_heap_chunk,
+				       node);
+	*first_chunk_gpu_va = first_chunk->gpu_va;
+	ret = id;
+
+out_unlock:
+	up_write(&pool->lock);
+	return ret;
+}
+
+int pancsf_heap_grow(struct pancsf_heap_pool *pool,
+		     u64 heap_gpu_va,
+		     u32 renderpasses_in_flight,
+		     u32 pending_frag_count,
+		     u64 *new_chunk_gpu_va)
+{
+	u64 heap_id = (heap_gpu_va - pool->gpu_va) /
+		      sizeof(struct pancsf_heap_gpu_ctx);
+	struct pancsf_heap_chunk *chunk;
+	struct pancsf_heap *heap;
+	int ret;
+
+	down_read(&pool->lock);
+	heap = xa_load(&pool->xa, heap_id);
+	if (!heap) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
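+	/* Refuse to grow when too many render passes are in flight or when
+	 * the chunk limit is hit while fragment work is still pending: -EBUSY
+	 * lets the caller retry once some of that work completes. -ENOMEM is
+	 * only returned when the heap is full and nothing is pending.
+	 */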
+	if (renderpasses_in_flight > heap->target_in_flight ||
+	    (pending_frag_count > 0 && heap->chunk_count >= heap->max_chunks)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	} else if (heap->chunk_count >= heap->max_chunks) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	ret = pancsf_alloc_heap_chunk(pool->pfdev, pool->vm, heap, false);
+	if (ret)
+		goto out_unlock;
+
+	chunk = list_first_entry(&heap->chunks,
+				 struct pancsf_heap_chunk,
+				 node);
+	*new_chunk_gpu_va = chunk->gpu_va;
+	ret = 0;
+
+out_unlock:
+	up_read(&pool->lock);
+	return ret;
+}
+
+void pancsf_heap_pool_destroy(struct pancsf_heap_pool *pool)
+{
+	struct pancsf_heap *heap;
+	unsigned long i;
+
+	if (IS_ERR_OR_NULL(pool))
+		return;
+
+	down_write(&pool->lock);
+	xa_for_each(&pool->xa, i, heap)
+		WARN_ON(pancsf_heap_destroy_locked(pool, i));
+
+	if (!IS_ERR_OR_NULL(pool->bo))
+		pancsf_gem_unmap_and_put(pool->vm, pool->bo, pool->gpu_va, pool->gpu_contexts);
+	up_write(&pool->lock);
+
+	pancsf_vm_put(pool->vm);
+	kfree(pool);
+}
+
+struct pancsf_heap_pool *
+pancsf_heap_pool_create(struct pancsf_device *pfdev, struct pancsf_vm *vm)
+{
+	size_t bosize = ALIGN(MAX_HEAPS_PER_POOL *
+			      sizeof(struct pancsf_heap_gpu_ctx),
+			      4096);
+	struct pancsf_heap_pool *pool;
+	int ret = 0;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	pool->pfdev = pfdev;
+	pool->vm = pancsf_vm_get(vm);
+	init_rwsem(&pool->lock);
+	xa_init_flags(&pool->xa, XA_FLAGS_ALLOC1);
+
+	pool->bo = pancsf_gem_create_and_map(pfdev, vm, bosize, 0,
+					     PANCSF_VMA_MAP_NOEXEC |
+					     PANCSF_VMA_MAP_AUTO_VA,
+					     &pool->gpu_va,
+					     (void **)&pool->gpu_contexts);
+	if (IS_ERR(pool->bo)) {
+		ret = PTR_ERR(pool->bo);
+		goto err_destroy_pool;
+	}
+
+	return pool;
+
+err_destroy_pool:
+	pancsf_heap_pool_destroy(pool);
+	return ERR_PTR(ret);
+}
diff --git a/drivers/gpu/drm/pancsf/pancsf_heap.h b/drivers/gpu/drm/pancsf/pancsf_heap.h
new file mode 100644
index 000000000000..ed395ab33b5e
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_heap.h
@@ -0,0 +1,30 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANCSF_HEAP_H__
+#define __PANCSF_HEAP_H__
+
+#include <linux/types.h>
+
+struct pancsf_device;
+struct pancsf_heap_pool;
+struct pancsf_vm;
+
+int pancsf_heap_create(struct pancsf_heap_pool *pool,
+		       u32 initial_chunk_count,
+		       u32 chunk_size,
+		       u32 max_chunks,
+		       u32 target_in_flight,
+		       u64 *heap_ctx_gpu_va,
+		       u64 *first_chunk_gpu_va);
+int pancsf_heap_destroy(struct pancsf_heap_pool *pool, u32 handle);
+struct pancsf_heap_pool *
+pancsf_heap_pool_create(struct pancsf_device *pfdev, struct pancsf_vm *vm);
+void pancsf_heap_pool_destroy(struct pancsf_heap_pool *pool);
+int pancsf_heap_grow(struct pancsf_heap_pool *pool,
+		     u64 heap_gpu_va,
+		     u32 renderpasses_in_flight,
+		     u32 pending_frag_count,
+		     u64 *new_chunk_gpu_va);
+
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_mcu.c b/drivers/gpu/drm/pancsf/pancsf_mcu.c
new file mode 100644
index 000000000000..f93f366e7fa1
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_mcu.c
@@ -0,0 +1,891 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/dma-mapping.h>
+#include <linux/firmware.h>
+#include <linux/iopoll.h>
+#include <linux/iosys-map.h>
+#include <linux/mutex.h>
+#include <linux/platform_device.h>
+
+#include "pancsf_device.h"
+#include "pancsf_gem.h"
+#include "pancsf_gpu.h"
+#include "pancsf_regs.h"
+#include "pancsf_mcu.h"
+#include "pancsf_mmu.h"
+#include "pancsf_sched.h"
+
+#define CSF_FW_NAME "mali_csffw.bin"
+
+struct pancsf_fw_mem {
+	struct drm_mm_node mm_node;
+	u32 num_pages;
+	struct page **pages;
+	struct sg_table sgt;
+	void *kmap;
+};
+
+struct pancsf_fw_hdr {
+	u32 magic;
+	u8 minor;
+	u8 major;
+	u16 padding1;
+	u32 version_hash;
+	u32 padding2;
+	u32 size;
+};
+
+enum pancsf_fw_entry_type {
+	CSF_FW_ENTRY_TYPE_IFACE = 0,
+	CSF_FW_ENTRY_TYPE_CONFIG = 1,
+	CSF_FW_ENTRY_TYPE_FUTF_TEST = 2,
+	CSF_FW_ENTRY_TYPE_TRACE_BUFFER = 3,
+	CSF_FW_ENTRY_TYPE_TIMELINE_METADATA = 4,
+};
+
+#define CSF_FW_ENTRY_TYPE(ehdr)			((ehdr) & 0xff)
+#define CSF_FW_ENTRY_SIZE(ehdr)			(((ehdr) >> 8) & 0xff)
+#define CSF_FW_ENTRY_UPDATE			BIT(30)
+#define CSF_FW_ENTRY_OPTIONAL			BIT(31)
+
+#define CSF_FW_IFACE_ENTRY_RD			BIT(0)
+#define CSF_FW_IFACE_ENTRY_WR			BIT(1)
+#define CSF_FW_IFACE_ENTRY_EX			BIT(2)
+#define CSF_FW_IFACE_ENTRY_CACHE_MODE_NONE	(0 << 3)
+#define CSF_FW_IFACE_ENTRY_CACHE_MODE_CACHED	(1 << 3)
+#define CSF_FW_IFACE_ENTRY_CACHE_MODE_UNCACHED_COHERENT (2 << 3)
+#define CSF_FW_IFACE_ENTRY_CACHE_MODE_CACHED_COHERENT (3 << 3)
+#define CSF_FW_IFACE_ENTRY_CACHE_MODE_MASK	GENMASK(4, 3)
+#define CSF_FW_IFACE_ENTRY_PROT			BIT(5)
+#define CSF_FW_IFACE_ENTRY_SHARED		BIT(30)
+#define CSF_FW_IFACE_ENTRY_ZERO			BIT(31)
+
+#define CSF_FW_IFACE_ENTRY_SUPPORTED_FLAGS      \
+	(CSF_FW_IFACE_ENTRY_RD |		\
+	 CSF_FW_IFACE_ENTRY_WR |		\
+	 CSF_FW_IFACE_ENTRY_EX |		\
+	 CSF_FW_IFACE_ENTRY_CACHE_MODE_MASK |	\
+	 CSF_FW_IFACE_ENTRY_PROT |		\
+	 CSF_FW_IFACE_ENTRY_SHARED  |		\
+	 CSF_FW_IFACE_ENTRY_ZERO)
+
+struct pancsf_fw_section_entry_hdr {
+	u32 flags;
+	struct {
+		u32 start;
+		u32 end;
+	} va;
+	struct {
+		u32 start;
+		u32 end;
+	} data;
+};
+
+struct pancsf_fw_iter {
+	const void *data;
+	size_t size;
+	size_t offset;
+};
+
+struct pancsf_fw_section {
+	struct list_head node;
+	u32 flags;
+	struct pancsf_fw_mem *mem;
+	const char *name;
+	/* Keep data around so we can reload writeable sections after an MCU
+	 * reset.
+	 */
+	struct {
+		const void *buf;
+		size_t size;
+	} data;
+};
+
+#define CSF_MCU_SHARED_REGION_START		0x04000000ULL
+#define CSF_MCU_SHARED_REGION_END		0x08000000ULL
+
+#define CSF_FW_HEADER_MAGIC			0xc3f13a6e
+#define CSF_FW_HEADER_MAJOR_MAX			0
+
+#define MIN_CS_PER_CSG				8
+#define MIN_CSGS				3
+#define MAX_CSG_PRIO				0xf
+
+#define CSF_IFACE_VERSION(major, minor, patch)	\
+	(((major) << 24) | ((minor) << 16) | (patch))
+#define CSF_IFACE_VERSION_MAJOR(v)		((v) >> 24)
+#define CSF_IFACE_VERSION_MINOR(v)		(((v) >> 16) & 0xff)
+#define CSF_IFACE_VERSION_PATCH(v)		((v) & 0xffff)
+
+#define CSF_GROUP_CONTROL_OFFSET		0x1000
+#define CSF_STREAM_CONTROL_OFFSET		0x40
+#define CSF_UNPRESERVED_REG_COUNT		4
+
+struct pancsf_mcu {
+	struct pancsf_vm *vm;
+	int as;
+
+	struct list_head sections;
+	struct pancsf_fw_section *shared_section;
+	struct pancsf_fw_iface iface;
+
+	bool booted;
+	wait_queue_head_t booted_event;
+
+	int job_irq;
+};
+
+static irqreturn_t pancsf_job_irq_handler(int irq, void *data)
+{
+	struct pancsf_device *pfdev = data;
+	irqreturn_t ret = IRQ_NONE;
+
+	while (true) {
+		u32 status = gpu_read(pfdev, JOB_INT_STAT);
+
+		if (!status)
+			break;
+
+		gpu_write(pfdev, JOB_INT_CLEAR, status);
+
+		if (!pfdev->mcu->booted) {
+			if (status & JOB_INT_GLOBAL_IF) {
+				pfdev->mcu->booted = true;
+				wake_up_all(&pfdev->mcu->booted_event);
+			}
+
+			return IRQ_HANDLED;
+		}
+
+		pancsf_sched_handle_job_irqs(pfdev, status);
+		ret = IRQ_HANDLED;
+	}
+
+	return ret;
+}
+
+static int pancsf_fw_iter_read(struct pancsf_device *pfdev,
+			       struct pancsf_fw_iter *iter,
+				     void *out, size_t size)
+{
+	size_t new_offset = iter->offset + size;
+
+	if (new_offset > iter->size || new_offset < iter->offset) {
+		dev_err(pfdev->dev, "Firmware too small\n");
+		return -EINVAL;
+	}
+
+	memcpy(out, iter->data + iter->offset, size);
+	iter->offset = new_offset;
+	return 0;
+}
+
+static void pancsf_fw_init_section_mem(struct pancsf_device *pfdev,
+				       struct pancsf_fw_section *section)
+{
+	size_t data_len = section->data.size;
+	size_t data_offs = 0;
+	u32 page;
+
+	for (page = 0; page < section->mem->num_pages; page++) {
+		void *mem = kmap_local_page(section->mem->pages[page]);
+		u32 copy_len = min_t(u32, PAGE_SIZE, data_len);
+
+		memcpy(mem, section->data.buf + data_offs, copy_len);
+		data_len -= copy_len;
+		data_offs += copy_len;
+
+		if (section->flags & CSF_FW_IFACE_ENTRY_ZERO)
+			memset(mem + copy_len, 0, PAGE_SIZE - copy_len);
+
+		kunmap_local(mem);
+	}
+}
+
+u64 pancsf_fw_mem_va(struct pancsf_fw_mem *mem)
+{
+	return mem->mm_node.start << PAGE_SHIFT;
+}
+
+void pancsf_fw_mem_vunmap(struct pancsf_fw_mem *mem)
+{
+	if (mem->kmap)
+		vunmap(mem->kmap);
+}
+
+void *pancsf_fw_mem_vmap(struct pancsf_fw_mem *mem, pgprot_t prot)
+{
+	if (!mem->kmap)
+		mem->kmap = vmap(mem->pages, mem->num_pages, VM_MAP, prot);
+
+	return mem->kmap;
+}
+
+void pancsf_fw_mem_free(struct pancsf_device *pfdev, struct pancsf_fw_mem *mem)
+{
+	unsigned int i;
+
+	if (IS_ERR_OR_NULL(mem))
+		return;
+
+	pancsf_fw_mem_vunmap(mem);
+
+	if (drm_mm_node_allocated(&mem->mm_node))
+		pancsf_vm_unmap_mcu_pages(pfdev->mcu->vm, &mem->mm_node);
+
+	dma_unmap_sgtable(pfdev->dev, &mem->sgt, DMA_BIDIRECTIONAL, 0);
+	sg_free_table(&mem->sgt);
+
+	for (i = 0; i < mem->num_pages; i++)
+		__free_page(mem->pages[i]);
+
+	kfree(mem->pages);
+	kfree(mem);
+}
+
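+/* Allocate physical pages for a FW memory region, DMA-map them and map them
+ * in the MCU VM somewhere in the [mcu_va_start, mcu_va_end) range with the
+ * given IOMMU @prot flags.
+ */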
+struct pancsf_fw_mem *
+pancsf_fw_mem_alloc(struct pancsf_device *pfdev,
+		    unsigned int num_pages,
+		    u32 mcu_va_start, u32 mcu_va_end,
+		    int prot)
+{
+	struct pancsf_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	unsigned int allocated_pages;
+	int ret;
+
+	if (!mem)
+		return ERR_PTR(-ENOMEM);
+
+	mem->pages = kcalloc(num_pages, sizeof(*mem->pages), GFP_KERNEL);
+	if (!mem->pages) {
+		ret = -ENOMEM;
+		goto err_free_mem;
+	}
+
+	allocated_pages = alloc_pages_bulk_array(GFP_KERNEL, num_pages, mem->pages);
+	mem->num_pages = allocated_pages;
+	if (num_pages != allocated_pages) {
+		ret = -ENOMEM;
+		goto err_free_mem;
+	}
+
+	ret = sg_alloc_table_from_pages(&mem->sgt, mem->pages, num_pages,
+					0, num_pages << PAGE_SHIFT, GFP_KERNEL);
+	if (ret) {
+		ret = -ENOMEM;
+		goto err_free_mem;
+	}
+
+	ret = dma_map_sgtable(pfdev->dev, &mem->sgt, DMA_BIDIRECTIONAL, 0);
+	if (ret)
+		goto err_free_mem;
+
+	ret = pancsf_vm_map_mcu_pages(pfdev->mcu->vm, &mem->mm_node,
+				      &mem->sgt, num_pages,
+				      mcu_va_start, mcu_va_end,
+				      prot);
+	if (ret)
+		goto err_free_mem;
+
+	return mem;
+
+err_free_mem:
+	pancsf_fw_mem_free(pfdev, mem);
+	return ERR_PTR(ret);
+}
+
+struct pancsf_fw_mem *pancsf_fw_alloc_queue_iface_mem(struct pancsf_device *pfdev)
+{
+	return pancsf_fw_mem_alloc(pfdev, 2,
+				   CSF_MCU_SHARED_REGION_START,
+				   CSF_MCU_SHARED_REGION_END,
+				   IOMMU_READ | IOMMU_WRITE);
+}
+
+struct pancsf_fw_mem *
+pancsf_fw_alloc_suspend_buf_mem(struct pancsf_device *pfdev, size_t size)
+{
+	size_t page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+
+	if (!page_count)
+		return NULL;
+
+	return pancsf_fw_mem_alloc(pfdev, page_count,
+				   CSF_MCU_SHARED_REGION_START,
+				   CSF_MCU_SHARED_REGION_END,
+				   IOMMU_READ | IOMMU_WRITE |
+				   IOMMU_NOEXEC | IOMMU_CACHE);
+}
+
+static int pancsf_fw_load_section_entry(struct pancsf_device *pfdev,
+					const struct firmware *fw,
+					struct pancsf_fw_iter *iter,
+					u32 ehdr)
+{
+	struct pancsf_fw_section_entry_hdr hdr;
+	struct pancsf_fw_section *section;
+	u32 name_len, num_pages;
+	int ret;
+
+	ret = pancsf_fw_iter_read(pfdev, iter, &hdr, sizeof(hdr));
+	if (ret)
+		return ret;
+
+	if (hdr.data.end < hdr.data.start) {
+		dev_err(pfdev->dev, "Firmware corrupted, data.end < data.start (0x%x < 0x%x)\n",
+			hdr.data.end, hdr.data.start);
+		return -EINVAL;
+	}
+
+	if (hdr.va.end < hdr.va.start) {
+		dev_err(pfdev->dev, "Firmware corrupted, hdr.va.end < hdr.va.start (0x%x < 0x%x)\n",
+			hdr.va.end, hdr.va.start);
+		return -EINVAL;
+	}
+
+	if (hdr.data.end > fw->size) {
+		dev_err(pfdev->dev, "Firmware corrupted, file truncated? data_end=0x%x > fw size=0x%zx\n",
+			hdr.data.end, fw->size);
+		return -EINVAL;
+	}
+
+	if ((hdr.va.start & ~PAGE_MASK) != 0 ||
+	    (hdr.va.end & ~PAGE_MASK) != 0) {
+		dev_err(pfdev->dev, "Firmware corrupted, virtual addresses not page aligned: 0x%x-0x%x\n",
+			hdr.va.start, hdr.va.end);
+		return -EINVAL;
+	}
+
+	if (hdr.flags & ~CSF_FW_IFACE_ENTRY_SUPPORTED_FLAGS) {
+		dev_err(pfdev->dev, "Firmware contains interface with unsupported flags (0x%x)\n",
+			hdr.flags);
+		return -EINVAL;
+	}
+
+	if (hdr.flags & CSF_FW_IFACE_ENTRY_PROT) {
+		dev_warn(pfdev->dev,
+			 "Firmware protected mode entry not be supported, ignoring");
+		return 0;
+	}
+
+	if (hdr.va.start == CSF_MCU_SHARED_REGION_START &&
+	    !(hdr.flags & CSF_FW_IFACE_ENTRY_SHARED)) {
+		dev_err(pfdev->dev,
+			"Interface at 0x%llx must be shared", CSF_MCU_SHARED_REGION_START);
+		return -EINVAL;
+	}
+
+	name_len = iter->size - iter->offset;
+
+	section = devm_kzalloc(pfdev->dev, sizeof(*section), GFP_KERNEL);
+	if (!section)
+		return -ENOMEM;
+
+	section->flags = hdr.flags;
+	section->data.size = hdr.data.end - hdr.data.start;
+
+	if (section->data.size > 0) {
+		void *data = devm_kmalloc(pfdev->dev, section->data.size, GFP_KERNEL);
+
+		if (!data)
+			return -ENOMEM;
+
+		memcpy(data, fw->data + hdr.data.start, section->data.size);
+		section->data.buf = data;
+	}
+
+	if (name_len > 0) {
+		char *name = devm_kmalloc(pfdev->dev, name_len + 1, GFP_KERNEL);
+
+		if (!name)
+			return -ENOMEM;
+
+		memcpy(name, iter->data + iter->offset, name_len);
+		name[name_len] = '\0';
+		section->name = name;
+	}
+
+	num_pages = (hdr.va.end - hdr.va.start) >> PAGE_SHIFT;
+	if (num_pages > 0) {
+		u32 cache_mode = hdr.flags & CSF_FW_IFACE_ENTRY_CACHE_MODE_MASK;
+		int prot = 0;
+
+		if (hdr.flags & CSF_FW_IFACE_ENTRY_RD)
+			prot |= IOMMU_READ;
+
+		if (hdr.flags & CSF_FW_IFACE_ENTRY_WR)
+			prot |= IOMMU_WRITE;
+
+		if (!(hdr.flags & CSF_FW_IFACE_ENTRY_EX))
+			prot |= IOMMU_NOEXEC;
+
+		/* TODO: CSF_FW_IFACE_ENTRY_CACHE_MODE_*_COHERENT are mapped to
+		 * non-cacheable for now. We might want to introduce a new
+		 * IOMMU_xxx flag (or abuse IOMMU_MMIO, which maps to device
+		 * memory and is currently not used by our driver) for
+		 * AS_MEMATTR_AARCH64_SHARED memory, so we can take benefit
+		 * from IO-coherent systems.
+		 */
+		if (cache_mode == CSF_FW_IFACE_ENTRY_CACHE_MODE_CACHED)
+			prot |= IOMMU_CACHE;
+
+		section->mem = pancsf_fw_mem_alloc(pfdev, num_pages,
+						   hdr.va.start, hdr.va.end, prot);
+		if (IS_ERR(section->mem))
+			return PTR_ERR(section->mem);
+
+		pancsf_fw_init_section_mem(pfdev, section);
+
+		dma_sync_sgtable_for_device(pfdev->dev, &section->mem->sgt, DMA_TO_DEVICE);
+
+		if (section->flags & CSF_FW_IFACE_ENTRY_SHARED) {
+			pgprot_t kmap_prot = PAGE_KERNEL;
+
+			if (cache_mode != CSF_FW_IFACE_ENTRY_CACHE_MODE_CACHED)
+				kmap_prot = pgprot_writecombine(kmap_prot);
+
+			if (!pancsf_fw_mem_vmap(section->mem, kmap_prot))
+				return -ENOMEM;
+		}
+	}
+
+	if (hdr.va.start == CSF_MCU_SHARED_REGION_START)
+		pfdev->mcu->shared_section = section;
+
+	list_add_tail(&section->node, &pfdev->mcu->sections);
+	return 0;
+}
+
+static void
+pancsf_reload_fw_sections(struct pancsf_device *pfdev, bool full_reload)
+{
+	struct pancsf_fw_section *section;
+
+	list_for_each_entry(section, &pfdev->mcu->sections, node) {
+		if (!full_reload && !(section->flags & CSF_FW_IFACE_ENTRY_WR))
+			continue;
+
+		pancsf_fw_init_section_mem(pfdev, section);
+		dma_sync_sgtable_for_device(pfdev->dev, &section->mem->sgt, DMA_TO_DEVICE);
+	}
+}
+
+static int pancsf_fw_load_entry(struct pancsf_device *pfdev,
+				const struct firmware *fw,
+				struct pancsf_fw_iter *iter)
+{
+	struct pancsf_fw_iter eiter;
+	u32 ehdr;
+	int ret;
+
+	ret = pancsf_fw_iter_read(pfdev, iter, &ehdr, sizeof(ehdr));
+	if (ret)
+		return ret;
+
+	if ((iter->offset % sizeof(u32)) ||
+	    (CSF_FW_ENTRY_SIZE(ehdr) % sizeof(u32))) {
+		dev_err(pfdev->dev, "Firmware entry isn't 32 bit aligned, offset=0x%x size=0x%x\n",
+			(u32)(iter->offset - sizeof(u32)), CSF_FW_ENTRY_SIZE(ehdr));
+		return -EINVAL;
+	}
+
+	eiter.offset = 0;
+	eiter.data = iter->data + iter->offset;
+	eiter.size = CSF_FW_ENTRY_SIZE(ehdr) - sizeof(ehdr);
+	iter->offset += eiter.size;
+
+	switch (CSF_FW_ENTRY_TYPE(ehdr)) {
+	case CSF_FW_ENTRY_TYPE_IFACE:
+		return pancsf_fw_load_section_entry(pfdev, fw, &eiter, ehdr);
+
+	/* FIXME: handle those entry types? */
+	case CSF_FW_ENTRY_TYPE_CONFIG:
+	case CSF_FW_ENTRY_TYPE_FUTF_TEST:
+	case CSF_FW_ENTRY_TYPE_TRACE_BUFFER:
+	case CSF_FW_ENTRY_TYPE_TIMELINE_METADATA:
+		return 0;
+	default:
+		break;
+	}
+
+	if (ehdr & CSF_FW_ENTRY_OPTIONAL)
+		return 0;
+
+	dev_err(pfdev->dev,
+		"Unsupported non-optional entry type %u in firmware\n",
+		CSF_FW_ENTRY_TYPE(ehdr));
+	return -EINVAL;
+}
+
+static int pancsf_fw_init(struct pancsf_device *pfdev)
+{
+	const struct firmware *fw = NULL;
+	struct pancsf_fw_iter iter = {};
+	struct pancsf_fw_hdr hdr;
+	int ret;
+
+	ret = request_firmware(&fw, CSF_FW_NAME, pfdev->dev);
+	if (ret) {
+		dev_err(pfdev->dev, "Failed to load firmware image '%s'\n",
+			CSF_FW_NAME);
+		return ret;
+	}
+
+	iter.data = fw->data;
+	iter.size = fw->size;
+	ret = pancsf_fw_iter_read(pfdev, &iter, &hdr, sizeof(hdr));
+	if (ret)
+		goto out;
+
+	if (hdr.magic != CSF_FW_HEADER_MAGIC) {
+		ret = -EINVAL;
+		dev_err(pfdev->dev, "Invalid firmware magic\n");
+		goto out;
+	}
+
+	if (hdr.major != CSF_FW_HEADER_MAJOR_MAX) {
+		ret = -EINVAL;
+		dev_err(pfdev->dev, "Unsupported firmware header version %d.%d (expected %d.x)\n",
+			hdr.major, hdr.minor, CSF_FW_HEADER_MAJOR_MAX);
+		goto out;
+	}
+
+	if (hdr.size > iter.size) {
+		ret = -EINVAL;
+		dev_err(pfdev->dev, "Firmware image is truncated\n");
+		goto out;
+	}
+
+	iter.size = hdr.size;
+
+	while (iter.offset < hdr.size) {
+		ret = pancsf_fw_load_entry(pfdev, fw, &iter);
+		if (ret)
+			goto out;
+	}
+
+	if (!pfdev->mcu->shared_section) {
+		dev_err(pfdev->dev, "Shared interface region not found\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	release_firmware(fw);
+	return ret;
+}
+
+static void *pancsf_mcu_to_cpu_addr(struct pancsf_device *pfdev, u32 mcu_va)
+{
+	u64 shared_mem_start = pfdev->mcu->shared_section->mem->mm_node.start << PAGE_SHIFT;
+	u64 shared_mem_end = (pfdev->mcu->shared_section->mem->mm_node.start +
+			      pfdev->mcu->shared_section->mem->mm_node.size) << PAGE_SHIFT;
+
+	if (mcu_va < shared_mem_start || mcu_va >= shared_mem_end)
+		return NULL;
+
+	return pfdev->mcu->shared_section->mem->kmap + (mcu_va - shared_mem_start);
+}
+
+static int pancsf_init_cs_iface(struct pancsf_device *pfdev,
+				unsigned int csg_idx, unsigned int cs_idx)
+{
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+	const struct pancsf_fw_csg_iface *csg_iface = pancsf_get_csg_iface(pfdev, csg_idx);
+	struct pancsf_fw_cs_iface *cs_iface = &pfdev->mcu->iface.groups[csg_idx].streams[cs_idx];
+	u64 shared_section_sz = pfdev->mcu->shared_section->mem->mm_node.size << PAGE_SHIFT;
+	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET +
+			   (csg_idx * glb_iface->control->group_stride) +
+			   CSF_STREAM_CONTROL_OFFSET +
+			   (cs_idx * csg_iface->control->stream_stride);
+
+	if (iface_offset + sizeof(struct pancsf_cs_control_iface) >= shared_section_sz)
+		return -EINVAL;
+
+	cs_iface->control = pfdev->mcu->shared_section->mem->kmap + iface_offset;
+	cs_iface->input = pancsf_mcu_to_cpu_addr(pfdev, cs_iface->control->input_va);
+	cs_iface->output = pancsf_mcu_to_cpu_addr(pfdev, cs_iface->control->output_va);
+
+	if (!cs_iface->input || !cs_iface->output) {
+		dev_err(pfdev->dev, "Invalid stream control interface input/output VA");
+		return -EINVAL;
+	}
+
+	if (csg_idx > 0 || cs_idx > 0) {
+		const struct pancsf_fw_cs_iface *first_cs_iface = pancsf_get_cs_iface(pfdev, 0, 0);
+
+		if (cs_iface->control->features != first_cs_iface->control->features) {
+			dev_err(pfdev->dev, "Expecting identical CS slots");
+			return -EINVAL;
+		}
+	} else {
+		u32 reg_count = CS_FEATURES_WORK_REGS(cs_iface->control->features);
+
+		pfdev->csif_info.cs_reg_count = reg_count;
+		pfdev->csif_info.unpreserved_cs_reg_count = CSF_UNPRESERVED_REG_COUNT;
+	}
+
+	return 0;
+}
+
+static int pancsf_init_csg_iface(struct pancsf_device *pfdev,
+				 unsigned int csg_idx)
+{
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+	struct pancsf_fw_csg_iface *csg_iface = &pfdev->mcu->iface.groups[csg_idx];
+	u64 shared_section_sz = pfdev->mcu->shared_section->mem->mm_node.size << PAGE_SHIFT;
+	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET + (csg_idx * glb_iface->control->group_stride);
+	unsigned int i;
+
+	if (iface_offset + sizeof(struct pancsf_csg_control_iface) >= shared_section_sz)
+		return -EINVAL;
+
+	csg_iface->control = pfdev->mcu->shared_section->mem->kmap + iface_offset;
+	csg_iface->input = pancsf_mcu_to_cpu_addr(pfdev, csg_iface->control->input_va);
+	csg_iface->output = pancsf_mcu_to_cpu_addr(pfdev, csg_iface->control->output_va);
+
+	if (csg_iface->control->stream_num < MIN_CS_PER_CSG ||
+	    csg_iface->control->stream_num > MAX_CS_PER_CSG)
+		return -EINVAL;
+
+	if (!csg_iface->input || !csg_iface->output) {
+		dev_err(pfdev->dev, "Invalid group control interface input/output VA");
+		return -EINVAL;
+	}
+
+	if (csg_idx > 0) {
+		const struct pancsf_fw_csg_iface *first_csg_iface = pancsf_get_csg_iface(pfdev, 0);
+		u32 first_protm_suspend_size = first_csg_iface->control->protm_suspend_size;
+
+		if (first_csg_iface->control->features != csg_iface->control->features ||
+		    first_csg_iface->control->suspend_size != csg_iface->control->suspend_size ||
+		    first_protm_suspend_size != csg_iface->control->protm_suspend_size ||
+		    first_csg_iface->control->stream_num != csg_iface->control->stream_num) {
+			dev_err(pfdev->dev, "Expecting identical CSG slots");
+			return -EINVAL;
+		}
+	}
+
+	for (i = 0; i < csg_iface->control->stream_num; i++) {
+		int ret = pancsf_init_cs_iface(pfdev, csg_idx, i);
+
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static u32 pancsf_get_instr_features(struct pancsf_device *pfdev)
+{
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+
+	if (glb_iface->control->version < CSF_IFACE_VERSION(1, 1, 0))
+		return 0;
+
+	return glb_iface->control->instr_features;
+}
+
+static int pancsf_init_ifaces(struct pancsf_device *pfdev)
+{
+	struct pancsf_fw_global_iface *glb_iface;
+	unsigned int i;
+
+	if (!pfdev->mcu->shared_section->mem->kmap)
+		return -EINVAL;
+
+	pfdev->iface = &pfdev->mcu->iface;
+	glb_iface = pancsf_get_glb_iface(pfdev);
+	glb_iface->control = pfdev->mcu->shared_section->mem->kmap;
+
+	if (!glb_iface->control->version) {
+		dev_err(pfdev->dev, "Invalid CSF interface version %d.%d.%d (%x)",
+			CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
+			CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
+			CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
+			glb_iface->control->version);
+		return -EINVAL;
+	}
+
+	glb_iface->input = pancsf_mcu_to_cpu_addr(pfdev, glb_iface->control->input_va);
+	glb_iface->output = pancsf_mcu_to_cpu_addr(pfdev, glb_iface->control->output_va);
+	if (!glb_iface->input || !glb_iface->output) {
+		dev_err(pfdev->dev, "Invalid global control interface input/output VA");
+		return -EINVAL;
+	}
+
+	if (glb_iface->control->group_num > MAX_CSGS ||
+	    glb_iface->control->group_num < MIN_CSGS) {
+		dev_err(pfdev->dev, "Invalid number of control groups");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < glb_iface->control->group_num; i++) {
+		int ret = pancsf_init_csg_iface(pfdev, i);
+
+		if (ret)
+			return ret;
+	}
+
+	dev_info(pfdev->dev, "CSF FW v%d.%d.%d, Features %x Instrumentation features %x",
+		 CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
+		 CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
+		 CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
+		 glb_iface->control->features,
+		 pancsf_get_instr_features(pfdev));
+	return 0;
+}
+
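+/* Boot the MCU: let it come up on its own (MCU_CONTROL_AUTO) and wait for
+ * the firmware to signal the global interface through the JOB IRQ.
+ */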
+static int pancsf_mcu_start(struct pancsf_device *pfdev)
+{
+	bool timedout = false;
+
+	pfdev->mcu->booted = false;
+	gpu_write(pfdev, JOB_INT_CLEAR, ~0);
+	gpu_write(pfdev, JOB_INT_MASK, ~0);
+	gpu_write(pfdev, MCU_CONTROL, MCU_CONTROL_AUTO);
+
+	if (!wait_event_timeout(pfdev->mcu->booted_event,
+				pfdev->mcu->booted,
+				msecs_to_jiffies(1000))) {
+		if (!pfdev->mcu->booted &&
+		    !(gpu_read(pfdev, JOB_INT_STAT) & JOB_INT_GLOBAL_IF))
+			timedout = true;
+	}
+
+	if (timedout) {
+		dev_err(pfdev->dev, "Failed to boot MCU");
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+static void pancsf_mcu_stop(struct pancsf_device *pfdev)
+{
+	u32 status;
+
+	gpu_write(pfdev, MCU_CONTROL, MCU_CONTROL_DISABLE);
+	if (readl_poll_timeout(pfdev->iomem + MCU_CONTROL, status,
+			       status == MCU_CONTROL_DISABLE, 10, 100000))
+		dev_err(pfdev->dev, "Failed to stop MCU");
+}
+
+int pancsf_mcu_reset(struct pancsf_device *pfdev, bool full_fw_reload)
+{
+	pfdev->mcu->as = pancsf_vm_as_get(pfdev->mcu->vm);
+	pancsf_reload_fw_sections(pfdev, full_fw_reload);
+
+	return pancsf_mcu_start(pfdev);
+}
+
+int pancsf_mcu_init(struct pancsf_device *pfdev)
+{
+	struct pancsf_mcu *mcu;
+	struct pancsf_fw_section *section;
+	int ret, irq;
+
+	mcu = devm_kzalloc(pfdev->dev, sizeof(*mcu), GFP_KERNEL);
+	if (!mcu)
+		return -ENOMEM;
+
+	pfdev->mcu = mcu;
+	init_waitqueue_head(&mcu->booted_event);
+	INIT_LIST_HEAD(&pfdev->mcu->sections);
+	mcu->as = -1;
+
+	gpu_write(pfdev, JOB_INT_MASK, 0);
+
+	irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
+	if (irq <= 0)
+		return -ENODEV;
+
+	mcu->job_irq = irq;
+	ret = devm_request_threaded_irq(pfdev->dev, irq,
+					NULL, pancsf_job_irq_handler,
+					IRQF_ONESHOT, KBUILD_MODNAME "-job",
+					pfdev);
+	if (ret) {
+		dev_err(pfdev->dev, "failed to request job irq");
+		return ret;
+	}
+
+	ret = pancsf_gpu_l2_power_on(pfdev);
+	if (ret)
+		return ret;
+
+	mcu->vm = pancsf_vm_create(pfdev, true);
+	if (IS_ERR(mcu->vm)) {
+		ret = PTR_ERR(mcu->vm);
+		mcu->vm = NULL;
+		goto err_l2_pwroff;
+	}
+
+	ret = pancsf_fw_init(pfdev);
+	if (ret)
+		goto err_free_sections;
+
+	mcu->as = pancsf_vm_as_get(mcu->vm);
+	if (WARN_ON(mcu->as != 0)) {
+		ret = -EINVAL;
+		goto err_free_sections;
+	}
+
+	ret = pancsf_mcu_start(pfdev);
+	if (ret)
+		goto err_put_as;
+
+	ret = pancsf_init_ifaces(pfdev);
+	if (ret)
+		goto err_stop_mcu;
+
+	return 0;
+
+err_stop_mcu:
+	pancsf_mcu_stop(pfdev);
+
+err_put_as:
+	pancsf_vm_as_put(mcu->vm);
+
+err_free_sections:
+	list_for_each_entry(section, &pfdev->mcu->sections, node)
+		pancsf_fw_mem_free(pfdev, section->mem);
+
+err_l2_pwroff:
+	pancsf_gpu_power_off(pfdev, L2,
+			     pfdev->gpu_info.l2_present,
+			     20000);
+
+	return ret;
+}
+
+void pancsf_mcu_fini(struct pancsf_device *pfdev)
+{
+	struct pancsf_fw_section *section;
+
+	if (!pfdev->mcu)
+		return;
+
+	gpu_write(pfdev, JOB_INT_MASK, 0);
+	synchronize_irq(pfdev->mcu->job_irq);
+
+	pancsf_mcu_stop(pfdev);
+
+	list_for_each_entry(section, &pfdev->mcu->sections, node)
+		pancsf_fw_mem_free(pfdev, section->mem);
+
+	if (pfdev->mcu->vm && pfdev->mcu->as == 0)
+		pancsf_vm_as_put(pfdev->mcu->vm);
+
+	pancsf_vm_put(pfdev->mcu->vm);
+
+	pancsf_gpu_power_off(pfdev, L2, pfdev->gpu_info.l2_present, 20000);
+}
+
+void pancsf_mcu_pre_reset(struct pancsf_device *pfdev)
+{
+	gpu_write(pfdev, JOB_INT_MASK, 0);
+	synchronize_irq(pfdev->mcu->job_irq);
+}
diff --git a/drivers/gpu/drm/pancsf/pancsf_mcu.h b/drivers/gpu/drm/pancsf/pancsf_mcu.h
new file mode 100644
index 000000000000..052665389302
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_mcu.h
@@ -0,0 +1,313 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANCSF_MCU_H__
+#define __PANCSF_MCU_H__
+
+#include <linux/types.h>
+
+#include "pancsf_device.h"
+
+struct pancsf_fw_mem;
+
+#define MAX_CSGS				31
+#define MAX_CS_PER_CSG                          32
+
+struct pancsf_ringbuf_input_iface {
+	u64 insert;
+	u64 extract;
+} __packed;
+
+struct pancsf_ringbuf_output_iface {
+	u64 extract;
+	u32 active;
+} __packed;
+
+struct pancsf_cs_control_iface {
+#define CS_FEATURES_WORK_REGS(x)		(((x) & GENMASK(7, 0)) + 1)
+#define CS_FEATURES_SCOREBOARDS(x)		(((x) & GENMASK(15, 8)) >> 8)
+#define CS_FEATURES_COMPUTE			BIT(16)
+#define CS_FEATURES_FRAGMENT			BIT(17)
+#define CS_FEATURES_TILER			BIT(18)
+	u32 features;
+	u32 input_va;
+	u32 output_va;
+} __packed;
+
+struct pancsf_cs_input_iface {
+#define CS_STATE_MASK				GENMASK(2, 0)
+#define CS_STATE_STOP				0
+#define CS_STATE_START				1
+#define CS_EXTRACT_EVENT			BIT(4)
+#define CS_IDLE_SYNC_WAIT			BIT(8)
+#define CS_IDLE_PROTM_PENDING			BIT(9)
+#define CS_IDLE_EMPTY				BIT(10)
+#define CS_IDLE_RESOURCE_REQ			BIT(11)
+#define CS_TILER_OOM				BIT(26)
+#define CS_PROTM_PENDING			BIT(27)
+#define CS_FATAL				BIT(30)
+#define CS_FAULT				BIT(31)
+
+	u32 req;
+
+#define CS_CONFIG_PRIORITY(x)			((x) & GENMASK(3, 0))
+#define CS_CONFIG_DOORBELL(x)			(((x) << 8) & GENMASK(15, 8))
+	u32 config;
+	u32 reserved1;
+	u32 ack_irq_mask;
+	u64 ringbuf_base;
+	u32 ringbuf_size;
+	u32 reserved2;
+	u64 heap_start;
+	u64 heap_end;
+	u64 ringbuf_input;
+	u64 ringbuf_output;
+	u32 instr_config;
+	u32 instrbuf_size;
+	u64 instrbuf_base;
+	u64 instrbuf_offset_ptr;
+} __packed;
+
+struct pancsf_cs_output_iface {
+	u32 ack;
+	u32 reserved1[15];
+	u64 status_cmd_ptr;
+
+#define CS_STATUS_WAIT_SB_MASK			GENMASK(15, 0)
+#define CS_STATUS_WAIT_SB_SRC_MASK		GENMASK(19, 16)
+#define CS_STATUS_WAIT_SB_SRC_NONE		(0 << 16)
+#define CS_STATUS_WAIT_SB_SRC_WAIT		(8 << 16)
+#define CS_STATUS_WAIT_SYNC_COND_LE		(0 << 24)
+#define CS_STATUS_WAIT_SYNC_COND_GT		(1 << 24)
+#define CS_STATUS_WAIT_SYNC_COND_MASK		GENMASK(27, 24)
+#define CS_STATUS_WAIT_PROGRESS			BIT(28)
+#define CS_STATUS_WAIT_PROTM			BIT(29)
+#define CS_STATUS_WAIT_SYNC_64B			BIT(30)
+#define CS_STATUS_WAIT_SYNC			BIT(31)
+	u32 status_wait;
+	u32 status_req_resource;
+	u64 status_wait_sync_ptr;
+	u32 status_wait_sync_value;
+	u32 status_scoreboards;
+
+#define CS_STATUS_BLOCKED_REASON_UNBLOCKED	0
+#define CS_STATUS_BLOCKED_REASON_SB_WAIT	1
+#define CS_STATUS_BLOCKED_REASON_PROGRESS_WAIT	2
+#define CS_STATUS_BLOCKED_REASON_SYNC_WAIT	3
+#define CS_STATUS_BLOCKED_REASON_DEFERRED	5
+#define CS_STATUS_BLOCKED_REASON_RES		6
+#define CS_STATUS_BLOCKED_REASON_FLUSH		7
+#define CS_STATUS_BLOCKED_REASON_MASK		GENMASK(3, 0)
+	u32 status_blocked_reason;
+	u32 status_wait_sync_value_hi;
+	u32 reserved2[6];
+
+#define CS_EXCEPTION_TYPE(x)			((x) & GENMASK(7, 0))
+#define CS_EXCEPTION_DATA(x)			(((x) >> 8) & GENMASK(23, 0))
+	u32 fault;
+	u32 fatal;
+	u64 fault_info;
+	u64 fatal_info;
+	u32 reserved3[10];
+	u32 heap_vt_start;
+	u32 heap_vt_end;
+	u32 reserved4;
+	u32 heap_frag_end;
+	u64 heap_address;
+} __packed;
+
+struct pancsf_csg_control_iface {
+	u32 features;
+	u32 input_va;
+	u32 output_va;
+	u32 suspend_size;
+	u32 protm_suspend_size;
+	u32 stream_num;
+	u32 stream_stride;
+} __packed;
+
+struct pancsf_csg_input_iface {
+#define CSG_STATE_MASK				GENMASK(2, 0)
+#define CSG_STATE_TERMINATE			0
+#define CSG_STATE_START				1
+#define CSG_STATE_SUSPEND			2
+#define CSG_STATE_RESUME			3
+#define CSG_ENDPOINT_CONFIG			BIT(4)
+#define CSG_STATUS_UPDATE			BIT(5)
+#define CSG_SYNC_UPDATE				BIT(28)
+#define CSG_IDLE				BIT(29)
+#define CSG_DOORBELL				BIT(30)
+#define CSG_PROGRESS_TIMER_EVENT		BIT(31)
+	u32 req;
+	u32 ack_irq_mask;
+
+	u32 doorbell_req;
+	u32 irq_ack;
+	u32 reserved1[4];
+	u64 allow_compute;
+	u64 allow_fragment;
+	u32 allow_other;
+
+#define CSG_EP_REQ_COMPUTE(x)			((x) & GENMASK(7, 0))
+#define CSG_EP_REQ_FRAGMENT(x)			(((x) << 8) & GENMASK(15, 8))
+#define CSG_EP_REQ_TILER(x)			(((x) << 16) & GENMASK(19, 16))
+#define CSG_EP_REQ_EXCL_COMPUTE			BIT(20)
+#define CSG_EP_REQ_EXCL_FRAGMENT		BIT(21)
+#define CSG_EP_REQ_PRIORITY(x)			(((x) << 28) & GENMASK(31, 28))
+#define CSG_EP_REQ_PRIORITY_MASK		GENMASK(31, 28)
+	u32 endpoint_req;
+	u32 reserved2[2];
+	u64 suspend_buf;
+	u64 protm_suspend_buf;
+	u32 config;
+	u32 iter_trace_config;
+} __packed;
+
+struct pancsf_csg_output_iface {
+	u32 ack;
+	u32 reserved1;
+	u32 doorbell_ack;
+	u32 irq_req;
+	u32 status_endpoint_current;
+	u32 status_endpoint_req;
+
+#define CSG_STATUS_STATE_IS_IDLE		BIT(0)
+	u32 status_state;
+	u32 resource_dep;
+} __packed;
+
+struct pancsf_global_control_iface {
+	u32 version;
+	u32 features;
+	u32 input_va;
+	u32 output_va;
+	u32 group_num;
+	u32 group_stride;
+	u32 perfcnt_size;
+	u32 instr_features;
+} __packed;
+
+struct pancsf_global_input_iface {
+#define GLB_HALT				BIT(0)
+#define GLB_CFG_PROGRESS_TIMER			BIT(1)
+#define GLB_CFG_ALLOC_EN			BIT(2)
+#define GLB_CFG_POWEROFF_TIMER			BIT(3)
+#define GLB_PROTM_ENTER				BIT(4)
+#define GLB_PERFCNT_EN				BIT(5)
+#define GLB_PERFCNT_SAMPLER			BIT(6)
+#define GLB_COUNTER_EN				BIT(7)
+#define GLB_PING				BIT(8)
+#define GLB_FWCFG_UPDATE			BIT(9)
+#define GLB_IDLE_EN				BIT(10)
+#define GLB_SLEEP				BIT(12)
+#define GLB_INACTIVE_COMPUTE			BIT(20)
+#define GLB_INACTIVE_FRAGMENT			BIT(21)
+#define GLB_INACTIVE_TILER			BIT(22)
+#define GLB_PROTM_EXIT				BIT(23)
+#define GLB_PERFCNT_THRESHOLD			BIT(24)
+#define GLB_PERFCNT_OVERFLOW			BIT(25)
+#define GLB_IDLE				BIT(26)
+#define GLB_DBG_CSF				BIT(30)
+#define GLB_DBG_HOST				BIT(31)
+	u32 req;
+	u32 ack_irq_mask;
+	u32 doorbell_req;
+	u32 reserved1;
+	u32 progress_timer;
+
+#define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
+#define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
+	u32 poweroff_timer;
+	u64 core_en_mask;
+	u32 reserved2;
+	u32 perfcnt_as;
+	u64 perfcnt_base;
+	u32 perfcnt_extract;
+	u32 reserved3[3];
+	u32 perfcnt_config;
+	u32 perfcnt_csg_select;
+	u32 perfcnt_fw_enable;
+	u32 perfcnt_csg_enable;
+	u32 perfcnt_csf_enable;
+	u32 perfcnt_shader_enable;
+	u32 perfcnt_tiler_enable;
+	u32 perfcnt_mmu_l2_enable;
+	u32 reserved4[8];
+	u32 idle_timer;
+} __packed;
+
+struct pancsf_global_output_iface {
+	u32 ack;
+	u32 reserved1;
+	u32 doorbell_ack;
+	u32 reserved2;
+	u32 halt_status;
+	u32 perfcnt_status;
+	u32 perfcnt_insert;
+} __packed;
+
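+/* Host <-> FW communication follows a req/ack scheme: a request is pending
+ * as long as the masked req bits differ from the corresponding ack bits.
+ * pancsf_toggle_reqs() flips the selected req bits relative to the current
+ * ack value to raise new requests; pancsf_update_reqs() just overwrites them.
+ */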
+static inline u32 pancsf_toggle_reqs(u32 cur_req_val, u32 ack_val, u32 req_mask)
+{
+	return ((ack_val ^ req_mask) & req_mask) | (cur_req_val & ~req_mask);
+}
+
+static inline u32 pancsf_update_reqs(u32 cur_req_val, u32 new_reqs, u32 req_mask)
+{
+	return (cur_req_val & ~req_mask) | (new_reqs & req_mask);
+}
+
+struct pancsf_fw_cs_iface {
+	struct pancsf_cs_control_iface *control;
+	struct pancsf_cs_input_iface *input;
+	const struct pancsf_cs_output_iface *output;
+};
+
+struct pancsf_fw_csg_iface {
+	const struct pancsf_csg_control_iface *control;
+	struct pancsf_csg_input_iface *input;
+	const struct pancsf_csg_output_iface *output;
+	struct pancsf_fw_cs_iface streams[MAX_CS_PER_CSG];
+};
+
+struct pancsf_fw_global_iface {
+	const struct pancsf_global_control_iface *control;
+	struct pancsf_global_input_iface *input;
+	const struct pancsf_global_output_iface *output;
+};
+
+struct pancsf_fw_iface {
+	struct pancsf_fw_global_iface global;
+	struct pancsf_fw_csg_iface groups[MAX_CSGS];
+};
+
+static inline struct pancsf_fw_global_iface *
+pancsf_get_glb_iface(struct pancsf_device *pfdev)
+{
+	return &pfdev->iface->global;
+}
+
+static inline struct pancsf_fw_csg_iface *
+pancsf_get_csg_iface(struct pancsf_device *pfdev, u32 csg_slot)
+{
+	return &pfdev->iface->groups[csg_slot];
+}
+
+static inline struct pancsf_fw_cs_iface *
+pancsf_get_cs_iface(struct pancsf_device *pfdev, u32 csg_slot, u32 cs_slot)
+{
+	return &pfdev->iface->groups[csg_slot].streams[cs_slot];
+}
+
+void pancsf_fw_mem_vunmap(struct pancsf_fw_mem *mem);
+void *pancsf_fw_mem_vmap(struct pancsf_fw_mem *mem, pgprot_t prot);
+u64 pancsf_fw_mem_va(struct pancsf_fw_mem *mem);
+void pancsf_fw_mem_free(struct pancsf_device *pfdev, struct pancsf_fw_mem *mem);
+struct pancsf_fw_mem *pancsf_fw_alloc_queue_iface_mem(struct pancsf_device *pfdev);
+struct pancsf_fw_mem *
+pancsf_fw_alloc_suspend_buf_mem(struct pancsf_device *pfdev, size_t size);
+
+void pancsf_mcu_pre_reset(struct pancsf_device *pfdev);
+int pancsf_mcu_reset(struct pancsf_device *pfdev, bool full_fw_reload);
+
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_mmu.c b/drivers/gpu/drm/pancsf/pancsf_mmu.c
new file mode 100644
index 000000000000..85341a0c6434
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_mmu.c
@@ -0,0 +1,1345 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#include <drm/pancsf_drm.h>
+#include <drm/gpu_scheduler.h>
+
+#include <linux/atomic.h>
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/io-pgtable.h>
+#include <linux/iommu.h>
+#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
+#include <linux/rwsem.h>
+#include <linux/shmem_fs.h>
+#include <linux/sizes.h>
+
+#include "pancsf_device.h"
+#include "pancsf_mmu.h"
+#include "pancsf_sched.h"
+#include "pancsf_gem.h"
+#include "pancsf_regs.h"
+
+#define mmu_write(dev, reg, data) writel(data, (dev)->iomem + (reg))
+#define mmu_read(dev, reg) readl((dev)->iomem + (reg))
+
+#define MM_COLOR_FRAG_SHADER		BIT(0)
+
+#define MAX_AS_SLOTS			32
+
+struct pancsf_vm;
+
+struct pancsf_mmu {
+	int irq;
+
+	struct {
+		struct mutex slots_lock;
+		unsigned long in_use_mask;
+		unsigned long alloc_mask;
+		unsigned long faulty_mask;
+		struct pancsf_vm *slots[MAX_AS_SLOTS];
+		struct list_head lru_list;
+		spinlock_t op_lock;
+	} as;
+};
+
+struct pancsf_vm_pool {
+	struct xarray xa;
+	struct mutex lock;
+};
+
+struct pancsf_vm {
+	struct pancsf_device *pfdev;
+	struct kref refcount;
+	u64 memattr;
+	struct io_pgtable_cfg pgtbl_cfg;
+	struct io_pgtable_ops *pgtbl_ops;
+
+	/* VM reservation object. All private BOs should use this resv
+	 * instead of the GEM object one.
+	 */
+	struct dma_resv resv;
+	struct mutex lock;
+	struct drm_mm mm;
+	int as;
+	atomic_t as_count;
+	bool for_mcu;
+	struct list_head node;
+};
+
+struct pancsf_vma {
+	struct drm_mm_node vm_mm_node;
+	struct rb_node bo_vma_node;
+	struct pancsf_gem_object *bo;
+	struct pancsf_vm *vm;
+	u64 offset;
+	u32 flags;
+	bool mapped;
+};
+
+static int wait_ready(struct pancsf_device *pfdev, u32 as_nr)
+{
+	int ret;
+	u32 val;
+
+	/* Wait for the MMU status to indicate there is no active command, in
+	 * case one is pending.
+	 */
+	ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
+						val, !(val & AS_STATUS_AS_ACTIVE),
+						10, 100000);
+
+	if (ret) {
+		/* The GPU hung, let's trigger a reset */
+		pancsf_sched_queue_reset(pfdev);
+		dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+	}
+
+	return ret;
+}
+
+static int write_cmd(struct pancsf_device *pfdev, u32 as_nr, u32 cmd)
+{
+	int status;
+
+	/* write AS_COMMAND when MMU is ready to accept another command */
+	status = wait_ready(pfdev, as_nr);
+	if (!status)
+		mmu_write(pfdev, AS_COMMAND(as_nr), cmd);
+
+	return status;
+}
+
+static void lock_region(struct pancsf_device *pfdev, u32 as_nr,
+			u64 region_start, u64 size)
+{
+	u8 region_width;
+	u64 region;
+	u64 region_end = region_start + size;
+
+	if (!size)
+		return;
+
+	/*
+	 * The locked region is a naturally aligned power of 2 block encoded as
+	 * its log2 minus 1.
+	 * Calculate the desired start/end and look for the highest bit which
+	 * differs. The smallest naturally aligned block must include this bit
+	 * change, the desired region starts with this bit (and subsequent bits)
+	 * zeroed and ends with the bit (and subsequent bits) set to one.
+	 */
+	region_width = max(fls64(region_start ^ (region_end - 1)),
+			   const_ilog2(AS_LOCK_REGION_MIN_SIZE)) - 1;
+
+	/*
+	 * Mask off the low bits of region_start (which would be ignored by
+	 * the hardware anyway)
+	 */
+	region_start &= GENMASK_ULL(63, region_width);
+
+	region = region_width | region_start;
+
+	/* Lock the region that needs to be updated */
+	mmu_write(pfdev, AS_LOCKADDR_LO(as_nr), lower_32_bits(region));
+	mmu_write(pfdev, AS_LOCKADDR_HI(as_nr), upper_32_bits(region));
+	write_cmd(pfdev, as_nr, AS_COMMAND_LOCK);
+}
+
+static int mmu_hw_do_operation_locked(struct pancsf_device *pfdev, int as_nr,
+				      u64 iova, u64 size, u32 op)
+{
+	if (as_nr < 0)
+		return 0;
+
+	if (op != AS_COMMAND_UNLOCK)
+		lock_region(pfdev, as_nr, iova, size);
+
+	/* Run the MMU operation */
+	write_cmd(pfdev, as_nr, op);
+
+	/* Wait for the flush to complete */
+	return wait_ready(pfdev, as_nr);
+}
+
+static int mmu_hw_do_operation(struct pancsf_vm *vm,
+			       u64 iova, u64 size, u32 op)
+{
+	struct pancsf_device *pfdev = vm->pfdev;
+	int ret;
+
+	spin_lock(&pfdev->mmu->as.op_lock);
+	ret = mmu_hw_do_operation_locked(pfdev, vm->as, iova, size, op);
+	spin_unlock(&pfdev->mmu->as.op_lock);
+	return ret;
+}
+
+static void pancsf_mmu_as_enable(struct pancsf_device *pfdev, u32 as_nr,
+				 u64 transtab, u64 transcfg, u64 memattr)
+{
+	mmu_hw_do_operation_locked(pfdev, as_nr, 0, ~0ULL, AS_COMMAND_FLUSH_MEM);
+
+	mmu_write(pfdev, AS_TRANSTAB_LO(as_nr), lower_32_bits(transtab));
+	mmu_write(pfdev, AS_TRANSTAB_HI(as_nr), upper_32_bits(transtab));
+
+	mmu_write(pfdev, AS_MEMATTR_LO(as_nr), lower_32_bits(memattr));
+	mmu_write(pfdev, AS_MEMATTR_HI(as_nr), upper_32_bits(memattr));
+
+	mmu_write(pfdev, AS_TRANSCFG_LO(as_nr), lower_32_bits(transcfg));
+	mmu_write(pfdev, AS_TRANSCFG_HI(as_nr), upper_32_bits(transcfg));
+
+	write_cmd(pfdev, as_nr, AS_COMMAND_UPDATE);
+}
+
+static void pancsf_mmu_as_disable(struct pancsf_device *pfdev, u32 as_nr)
+{
+	mmu_hw_do_operation_locked(pfdev, as_nr, 0, ~0ULL, AS_COMMAND_FLUSH_MEM);
+
+	mmu_write(pfdev, AS_TRANSTAB_LO(as_nr), 0);
+	mmu_write(pfdev, AS_TRANSTAB_HI(as_nr), 0);
+
+	mmu_write(pfdev, AS_MEMATTR_LO(as_nr), 0);
+	mmu_write(pfdev, AS_MEMATTR_HI(as_nr), 0);
+
+	mmu_write(pfdev, AS_TRANSCFG_LO(as_nr), AS_TRANSCFG_ADRMODE_UNMAPPED);
+	mmu_write(pfdev, AS_TRANSCFG_HI(as_nr), 0);
+
+	write_cmd(pfdev, as_nr, AS_COMMAND_UPDATE);
+}
+
+static void pancsf_vm_enable(struct pancsf_vm *vm)
+{
+	struct pancsf_device *pfdev = vm->pfdev;
+	struct io_pgtable_cfg *cfg = &vm->pgtbl_cfg;
+	u64 transtab, transcfg;
+
+	transtab = cfg->arm_lpae_s1_cfg.ttbr;
+	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
+		   AS_TRANSCFG_PTW_RA |
+		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
+	if (pfdev->coherent)
+		transcfg |= AS_TRANSCFG_PTW_SH_OS;
+
+	pancsf_mmu_as_enable(vm->pfdev, vm->as, transtab, transcfg, vm->memattr);
+}
+
+static void pancsf_vm_disable(struct pancsf_vm *vm)
+{
+	pancsf_mmu_as_disable(vm->pfdev, vm->as);
+}
+
+static u32 pancsf_mmu_fault_mask(struct pancsf_device *pfdev, u32 value)
+{
+	/* Bits 16 to 31 mean REQ_COMPLETE. */
+	return value & GENMASK(15, 0);
+}
+
+static u32 pancsf_mmu_as_fault_mask(struct pancsf_device *pfdev, u32 as)
+{
+	return BIT(as);
+}
+
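+/* An address space (AS) slot is one of the hardware MMU contexts the GPU can
+ * have live at any given time. There can be more VMs than AS slots, so slots
+ * are assigned on demand, and reclaimed from the least recently used VM with
+ * no active users when none are free.
+ */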
+int pancsf_vm_as_get(struct pancsf_vm *vm)
+{
+	struct pancsf_device *pfdev = vm->pfdev;
+	int as;
+
+	mutex_lock(&pfdev->mmu->as.slots_lock);
+
+	as = vm->as;
+	if (as >= 0) {
+		u32 mask = pancsf_mmu_as_fault_mask(pfdev, as);
+
+		atomic_inc(&vm->as_count);
+		list_move(&vm->node, &pfdev->mmu->as.lru_list);
+
+		if (pfdev->mmu->as.faulty_mask & mask) {
+			/* Unhandled pagefault on this AS, the MMU was
+			 * disabled. We need to re-enable the MMU after
+			 * clearing+unmasking the AS interrupts.
+			 */
+			mmu_write(pfdev, MMU_INT_CLEAR, mask);
+			pfdev->mmu->as.faulty_mask &= ~mask;
+			mmu_write(pfdev, MMU_INT_MASK, ~pfdev->mmu->as.faulty_mask);
+			pancsf_vm_enable(vm);
+		}
+
+		goto out;
+	}
+
+	/* Check for a free AS */
+	if (vm->for_mcu) {
+		WARN_ON(pfdev->mmu->as.alloc_mask & BIT(0));
+		as = 0;
+	} else {
+		as = ffz(pfdev->mmu->as.alloc_mask | BIT(0));
+	}
+
+	if (!(BIT(as) & pfdev->gpu_info.as_present)) {
+		struct pancsf_vm *lru_vm;
+
+		list_for_each_entry_reverse(lru_vm, &pfdev->mmu->as.lru_list, node) {
+			if (!atomic_read(&lru_vm->as_count))
+				break;
+		}
+		WARN_ON(&lru_vm->node == &pfdev->mmu->as.lru_list);
+
+		list_del_init(&lru_vm->node);
+		as = lru_vm->as;
+
+		WARN_ON(as < 0);
+		lru_vm->as = -1;
+	}
+
+	/* Assign the free or reclaimed AS to the FD */
+	vm->as = as;
+	pfdev->mmu->as.slots[as] = vm;
+	set_bit(as, &pfdev->mmu->as.alloc_mask);
+	atomic_set(&vm->as_count, 1);
+	list_add(&vm->node, &pfdev->mmu->as.lru_list);
+
+	pancsf_vm_enable(vm);
+
+out:
+	mutex_unlock(&pfdev->mmu->as.slots_lock);
+	return as;
+}
+
+void pancsf_vm_as_put(struct pancsf_vm *vm)
+{
+	atomic_dec(&vm->as_count);
+	WARN_ON(atomic_read(&vm->as_count) < 0);
+}
+
+void pancsf_mmu_pre_reset(struct pancsf_device *pfdev)
+{
+	mmu_write(pfdev, MMU_INT_MASK, 0);
+
+	mutex_lock(&pfdev->mmu->as.slots_lock);
+	/* Flag all ASs as faulty on reset so their interrupts don't get
+	 * re-enabled by the interrupt handler if it's running concurrently.
+	 */
+	pfdev->mmu->as.faulty_mask = ~0;
+	mutex_unlock(&pfdev->mmu->as.slots_lock);
+
+	synchronize_irq(pfdev->mmu->irq);
+}
+
+void pancsf_mmu_reset(struct pancsf_device *pfdev)
+{
+	struct pancsf_vm *vm, *vm_tmp;
+
+	mutex_lock(&pfdev->mmu->as.slots_lock);
+
+	pfdev->mmu->as.alloc_mask = 0;
+	pfdev->mmu->as.faulty_mask = 0;
+
+	list_for_each_entry_safe(vm, vm_tmp, &pfdev->mmu->as.lru_list, node) {
+		vm->as = -1;
+		atomic_set(&vm->as_count, 0);
+		list_del_init(&vm->node);
+	}
+
+	memset(pfdev->mmu->as.slots, 0, sizeof(pfdev->mmu->as.slots));
+	mutex_unlock(&pfdev->mmu->as.slots_lock);
+
+	mmu_write(pfdev, MMU_INT_CLEAR, pancsf_mmu_fault_mask(pfdev, ~0));
+	mmu_write(pfdev, MMU_INT_MASK, pancsf_mmu_fault_mask(pfdev, ~0));
+}
+
+static size_t get_pgsize(u64 addr, size_t size, size_t *count)
+{
+	/*
+	 * io-pgtable only operates on multiple pages within a single table
+	 * entry, so we need to split at boundaries of the table size, i.e.
+	 * the next block size up. The distance from address A to the next
+	 * boundary of block size B is logically B - A % B, but in unsigned
+	 * two's complement where B is a power of two we get the equivalence
+	 * B - A % B == (B - A) % B == (n * B - A) % B, and choose n = 0 :)
+	 */
+	size_t blk_offset = -addr % SZ_2M;
+
+	if (blk_offset || size < SZ_2M) {
+		*count = min_not_zero(blk_offset, size) / SZ_4K;
+		return SZ_4K;
+	}
+	blk_offset = -addr % SZ_1G ?: SZ_1G;
+	*count = min(blk_offset, size) / SZ_2M;
+	return SZ_2M;
+}
+
+static int pancsf_vm_flush_range(struct pancsf_vm *vm, u64 iova, u64 size)
+{
+	struct pancsf_device *pfdev = vm->pfdev;
+	int ret = 0;
+
+	if (vm->as < 0)
+		return 0;
+
+	pm_runtime_get_noresume(pfdev->dev);
+
+	/* Flush the PTs only if we're already awake */
+	if (pm_runtime_active(pfdev->dev))
+		ret = mmu_hw_do_operation(vm, iova, size, AS_COMMAND_FLUSH_PT);
+
+	pm_runtime_put_sync_autosuspend(pfdev->dev);
+	return ret;
+}
+
+int pancsf_vm_unmap_pages(struct pancsf_vm *vm, u64 iova, size_t size)
+{
+	struct pancsf_device *pfdev = vm->pfdev;
+	struct io_pgtable_ops *ops = vm->pgtbl_ops;
+	u64 start_iova = iova;
+	size_t remaining = size;
+	int ret = 0;
+
+	dev_dbg(pfdev->dev, "unmap: as=%d, iova=%llx, len=%zx", vm->as, iova, size);
+
+	while (remaining) {
+		size_t unmapped_sz = 0, pgcount;
+		size_t pgsize = get_pgsize(iova, remaining, &pgcount);
+
+		if (ops->iova_to_phys(ops, iova)) {
+			unmapped_sz = ops->unmap_pages(ops, iova, pgsize, pgcount, NULL);
+			/* Unmapping might involve splitting 2MB ptes into 4K ones,
+			 * which might fail on memory allocation. Assume this is what
+			 * happened when unmap_pages() returns 0.
+			 */
+			if (WARN_ON(!unmapped_sz)) {
+				ret = -ENOMEM;
+				break;
+			}
+		}
+
+		/* If we couldn't unmap the whole range, skip a page, and try again. */
+		if (unmapped_sz < pgsize * pgcount)
+			unmapped_sz += pgsize;
+
+		iova += unmapped_sz;
+		remaining -= unmapped_sz;
+	}
+
+	/* The range might be partially unmapped, but flushing the TLB for the
+	 * whole range is safe and simpler. Don't let the flush result hide an
+	 * unmap error.
+	 */
+	if (ret)
+		pancsf_vm_flush_range(vm, start_iova, size);
+	else
+		ret = pancsf_vm_flush_range(vm, start_iova, size);
+
+	return ret;
+}
+
+static int
+pancsf_vm_map_pages(struct pancsf_vm *vm, u64 iova, int prot,
+		    struct sg_table *sgt, u64 offset, ssize_t size)
+{
+	struct pancsf_device *pfdev = vm->pfdev;
+	unsigned int count;
+	struct scatterlist *sgl;
+	struct io_pgtable_ops *ops = vm->pgtbl_ops;
+	u64 start_iova = iova;
+	int ret;
+
+	if (!size)
+		return 0;
+
+	for_each_sgtable_dma_sg(sgt, sgl, count) {
+		dma_addr_t paddr = sg_dma_address(sgl);
+		size_t len = sg_dma_len(sgl);
+
+		if (len <= offset) {
+			offset -= len;
+			continue;
+		}
+
+		paddr -= offset;
+		len -= offset;
+
+		if (size >= 0) {
+			len = min_t(size_t, len, size);
+			size -= len;
+		}
+
+		dev_dbg(pfdev->dev, "map: as=%d, iova=%llx, paddr=%llx, len=%zx",
+			vm->as, iova, paddr, len);
+
+		while (len) {
+			size_t pgcount, mapped = 0;
+			size_t pgsize = get_pgsize(iova | paddr, len, &pgcount);
+
+			ret = ops->map_pages(ops, iova, paddr, pgsize, pgcount, prot,
+					     GFP_KERNEL, &mapped);
+			/* Don't get stuck if things have gone wrong */
+			mapped = max(mapped, pgsize);
+			iova += mapped;
+			paddr += mapped;
+			len -= mapped;
+
+			if (ret) {
+				/* If something failed, unmap what we've already mapped before
+				 * returning. The unmap call is not supposed to fail.
+				 */
+				WARN_ON(pancsf_vm_unmap_pages(vm, start_iova, iova - start_iova));
+				return ret;
+			}
+		}
+
+		if (!size)
+			break;
+	}
+
+	return pancsf_vm_flush_range(vm, start_iova, iova - start_iova);
+}
+
+static int
+pancsf_vma_map_locked(struct pancsf_vma *vma, struct sg_table *sgt)
+{
+	int prot = 0;
+	int ret;
+
+	if (WARN_ON(vma->mapped))
+		return 0;
+
+	if (vma->flags & PANCSF_VMA_MAP_NOEXEC)
+		prot |= IOMMU_NOEXEC;
+
+	if (!(vma->flags & PANCSF_VMA_MAP_UNCACHED))
+		prot |= IOMMU_CACHE;
+
+	if (vma->flags & PANCSF_VMA_MAP_READONLY)
+		prot |= IOMMU_READ;
+	else
+		prot |= IOMMU_READ | IOMMU_WRITE;
+
+	if (vma->bo)
+		sgt = drm_gem_shmem_get_pages_sgt(&vma->bo->base);
+
+	if (IS_ERR(sgt))
+		return PTR_ERR(sgt);
+	else if (!sgt)
+		return -EINVAL;
+
+	ret = pancsf_vm_map_pages(vma->vm, vma->vm_mm_node.start << PAGE_SHIFT, prot,
+				  sgt, vma->offset, vma->vm_mm_node.size << PAGE_SHIFT);
+	if (!ret)
+		vma->mapped = true;
+
+	return ret;
+}
+
+static int
+pancsf_vma_unmap_locked(struct pancsf_vma *vma)
+{
+	int ret;
+
+	if (WARN_ON(!vma->mapped))
+		return 0;
+
+	ret = pancsf_vm_unmap_pages(vma->vm, vma->vm_mm_node.start << PAGE_SHIFT,
+				    vma->vm_mm_node.size << PAGE_SHIFT);
+	if (!ret)
+		vma->mapped = false;
+
+	return ret;
+}
+
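+/* VA range used when the kernel picks the GPU VA (PANCSF_VMA_MAP_AUTO_VA):
+ * the upper half of a 48-bit address space, presumably to leave the lower
+ * half free for mappings placed explicitly by userspace.
+ */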
+#define PANCSF_GPU_AUTO_VA_RANGE_START	0x800000000000ull
+#define PANCSF_GPU_AUTO_VA_RANGE_END	0x1000000000000ull
+
+static void
+pancsf_vm_free_vma_locked(struct pancsf_vma *vma)
+{
+	struct pancsf_gem_object *bo = vma->bo;
+
+	if (WARN_ON(vma->mapped)) {
+		/* Leak BO/VMA memory if we can't unmap. That's better than
+		 * letting the GPU access a region that can be re-allocated.
+		 */
+		if (WARN_ON(pancsf_vma_unmap_locked(vma)))
+			return;
+	}
+
+	if (vma->vm)
+		lockdep_assert_held(&vma->vm->lock);
+
+	if (bo)
+		drm_gem_object_put(&bo->base.base);
+
+	if (drm_mm_node_allocated(&vma->vm_mm_node))
+		drm_mm_remove_node(&vma->vm_mm_node);
+
+	kfree(vma);
+}
+
+static struct pancsf_vma *
+pancsf_vm_alloc_vma_locked(struct pancsf_vm *vm, size_t size, u64 va, u32 flags)
+{
+	bool is_exec = !(flags & PANCSF_VMA_MAP_NOEXEC);
+	u64 range_start, range_end, align;
+	struct pancsf_vma *vma;
+	unsigned int color = 0;
+	int ret;
+
+	lockdep_assert_held(&vm->lock);
+
+	if (flags & PANCSF_VMA_MAP_AUTO_VA)
+		va = 0;
+
+	if (vm->for_mcu || !size || ((size | va) & ~PAGE_MASK) != 0)
+		return ERR_PTR(-EINVAL);
+
+	if (!(flags & PANCSF_VMA_MAP_AUTO_VA)) {
+		/* Explicit VA assignment: we don't apply coloring. If the VA
+		 * is inappropriate, it's the caller's responsibility.
+		 */
+		range_start = va >> PAGE_SHIFT;
+		range_end = (va + size) >> PAGE_SHIFT;
+		align = 1;
+	} else {
+		range_start = PANCSF_GPU_AUTO_VA_RANGE_START >> PAGE_SHIFT;
+		range_end = PANCSF_GPU_AUTO_VA_RANGE_END >> PAGE_SHIFT;
+
+		if (is_exec) {
+			/* JUMP instructions can't cross a 4GB boundary, but
+			 * JUMP_EX ones can. We assume any executable VMA smaller
+			 * than 4GB expects things to fit in a single 4GB block.
+			 * Let's align on the closest power-of-2 size to guarantee
+			 * that. For anything bigger, we align the mapping to 4GB
+			 * and assume userspace uses JUMP_EX where appropriate.
+			 */
+			if (size < SZ_4G) {
+				u64 aligned_size = 1 << const_ilog2(size);
+
+				if (aligned_size != size)
+					aligned_size <<= 1;
+
+				align = aligned_size >> PAGE_SHIFT;
+			} else {
+				align = SZ_4G;
+			}
+		} else {
+			/* Align to 2MB if the buffer is bigger than 2MB. */
+			align = (size >= SZ_2M ? SZ_2M : PAGE_SIZE) >> PAGE_SHIFT;
+		}
+
+		/* Fragment shaders might call blend shaders, which need to
+		 * live in the same 4GB block. We reserve the last 16M of the
+		 * 4GB block such VMAs land in for blend shader mappings.
+		 * Mapping blend shaders has to be done with a VM_BIND using
+		 * an explicit VA.
+		 */
+		if (flags & PANCSF_VMA_MAP_FRAG_SHADER)
+			color |= MM_COLOR_FRAG_SHADER;
+	}
+
+	vma = kzalloc(sizeof(*vma), GFP_KERNEL);
+	if (!vma)
+		return ERR_PTR(-ENOMEM);
+
+	vma->flags = flags;
+	vma->vm = vm;
+	ret = drm_mm_insert_node_in_range(&vm->mm, &vma->vm_mm_node,
+					  size >> PAGE_SHIFT, align, color,
+					  range_start, range_end,
+					  DRM_MM_INSERT_BEST);
+
+	if (ret)
+		goto err_free_vma;
+
+	return vma;
+
+err_free_vma:
+	pancsf_vm_free_vma_locked(vma);
+	return ERR_PTR(ret);
+}
+
+int
+pancsf_vm_map_bo_range(struct pancsf_vm *vm, struct pancsf_gem_object *bo,
+		       u64 offset, size_t size, u64 *va, u32 flags)
+{
+	struct pancsf_vma *vma;
+	int ret = 0;
+
+	/* Make sure the requested range is within the BO. Alignment is
+	 * checked when the VMA is allocated.
+	 */
+	if (size > bo->base.base.size || offset > bo->base.base.size - size)
+		return -EINVAL;
+
+	mutex_lock(&vm->lock);
+	vma = pancsf_vm_alloc_vma_locked(vm, size, *va, flags);
+	if (IS_ERR(vma)) {
+		ret = PTR_ERR(vma);
+		goto out_unlock;
+	}
+
+	drm_gem_object_get(&bo->base.base);
+	vma->bo = bo;
+
+	if (!(flags & PANCSF_VMA_MAP_ON_FAULT)) {
+		ret = pancsf_vma_map_locked(vma, NULL);
+		if (ret)
+			goto out_unlock;
+	}
+
+	*va = vma->vm_mm_node.start << PAGE_SHIFT;
+
+out_unlock:
+	if (ret && !IS_ERR(vma))
+		pancsf_vm_free_vma_locked(vma);
+
+	mutex_unlock(&vm->lock);
+
+	return ret;
+}
+
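+/* Variant of drm_mm_for_each_node_in_range() that pre-fetches the next node,
+ * so the current node can be removed from the drm_mm inside the loop body.
+ */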
+#define drm_mm_for_each_node_in_range_safe(node__, next_node__, mm__, start__, end__) \
+	for (node__ = __drm_mm_interval_first((mm__), (start__), (end__) - 1), \
+	     next_node__ = list_next_entry(node__, node_list); \
+	     node__->start < (end__); \
+	     node__ = next_node__, next_node__ = list_next_entry(node__, node_list))
+
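+/* Unmap a GPU VA range. VMAs fully covered by the range are unmapped and
+ * destroyed, while VMAs that are only partially covered are shrunk and
+ * re-inserted. If the range falls in the middle of a single VMA, that VMA is
+ * split in two.
+ */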
+int
+pancsf_vm_unmap_range(struct pancsf_vm *vm, u64 va, size_t size)
+{
+	struct drm_mm_node *mm_node, *tmp_mm_node;
+	struct pancsf_vma *first = NULL, *last = NULL;
+	size_t first_new_size = 0, first_unmap_size = 0, last_new_size = 0, last_unmap_size = 0;
+	u64 first_va, first_unmap_va = 0, last_va, last_unmap_va = 0, last_new_offset = 0;
+	int ret = 0;
+
+	if ((va | size) & ~PAGE_MASK)
+		return -EINVAL;
+
+	if (!size)
+		return 0;
+
+	mutex_lock(&vm->lock);
+
+	drm_mm_for_each_node_in_range_safe(mm_node, tmp_mm_node, &vm->mm, va >> PAGE_SHIFT,
+					   (va + size) >> PAGE_SHIFT) {
+		struct pancsf_vma *vma = container_of(mm_node, struct pancsf_vma, vm_mm_node);
+		u64 vma_va = mm_node->start << PAGE_SHIFT;
+		size_t vma_size = mm_node->size << PAGE_SHIFT;
+
+		if (vma_va < va) {
+			first = vma;
+			first_va = vma_va;
+			first_new_size = va - vma_va;
+			first_unmap_size = vma_size - first_new_size;
+			first_unmap_va = va;
+		}
+
+		if (vma_va + vma_size > va + size) {
+			last = vma;
+			last_va = va + size;
+			last_new_size = vma_va + vma_size - last_va;
+			last_new_offset = last->offset +
+					  ((last->vm_mm_node.size << PAGE_SHIFT) - last_new_size);
+			last_unmap_size = vma_size - last_new_size;
+			last_unmap_va = vma_va;
+
+			/* Partial unmap in the middle of the only VMA. We need to create a
+			 * new VMA.
+			 */
+			if (first == last) {
+				last = kzalloc(sizeof(*last), GFP_KERNEL);
+
+				if (!last) {
+					ret = -ENOMEM;
+					goto out_unlock;
+				}
+
+				last_unmap_va = 0;
+				last_unmap_size = 0;
+				first_unmap_size -= last_new_size;
+				last->vm = first->vm;
+				last->flags = first->flags;
+				last->bo = first->bo;
+				last->mapped = first->mapped;
+				drm_gem_object_get(&last->bo->base.base);
+			}
+		}
+
+		mm_node = list_next_entry(mm_node, node_list);
+
+		if (vma != last && vma != first) {
+			ret = pancsf_vma_unmap_locked(vma);
+			if (ret)
+				goto out_unlock;
+
+			pancsf_vm_free_vma_locked(vma);
+		}
+	}
+
+	/* Re-insert first and last VMAs if the unmap was partial. */
+	if (first) {
+		ret = pancsf_vm_unmap_pages(vm, first_unmap_va >> PAGE_SHIFT,
+					    first_unmap_size >> PAGE_SHIFT);
+		if (ret)
+			goto out_unlock;
+
+		drm_mm_remove_node(&first->vm_mm_node);
+		drm_mm_insert_node_in_range(&vm->mm, &first->vm_mm_node,
+					    first_new_size >> PAGE_SHIFT, 1, 0,
+					    first_va >> PAGE_SHIFT,
+					    (first_va + first_new_size) >> PAGE_SHIFT,
+					    DRM_MM_INSERT_BEST);
+	}
+
+	if (last) {
+		if (drm_mm_node_allocated(&last->vm_mm_node)) {
+			ret = pancsf_vm_unmap_pages(vm, last_unmap_va >> PAGE_SHIFT,
+						    last_unmap_size >> PAGE_SHIFT);
+			if (ret)
+				goto out_unlock;
+
+			drm_mm_remove_node(&last->vm_mm_node);
+		}
+
+		last->offset = last_new_offset;
+		drm_mm_insert_node_in_range(&vm->mm, &last->vm_mm_node,
+					    last_new_size >> PAGE_SHIFT, 1, 0,
+					    last_va >> PAGE_SHIFT,
+					    (last_va + last_new_size) >> PAGE_SHIFT,
+					    DRM_MM_INSERT_BEST);
+	}
+
+out_unlock:
+	mutex_unlock(&vm->lock);
+
+	return ret;
+}
+
+struct pancsf_gem_object *
+pancsf_vm_get_bo_for_vma(struct pancsf_vm *vm, u64 va, u64 *bo_offset)
+{
+	struct pancsf_gem_object *bo = ERR_PTR(-ENOENT);
+	struct pancsf_vma *vma = NULL;
+	struct drm_mm_node *mm_node;
+	u64 vma_va;
+
+	mutex_lock(&vm->lock);
+	drm_mm_for_each_node_in_range(mm_node, &vm->mm, va >> PAGE_SHIFT, (va >> PAGE_SHIFT) + 1) {
+		vma = container_of(mm_node, struct pancsf_vma, vm_mm_node);
+		break;
+	}
+
+	if (vma && vma->bo) {
+		bo = vma->bo;
+		drm_gem_object_get(&bo->base.base);
+		vma_va = mm_node->start << PAGE_SHIFT;
+		*bo_offset = va - vma_va;
+	}
+	mutex_unlock(&vm->lock);
+
+	return bo;
+}
+
+#define PANCSF_MAX_VMS_PER_FILE	 32
+
+int pancsf_vm_pool_create_vm(struct pancsf_device *pfdev, struct pancsf_vm_pool *pool)
+{
+	struct pancsf_vm *vm;
+	int ret;
+	u32 id;
+
+	vm = pancsf_vm_create(pfdev, false);
+	if (IS_ERR(vm))
+		return PTR_ERR(vm);
+
+	mutex_lock(&pool->lock);
+	ret = xa_alloc(&pool->xa, &id, vm,
+		       XA_LIMIT(1, PANCSF_MAX_VMS_PER_FILE), GFP_KERNEL);
+	mutex_unlock(&pool->lock);
+
+	if (ret) {
+		pancsf_vm_put(vm);
+		return ret;
+	}
+
+	return id;
+}
+
+void pancsf_vm_pool_destroy_vm(struct pancsf_vm_pool *pool, u32 handle)
+{
+	struct pancsf_vm *vm;
+
+	mutex_lock(&pool->lock);
+	vm = xa_erase(&pool->xa, handle);
+	mutex_unlock(&pool->lock);
+
+	if (vm)
+		pancsf_vm_put(vm);
+}
+
+struct pancsf_vm *pancsf_vm_pool_get_vm(struct pancsf_vm_pool *pool, u32 handle)
+{
+	struct pancsf_vm *vm;
+
+	mutex_lock(&pool->lock);
+	vm = xa_load(&pool->xa, handle);
+	if (vm)
+		pancsf_vm_get(vm);
+	mutex_unlock(&pool->lock);
+
+	return vm;
+}
+
+void pancsf_vm_pool_destroy(struct pancsf_file *pfile)
+{
+	struct pancsf_vm *vm;
+	unsigned long i;
+
+	if (!pfile->vms)
+		return;
+
+	mutex_lock(&pfile->vms->lock);
+	xa_for_each(&pfile->vms->xa, i, vm)
+		pancsf_vm_put(vm);
+	mutex_unlock(&pfile->vms->lock);
+
+	mutex_destroy(&pfile->vms->lock);
+	xa_destroy(&pfile->vms->xa);
+	kfree(pfile->vms);
+}
+
+int pancsf_vm_pool_create(struct pancsf_file *pfile)
+{
+	pfile->vms = kzalloc(sizeof(*pfile->vms), GFP_KERNEL);
+	if (!pfile->vms)
+		return -ENOMEM;
+
+	xa_init_flags(&pfile->vms->xa, XA_FLAGS_ALLOC1);
+	mutex_init(&pfile->vms->lock);
+	return 0;
+}
+
+/* dummy TLB ops, the real TLB flush happens in pancsf_vm_flush_range() */
+static void mmu_tlb_flush_all(void *cookie)
+{
+}
+
+static void mmu_tlb_flush_walk(unsigned long iova, size_t size, size_t granule, void *cookie)
+{
+}
+
+static const struct iommu_flush_ops mmu_tlb_ops = {
+	.tlb_flush_all = mmu_tlb_flush_all,
+	.tlb_flush_walk = mmu_tlb_flush_walk,
+};
+
+#define NUM_FAULT_PAGES (SZ_2M / PAGE_SIZE)
+
+static int pancsf_mmu_map_fault_addr_locked(struct pancsf_device *pfdev, int as, u64 addr)
+{
+	struct pancsf_vm *vm;
+	struct pancsf_vma *vma = NULL;
+	struct drm_mm_node *mm_node;
+	int ret;
+
+	vm = pfdev->mmu->as.slots[as];
+	if (!vm)
+		return -ENOENT;
+
+	mutex_lock(&vm->lock);
+	drm_mm_for_each_node_in_range(mm_node, &vm->mm, addr, addr + 1)
+		vma = container_of(mm_node, struct pancsf_vma, vm_mm_node);
+
+	if (!vma) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	if (!(vma->flags & PANCSF_VMA_MAP_ON_FAULT)) {
+		dev_warn(pfdev->dev, "matching VMA is not MAP_ON_FAULT (GPU VA = %llx)",
+			 vma->vm_mm_node.start << PAGE_SHIFT);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	WARN_ON(vma->vm->as != as);
+
+	ret = pancsf_vma_map_locked(vma, NULL);
+	if (ret)
+		goto out;
+
+	dev_dbg(pfdev->dev, "mapped page fault @ AS%d %llx", as, addr);
+
+out:
+	mutex_unlock(&vm->lock);
+	return ret;
+}
+
+static void pancsf_vm_release(struct kref *kref)
+{
+	struct pancsf_vm *vm = container_of(kref, struct pancsf_vm, refcount);
+	struct pancsf_device *pfdev = vm->pfdev;
+	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(pfdev->gpu_info.mmu_features);
+
+	mutex_lock(&pfdev->mmu->as.slots_lock);
+	if (vm->as >= 0) {
+		pm_runtime_get_noresume(pfdev->dev);
+		if (pm_runtime_active(pfdev->dev))
+			pancsf_vm_disable(vm);
+		pm_runtime_put_autosuspend(pfdev->dev);
+
+		pfdev->mmu->as.slots[vm->as] = NULL;
+		clear_bit(vm->as, &pfdev->mmu->as.alloc_mask);
+		clear_bit(vm->as, &pfdev->mmu->as.in_use_mask);
+		list_del(&vm->node);
+	}
+	mutex_unlock(&pfdev->mmu->as.slots_lock);
+
+	pancsf_vm_unmap_range(vm, 0, 1ull << va_bits);
+
+	free_io_pgtable_ops(vm->pgtbl_ops);
+	drm_mm_takedown(&vm->mm);
+	mutex_destroy(&vm->lock);
+	dma_resv_fini(&vm->resv);
+	kfree(vm);
+}
+
+void pancsf_vm_put(struct pancsf_vm *vm)
+{
+	if (vm)
+		kref_put(&vm->refcount, pancsf_vm_release);
+}
+
+struct pancsf_vm *pancsf_vm_get(struct pancsf_vm *vm)
+{
+	if (vm)
+		kref_get(&vm->refcount);
+
+	return vm;
+}
+
+#define PFN_4G		(SZ_4G >> PAGE_SHIFT)
+#define PFN_4G_MASK	(PFN_4G - 1)
+#define PFN_16M		(SZ_16M >> PAGE_SHIFT)
+
+static void pancsf_drm_mm_color_adjust(const struct drm_mm_node *node,
+				       unsigned long color,
+				       u64 *start, u64 *end)
+{
+	if (color & MM_COLOR_FRAG_SHADER) {
+		u64 next_seg;
+
+		/* Reserve the last 16M of the 4GB block for blend shaders */
+		next_seg = ALIGN(*start + 1, PFN_4G);
+		if (next_seg - *start <= PFN_16M)
+			*start = next_seg + 1;
+
+		*end = min(*end, ALIGN(*start, PFN_4G) - PFN_16M);
+	}
+}
+
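+/* Translate the MAIR value picked by the io-pgtable ARM LPAE backend into the
+ * AS_MEMATTR encoding expected by the GPU MMU.
+ */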
+static u64 mair_to_memattr(u64 mair)
+{
+	u64 memattr = 0;
+	u32 i;
+
+	for (i = 0; i < 8; i++) {
+		u8 in_attr = mair >> (8 * i), out_attr;
+		u8 outer = in_attr >> 4, inner = in_attr & 0xf;
+
+		/* For caching to be enabled, both the inner and outer caching
+		 * policies have to be write-back. If either of them is
+		 * write-through or non-cacheable, we just choose non-cacheable.
+		 * Device memory is also translated to non-cacheable.
+		 */
+		if (!(outer & 3) || !(outer & 4) || !(inner & 4)) {
+			out_attr = AS_MEMATTR_AARCH64_INNER_OUTER_NC |
+				   AS_MEMATTR_AARCH64_SH_MIDGARD_INNER |
+				   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(false, false);
+		} else {
+			/* Use SH_CPU_INNER mode so SH_IS, which is used when
+			 * IOMMU_CACHE is set, actually maps to the standard
+			 * definition of inner-shareable and not Mali's
+			 * internal-shareable mode.
+			 */
+			out_attr = AS_MEMATTR_AARCH64_INNER_OUTER_WB |
+				   AS_MEMATTR_AARCH64_SH_CPU_INNER |
+				   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(inner & 1, inner & 2);
+		}
+
+		memattr |= (u64)out_attr << (8 * i);
+	}
+
+	return memattr;
+}
+
+void pancsf_vm_unmap_mcu_pages(struct pancsf_vm *vm,
+			       struct drm_mm_node *mm_node)
+{
+	struct io_pgtable_ops *ops = vm->pgtbl_ops;
+	size_t len = mm_node->size << PAGE_SHIFT;
+	u64 iova = mm_node->start << PAGE_SHIFT;
+	size_t unmapped_len = 0;
+
+	while (unmapped_len < len) {
+		size_t unmapped_page, pgcount;
+		size_t pgsize = get_pgsize(iova, len - unmapped_len, &pgcount);
+
+		if (ops->iova_to_phys(ops, iova)) {
+			unmapped_page = ops->unmap_pages(ops, iova, pgsize, pgcount, NULL);
+			WARN_ON(unmapped_page != pgsize * pgcount);
+		}
+		iova += pgsize * pgcount;
+		unmapped_len += pgsize * pgcount;
+	}
+
+	pancsf_vm_flush_range(vm, mm_node->start << PAGE_SHIFT, len);
+
+	mutex_lock(&vm->lock);
+	drm_mm_remove_node(mm_node);
+	mutex_unlock(&vm->lock);
+}
+
+int pancsf_vm_remap_mcu_pages(struct pancsf_vm *vm,
+			      struct drm_mm_node *mm_node,
+			      struct sg_table *sgt,
+			      int prot)
+{
+	if (WARN_ON(!drm_mm_node_allocated(mm_node)))
+		return -EINVAL;
+
+	pancsf_vm_map_pages(vm, mm_node->start << PAGE_SHIFT, prot, sgt, 0, -1);
+	return 0;
+}
+
+int pancsf_vm_map_mcu_pages(struct pancsf_vm *vm,
+			    struct drm_mm_node *mm_node,
+			    struct sg_table *sgt,
+			    unsigned int num_pages,
+			    u64 va_start, u64 va_end,
+			    int prot)
+{
+	int ret;
+
+	if (WARN_ON(!vm->for_mcu))
+		return -EINVAL;
+
+	mutex_lock(&vm->lock);
+	ret = drm_mm_insert_node_in_range(&vm->mm, mm_node,
+					  num_pages, 0, 0,
+					  va_start >> PAGE_SHIFT,
+					  va_end >> PAGE_SHIFT,
+					  DRM_MM_INSERT_BEST);
+	mutex_unlock(&vm->lock);
+
+	if (ret) {
+		dev_err(vm->pfdev->dev, "Failed to reserve VA range %llx-%llx num_pages %d (err=%d)",
+			va_start, va_end, num_pages, ret);
+		return ret;
+	}
+
+	pancsf_vm_map_pages(vm, mm_node->start << PAGE_SHIFT, prot, sgt,
+			    0, num_pages << PAGE_SHIFT);
+	return 0;
+}
+
+struct pancsf_vm *pancsf_vm_create(struct pancsf_device *pfdev, bool for_mcu)
+{
+	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(pfdev->gpu_info.mmu_features);
+	u32 pa_bits = GPU_MMU_FEATURES_PA_BITS(pfdev->gpu_info.mmu_features);
+	struct pancsf_vm *vm;
+	u64 va_start, va_end;
+
+	vm = kzalloc(sizeof(*vm), GFP_KERNEL);
+	if (!vm)
+		return ERR_PTR(-ENOMEM);
+
+	vm->for_mcu = for_mcu;
+	vm->pfdev = pfdev;
+	dma_resv_init(&vm->resv);
+	mutex_init(&vm->lock);
+
+	if (for_mcu) {
+		/* The CSF MCU is a Cortex-M7, and can only address 4G. */
+		va_start = 0;
+		va_end = SZ_4G;
+	} else {
+		va_start = SZ_32M;
+		va_end = (1ull << va_bits) - SZ_32M;
+		vm->mm.color_adjust = pancsf_drm_mm_color_adjust;
+	}
+
+	drm_mm_init(&vm->mm, va_start >> PAGE_SHIFT, va_end >> PAGE_SHIFT);
+
+	INIT_LIST_HEAD(&vm->node);
+	vm->as = -1;
+
+	vm->pgtbl_cfg = (struct io_pgtable_cfg) {
+		.pgsize_bitmap	= SZ_4K | SZ_2M,
+		.ias		= va_bits,
+		.oas		= pa_bits,
+		.coherent_walk	= pfdev->coherent,
+		.tlb		= &mmu_tlb_ops,
+		.iommu_dev	= pfdev->dev,
+	};
+
+	vm->pgtbl_ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &vm->pgtbl_cfg, vm);
+	if (!vm->pgtbl_ops) {
+		kfree(vm);
+		return ERR_PTR(-EINVAL);
+	}
+
+	vm->memattr = mair_to_memattr(vm->pgtbl_cfg.arm_lpae_s1_cfg.mair);
+	kref_init(&vm->refcount);
+
+	return vm;
+}
+
+static const char *access_type_name(struct pancsf_device *pfdev,
+				    u32 fault_status)
+{
+	switch (fault_status & AS_FAULTSTATUS_ACCESS_TYPE_MASK) {
+	case AS_FAULTSTATUS_ACCESS_TYPE_ATOMIC:
+		return "ATOMIC";
+	case AS_FAULTSTATUS_ACCESS_TYPE_READ:
+		return "READ";
+	case AS_FAULTSTATUS_ACCESS_TYPE_WRITE:
+		return "WRITE";
+	case AS_FAULTSTATUS_ACCESS_TYPE_EX:
+		return "EXECUTE";
+	default:
+		WARN_ON(1);
+		return NULL;
+	}
+}
+
+static irqreturn_t pancsf_mmu_irq_handler(int irq, void *data)
+{
+	struct pancsf_device *pfdev = data;
+
+	if (!mmu_read(pfdev, MMU_INT_STAT))
+		return IRQ_NONE;
+
+	mmu_write(pfdev, MMU_INT_MASK, 0);
+	return IRQ_WAKE_THREAD;
+}
+
+static irqreturn_t pancsf_mmu_irq_handler_thread(int irq, void *data)
+{
+	struct pancsf_device *pfdev = data;
+	u32 status = mmu_read(pfdev, MMU_INT_RAWSTAT);
+	int ret;
+
+	status = pancsf_mmu_fault_mask(pfdev, status);
+	while (status) {
+		u32 as = ffs(status | (status >> 16)) - 1;
+		u32 mask = pancsf_mmu_as_fault_mask(pfdev, as);
+		u64 addr;
+		u32 fault_status;
+		u32 exception_type;
+		u32 access_type;
+		u32 source_id;
+
+		fault_status = mmu_read(pfdev, AS_FAULTSTATUS(as));
+		addr = mmu_read(pfdev, AS_FAULTADDRESS_LO(as));
+		addr |= (u64)mmu_read(pfdev, AS_FAULTADDRESS_HI(as)) << 32;
+
+		/* decode the fault status */
+		exception_type = fault_status & 0xFF;
+		access_type = (fault_status >> 8) & 0x3;
+		source_id = (fault_status >> 16);
+
+		mmu_write(pfdev, MMU_INT_CLEAR, mask);
+
+		/* Page fault only */
+		ret = -1;
+		mutex_lock(&pfdev->mmu->as.slots_lock);
+		if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 0xC0)
+			ret = pancsf_mmu_map_fault_addr_locked(pfdev, as, addr);
+
+		if (ret) {
+			/* terminal fault, print info about the fault */
+			dev_err(pfdev->dev,
+				"Unhandled Page fault in AS%d at VA 0x%016llX\n"
+				"Reason: %s\n"
+				"raw fault status: 0x%X\n"
+				"decoded fault status: %s\n"
+				"exception type 0x%X: %s\n"
+				"access type 0x%X: %s\n"
+				"source id 0x%X\n",
+				as, addr,
+				"TODO",
+				fault_status,
+				(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
+				exception_type, pancsf_exception_name(exception_type),
+				access_type, access_type_name(pfdev, fault_status),
+				source_id);
+
+			/* Ignore MMU interrupts on this AS until it's been
+			 * re-enabled.
+			 */
+			pfdev->mmu->as.faulty_mask |= mask;
+
+			/* Disable the MMU to kill jobs on this AS. */
+			pancsf_mmu_as_disable(pfdev, as);
+		}
+
+		mutex_unlock(&pfdev->mmu->as.slots_lock);
+
+		status &= ~mask;
+
+		/* If we received new MMU interrupts, process them before returning. */
+		if (!status) {
+			status = pancsf_mmu_fault_mask(pfdev, mmu_read(pfdev, MMU_INT_RAWSTAT));
+			status &= ~pfdev->mmu->as.faulty_mask;
+		}
+	}
+
+	mutex_lock(&pfdev->mmu->as.slots_lock);
+	mmu_write(pfdev, MMU_INT_MASK, pancsf_mmu_fault_mask(pfdev, ~pfdev->mmu->as.faulty_mask));
+	mutex_unlock(&pfdev->mmu->as.slots_lock);
+
+	return IRQ_HANDLED;
+};
+
+int pancsf_mmu_init(struct pancsf_device *pfdev)
+{
+	struct pancsf_mmu *mmu;
+	int ret, irq;
+
+	mmu = kzalloc(sizeof(*mmu), GFP_KERNEL);
+	if (!mmu)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&mmu->as.lru_list);
+	spin_lock_init(&mmu->as.op_lock);
+	mutex_init(&mmu->as.slots_lock);
+
+	pfdev->mmu = mmu;
+
+	irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "mmu");
+	if (irq <= 0) {
+		ret = -ENODEV;
+		goto err_free_mmu;
+	}
+
+	mmu->irq = irq;
+
+	mmu_write(pfdev, MMU_INT_CLEAR, pancsf_mmu_fault_mask(pfdev, ~0));
+	mmu_write(pfdev, MMU_INT_MASK, pancsf_mmu_fault_mask(pfdev, ~0));
+	ret = devm_request_threaded_irq(pfdev->dev, irq,
+					pancsf_mmu_irq_handler,
+					pancsf_mmu_irq_handler_thread,
+					IRQF_SHARED, KBUILD_MODNAME "-mmu",
+					pfdev);
+
+	if (ret)
+		goto err_free_mmu;
+
+	return 0;
+
+err_free_mmu:
+	mutex_destroy(&mmu->as.slots_lock);
+	kfree(mmu);
+	return ret;
+}
+
+void pancsf_mmu_fini(struct pancsf_device *pfdev)
+{
+	mmu_write(pfdev, MMU_INT_MASK, 0);
+	pfdev->mmu->as.faulty_mask = ~0;
+	synchronize_irq(pfdev->mmu->irq);
+	mutex_destroy(&pfdev->mmu->as.slots_lock);
+	kfree(pfdev->mmu);
+}
diff --git a/drivers/gpu/drm/pancsf/pancsf_mmu.h b/drivers/gpu/drm/pancsf/pancsf_mmu.h
new file mode 100644
index 000000000000..9d436d055d01
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_mmu.h
@@ -0,0 +1,51 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANCSF_MMU_H__
+#define __PANCSF_MMU_H__
+
+struct pancsf_gem_object;
+struct pancsf_vm;
+struct pancsf_vma;
+struct pancsf_mmu;
+struct pancsf_vm_bind_job;
+
+int pancsf_vm_remap_mcu_pages(struct pancsf_vm *vm,
+			      struct drm_mm_node *mm_node,
+			      struct sg_table *sgt,
+			      int prot);
+int pancsf_vm_map_mcu_pages(struct pancsf_vm *vm,
+			    struct drm_mm_node *mm_node,
+			    struct sg_table *sgt,
+			    unsigned int num_pages,
+			    u64 va_start, u64 va_end,
+			    int prot);
+void pancsf_vm_unmap_mcu_pages(struct pancsf_vm *vm,
+			       struct drm_mm_node *mm_node);
+
+int pancsf_mmu_init(struct pancsf_device *pfdev);
+void pancsf_mmu_fini(struct pancsf_device *pfdev);
+void pancsf_mmu_pre_reset(struct pancsf_device *pfdev);
+void pancsf_mmu_reset(struct pancsf_device *pfdev);
+
+int pancsf_vm_map_bo_range(struct pancsf_vm *vm, struct pancsf_gem_object *bo,
+			   u64 offset, size_t size, u64 *va, u32 flags);
+int pancsf_vm_unmap_range(struct pancsf_vm *vm, u64 va, size_t size);
+struct pancsf_gem_object *
+pancsf_vm_get_bo_for_vma(struct pancsf_vm *vm, u64 va, u64 *bo_offset);
+
+int pancsf_vm_as_get(struct pancsf_vm *vm);
+void pancsf_vm_as_put(struct pancsf_vm *vm);
+
+struct pancsf_vm *pancsf_vm_get(struct pancsf_vm *vm);
+void pancsf_vm_put(struct pancsf_vm *vm);
+struct pancsf_vm *pancsf_vm_create(struct pancsf_device *pfdev, bool for_mcu);
+
+void pancsf_vm_pool_destroy(struct pancsf_file *pfile);
+int pancsf_vm_pool_create(struct pancsf_file *pfile);
+int pancsf_vm_pool_create_vm(struct pancsf_device *pfdev, struct pancsf_vm_pool *pool);
+void pancsf_vm_pool_destroy_vm(struct pancsf_vm_pool *pool, u32 handle);
+struct pancsf_vm *pancsf_vm_pool_get_vm(struct pancsf_vm_pool *pool, u32 handle);
+
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_regs.h b/drivers/gpu/drm/pancsf/pancsf_regs.h
new file mode 100644
index 000000000000..aaf1ef6b51da
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_regs.h
@@ -0,0 +1,225 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+/*
+ * Register definitions based on mali_kbase_gpu_regmap.h and
+ * mali_kbase_gpu_regmap_csf.h
+ * (C) COPYRIGHT 2010-2022 ARM Limited. All rights reserved.
+ */
+#ifndef __PANCSF_REGS_H__
+#define __PANCSF_REGS_H__
+
+#define GPU_ID						0x00
+#define GPU_L2_FEATURES					0x004
+#define GPU_TILER_FEATURES				0x00C
+#define GPU_MEM_FEATURES				0x010
+#define   GROUPS_L2_COHERENT				BIT(0)
+
+#define GPU_MMU_FEATURES				0x014
+#define  GPU_MMU_FEATURES_VA_BITS(x)			((x) & GENMASK(7, 0))
+#define  GPU_MMU_FEATURES_PA_BITS(x)			(((x) >> 8) & GENMASK(7, 0))
+#define GPU_AS_PRESENT					0x018
+#define GPU_CSF_ID					0x01C
+
+#define GPU_INT_RAWSTAT					0x20
+#define GPU_INT_CLEAR					0x24
+#define GPU_INT_MASK					0x28
+#define GPU_INT_STAT					0x2c
+#define   GPU_IRQ_FAULT					BIT(0)
+#define   GPU_IRQ_PROTM_FAULT				BIT(1)
+#define   GPU_IRQ_RESET_COMPLETED			BIT(8)
+#define   GPU_IRQ_POWER_CHANGED				BIT(9)
+#define   GPU_IRQ_POWER_CHANGED_ALL			BIT(10)
+#define   GPU_IRQ_CLEAN_CACHES_COMPLETED		BIT(17)
+#define   GPU_IRQ_DOORBELL_MIRROR			BIT(18)
+#define   GPU_IRQ_MCU_STATUS_CHANGED			BIT(19)
+#define GPU_CMD						0x30
+#define   GPU_CMD_DEF(type, payload)			((type) | ((payload) << 8))
+#define   GPU_SOFT_RESET				GPU_CMD_DEF(1, 1)
+#define   GPU_HARD_RESET				GPU_CMD_DEF(1, 2)
+#define   CACHE_CLEAN					BIT(0)
+#define   CACHE_INV					BIT(1)
+#define   GPU_FLUSH_CACHES(l2, lsc, oth)		\
+	  GPU_CMD_DEF(4, ((l2) << 0) | ((lsc) << 4) | ((oth) << 8))
+
+#define GPU_STATUS					0x34
+#define   GPU_STATUS_ACTIVE				BIT(0)
+#define   GPU_STATUS_PWR_ACTIVE				BIT(1)
+#define   GPU_STATUS_PAGE_FAULT				BIT(4)
+#define   GPU_STATUS_PROTM_ACTIVE			BIT(7)
+#define   GPU_STATUS_DBG_ENABLED			BIT(8)
+
+#define GPU_FAULT_STATUS				0x3C
+#define GPU_FAULT_ADDR_LO				0x40
+#define GPU_FAULT_ADDR_HI				0x44
+
+#define GPU_PWR_KEY					0x50
+#define  GPU_PWR_KEY_UNLOCK				0x2968A819
+#define GPU_PWR_OVERRIDE0				0x54
+#define GPU_PWR_OVERRIDE1				0x58
+
+#define GPU_TIMESTAMP_OFFSET_LO				0x88
+#define GPU_TIMESTAMP_OFFSET_HI				0x8C
+#define GPU_CYCLE_COUNT_LO				0x90
+#define GPU_CYCLE_COUNT_HI				0x94
+#define GPU_TIMESTAMP_LO				0x98
+#define GPU_TIMESTAMP_HI				0x9C
+
+#define GPU_THREAD_MAX_THREADS				0xA0
+#define GPU_THREAD_MAX_WORKGROUP_SIZE			0xA4
+#define GPU_THREAD_MAX_BARRIER_SIZE			0xA8
+#define GPU_THREAD_FEATURES				0xAC
+
+#define GPU_TEXTURE_FEATURES(n)				(0xB0 + ((n) * 4))
+
+#define GPU_SHADER_PRESENT_LO				0x100
+#define GPU_SHADER_PRESENT_HI				0x104
+#define GPU_TILER_PRESENT_LO				0x110
+#define GPU_TILER_PRESENT_HI				0x114
+#define GPU_L2_PRESENT_LO				0x120
+#define GPU_L2_PRESENT_HI				0x124
+
+#define SHADER_READY_LO					0x140
+#define SHADER_READY_HI					0x144
+#define TILER_READY_LO					0x150
+#define TILER_READY_HI					0x154
+#define L2_READY_LO					0x160
+#define L2_READY_HI					0x164
+
+#define SHADER_PWRON_LO					0x180
+#define SHADER_PWRON_HI					0x184
+#define TILER_PWRON_LO					0x190
+#define TILER_PWRON_HI					0x194
+#define L2_PWRON_LO					0x1A0
+#define L2_PWRON_HI					0x1A4
+
+#define SHADER_PWROFF_LO				0x1C0
+#define SHADER_PWROFF_HI				0x1C4
+#define TILER_PWROFF_LO					0x1D0
+#define TILER_PWROFF_HI					0x1D4
+#define L2_PWROFF_LO					0x1E0
+#define L2_PWROFF_HI					0x1E4
+
+#define SHADER_PWRTRANS_LO				0x200
+#define SHADER_PWRTRANS_HI				0x204
+#define TILER_PWRTRANS_LO				0x210
+#define TILER_PWRTRANS_HI				0x214
+#define L2_PWRTRANS_LO					0x220
+#define L2_PWRTRANS_HI					0x224
+
+#define SHADER_PWRACTIVE_LO				0x240
+#define SHADER_PWRACTIVE_HI				0x244
+#define TILER_PWRACTIVE_LO				0x250
+#define TILER_PWRACTIVE_HI				0x254
+#define L2_PWRACTIVE_LO					0x260
+#define L2_PWRACTIVE_HI					0x264
+
+#define GPU_REVID					0x280
+
+#define GPU_COHERENCY_FEATURES				0x300
+#define GPU_COHERENCY_PROT_BIT(name)			BIT(GPU_COHERENCY_  ## name)
+
+#define GPU_COHERENCY_PROTOCOL				0x304
+#define   GPU_COHERENCY_ACE				0
+#define   GPU_COHERENCY_ACE_LITE			1
+#define   GPU_COHERENCY_NONE				31
+
+#define MCU_CONTROL					0x700
+#define MCU_CONTROL_ENABLE				1
+#define MCU_CONTROL_AUTO				2
+#define MCU_CONTROL_DISABLE				0
+
+#define MCU_STATUS					0x704
+#define MCU_STATUS_DISABLED				0
+#define MCU_STATUS_ENABLED				1
+#define MCU_STATUS_HALT					2
+#define MCU_STATUS_FATAL				3
+
+/* Job Control regs */
+#define JOB_INT_RAWSTAT					0x1000
+#define JOB_INT_CLEAR					0x1004
+#define JOB_INT_MASK					0x1008
+#define JOB_INT_STAT					0x100c
+#define   JOB_INT_GLOBAL_IF				BIT(31)
+#define   JOB_INT_CSG_IF(x)				BIT(x)
+
+/* MMU regs */
+#define MMU_INT_RAWSTAT					0x2000
+#define MMU_INT_CLEAR					0x2004
+#define MMU_INT_MASK					0x2008
+#define MMU_INT_STAT					0x200c
+
+/* AS_COMMAND register commands */
+
+#define MMU_BASE					0x2400
+#define MMU_AS_SHIFT					6
+#define MMU_AS(as)					(MMU_BASE + ((as) << MMU_AS_SHIFT))
+
+#define AS_TRANSTAB_LO(as)				(MMU_AS(as) + 0x00)
+#define AS_TRANSTAB_HI(as)				(MMU_AS(as) + 0x04)
+#define AS_MEMATTR_LO(as)				(MMU_AS(as) + 0x08)
+#define AS_MEMATTR_HI(as)				(MMU_AS(as) + 0x0C)
+#define   AS_MEMATTR_AARCH64_INNER_ALLOC_IMPL		(2 << 2)
+#define   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(w, r)	((3 << 2) | \
+							 ((w) ? BIT(0) : 0) | \
+							 ((r) ? BIT(1) : 0))
+#define   AS_MEMATTR_AARCH64_SH_MIDGARD_INNER		(0 << 4)
+#define   AS_MEMATTR_AARCH64_SH_CPU_INNER		(1 << 4)
+#define   AS_MEMATTR_AARCH64_SH_CPU_INNER_SHADER_COH	(2 << 4)
+#define   AS_MEMATTR_AARCH64_SHARED			(0 << 6)
+#define   AS_MEMATTR_AARCH64_INNER_OUTER_NC		(1 << 6)
+#define   AS_MEMATTR_AARCH64_INNER_OUTER_WB		(2 << 6)
+#define   AS_MEMATTR_AARCH64_FAULT			(3 << 6)
+#define AS_LOCKADDR_LO(as)				(MMU_AS(as) + 0x10)
+#define AS_LOCKADDR_HI(as)				(MMU_AS(as) + 0x14)
+#define AS_COMMAND(as)					(MMU_AS(as) + 0x18)
+#define   AS_COMMAND_NOP				0
+#define   AS_COMMAND_UPDATE				1
+#define   AS_COMMAND_LOCK				2
+#define   AS_COMMAND_UNLOCK				3
+#define   AS_COMMAND_FLUSH_PT				4
+#define   AS_COMMAND_FLUSH_MEM				5
+#define   AS_LOCK_REGION_MIN_SIZE			(1ULL << 15)
+#define AS_FAULTSTATUS(as)				(MMU_AS(as) + 0x1C)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_MASK		(0x3 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_ATOMIC		(0x0 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_EX			(0x1 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_READ		(0x2 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_WRITE		(0x3 << 8)
+#define AS_FAULTADDRESS_LO(as)				(MMU_AS(as) + 0x20)
+#define AS_FAULTADDRESS_HI(as)				(MMU_AS(as) + 0x24)
+#define AS_STATUS(as)					(MMU_AS(as) + 0x28)
+#define   AS_STATUS_AS_ACTIVE				BIT(0)
+#define AS_TRANSCFG_LO(as)				(MMU_AS(as) + 0x30)
+#define AS_TRANSCFG_HI(as)				(MMU_AS(as) + 0x34)
+#define   AS_TRANSCFG_ADRMODE_LEGACY			(0 << 0)
+#define   AS_TRANSCFG_ADRMODE_UNMAPPED			(1 << 0)
+#define   AS_TRANSCFG_ADRMODE_IDENTITY			(2 << 0)
+#define   AS_TRANSCFG_ADRMODE_AARCH64_4K		(6 << 0)
+#define   AS_TRANSCFG_ADRMODE_AARCH64_64K		(8 << 0)
+#define   AS_TRANSCFG_INA_BITS(x)			((x) << 6)
+#define   AS_TRANSCFG_OUTA_BITS(x)			((x) << 14)
+#define   AS_TRANSCFG_SL_CONCAT				BIT(22)
+#define   AS_TRANSCFG_PTW_MEMATTR_NC			(1 << 24)
+#define   AS_TRANSCFG_PTW_MEMATTR_WB			(2 << 24)
+#define   AS_TRANSCFG_PTW_SH_NS				(0 << 28)
+#define   AS_TRANSCFG_PTW_SH_OS				(2 << 28)
+#define   AS_TRANSCFG_PTW_SH_IS				(3 << 28)
+#define   AS_TRANSCFG_PTW_RA				BIT(30)
+#define   AS_TRANSCFG_DISABLE_HIER_AP			BIT(33)
+#define   AS_TRANSCFG_DISABLE_AF_FAULT			BIT(34)
+#define   AS_TRANSCFG_WXN				BIT(35)
+#define   AS_TRANSCFG_XREADABLE				BIT(36)
+#define AS_FAULTEXTRA_LO(as)				(MMU_AS(as) + 0x38)
+#define AS_FAULTEXTRA_HI(as)				(MMU_AS(as) + 0x3C)
+
+#define CSF_GPU_LATEST_FLUSH_ID				0x10000
+
+#define CSF_DOORBELL(i)					(0x80000 + ((i) * 0x10000))
+#define CSF_GLB_DOORBELL_ID				0
+
+#define gpu_write(dev, reg, data) writel(data, (dev)->iomem + (reg))
+#define gpu_read(dev, reg) readl((dev)->iomem + (reg))
+
+#endif
diff --git a/drivers/gpu/drm/pancsf/pancsf_sched.c b/drivers/gpu/drm/pancsf/pancsf_sched.c
new file mode 100644
index 000000000000..160e6fb326b4
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_sched.c
@@ -0,0 +1,2837 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright 2023 Collabora ltd. */
+
+#ifdef CONFIG_ARM_ARCH_TIMER
+#include <asm/arch_timer.h>
+#endif
+
+#include <drm/pancsf_drm.h>
+#include <drm/drm_gem_shmem_helper.h>
+
+#include <linux/build_bug.h>
+#include <linux/clk.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/firmware.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/iosys-map.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
+#include <linux/dma-resv.h>
+
+#include "pancsf_sched.h"
+#include "pancsf_device.h"
+#include "pancsf_gem.h"
+#include "pancsf_heap.h"
+#include "pancsf_regs.h"
+#include "pancsf_gpu.h"
+#include "pancsf_mcu.h"
+#include "pancsf_mmu.h"
+
+#define PANCSF_CS_FW_NAME "mali_csffw.bin"
+
+#define CSF_JOB_TIMEOUT_MS			5000
+
+#define MIN_CS_PER_CSG				8
+
+#define MIN_CSGS				3
+#define MAX_CSG_PRIO				0xf
+
+struct pancsf_group;
+
+struct pancsf_csg_slot {
+	struct pancsf_group *group;
+	u32 pending_reqs;
+	wait_queue_head_t reqs_acked;
+	u8 priority;
+	bool idle;
+};
+
+enum pancsf_csg_priority {
+	PANCSF_CSG_PRIORITY_LOW = 0,
+	PANCSF_CSG_PRIORITY_MEDIUM,
+	PANCSF_CSG_PRIORITY_HIGH,
+	PANCSF_CSG_PRIORITY_RT,
+	PANCSF_CSG_PRIORITY_COUNT,
+};
+
+struct pancsf_scheduler {
+	struct pancsf_device *pfdev;
+	struct workqueue_struct *wq;
+	struct delayed_work tick_work;
+	struct delayed_work ping_work;
+	struct work_struct sync_upd_work;
+	struct work_struct reset_work;
+	bool reset_pending;
+	u64 resched_target;
+	u64 last_tick;
+	u64 tick_period;
+
+	struct mutex lock;
+	struct list_head run_queues[PANCSF_CSG_PRIORITY_COUNT];
+	struct list_head idle_queues[PANCSF_CSG_PRIORITY_COUNT];
+	struct list_head wait_queue;
+
+	struct pancsf_csg_slot csg_slots[MAX_CSGS];
+	u32 csg_slot_count;
+	u32 cs_slot_count;
+	u32 as_slot_count;
+	u32 used_csg_slot_count;
+	u32 sb_slot_count;
+	bool might_have_idle_groups;
+
+	u32 pending_reqs;
+	wait_queue_head_t reqs_acked;
+};
+
+struct pancsf_syncobj_32b {
+	u32 seqno;
+	u32 status;
+};
+
+struct pancsf_syncobj_64b {
+	u64 seqno;
+	u32 status;
+	u32 pad;
+};
+
+#define PANCSF_CS_QUEUE_FENCE_CTX_NAME_PREFIX "pancsf-csqf-ctx-"
+#define PANCSF_CS_QUEUE_FENCE_CTX_NAME_LEN (sizeof(PANCSF_CS_QUEUE_FENCE_CTX_NAME_PREFIX) + 16)
+
+struct pancsf_queue_fence_ctx {
+	struct kref refcount;
+	char name[PANCSF_CS_QUEUE_FENCE_CTX_NAME_LEN];
+	spinlock_t lock;
+	u64 id;
+	atomic64_t seqno;
+};
+
+struct pancsf_queue_fence {
+	struct dma_fence base;
+	struct pancsf_queue_fence_ctx *ctx;
+	u64 seqno;
+};
+
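+/* A queue backs a single hardware command stream within a group: a ring
+ * buffer of command stream instructions, the FW input/output interface pages
+ * for that stream, the lists of pending and in-flight jobs, and, when the
+ * stream is blocked, the synchronization object it is waiting on.
+ */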
+struct pancsf_queue {
+	u8 priority;
+	struct {
+		struct pancsf_gem_object *bo;
+		u64 gpu_va;
+		u64 *kmap;
+	} ringbuf;
+
+	struct {
+		struct pancsf_fw_mem *mem;
+		struct pancsf_ringbuf_input_iface *input;
+		struct pancsf_ringbuf_output_iface *output;
+	} iface;
+
+	struct {
+		spinlock_t lock;
+		struct list_head in_flight;
+		struct list_head pending;
+	} jobs;
+
+	struct {
+		u64 gpu_va;
+		u64 ref;
+		bool gt;
+		bool sync64;
+		struct pancsf_gem_object *bo;
+		u64 offset;
+		void *kmap;
+	} syncwait;
+
+	struct pancsf_syncobj_64b *syncobj;
+	struct pancsf_queue_fence_ctx *fence_ctx;
+};
+
+struct pancsf_file_ctx;
+
+enum pancsf_group_state {
+	PANCSF_CS_GROUP_CREATED,
+	PANCSF_CS_GROUP_ACTIVE,
+	PANCSF_CS_GROUP_SUSPENDED,
+	PANCSF_CS_GROUP_TERMINATED,
+};
+
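+/* A group is the scheduling unit exposed to userspace: a set of up to
+ * MAX_CS_PER_CSG queues sharing a VM, core masks and a priority. It is bound
+ * to a CSG slot (and an AS slot) while it is resident on the FW scheduler.
+ */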
+struct pancsf_group {
+	struct kref refcount;
+	struct pancsf_device *pfdev;
+	struct pancsf_heap_pool *heaps;
+	struct pancsf_vm *vm;
+	u64 compute_core_mask;
+	u64 fragment_core_mask;
+	u64 tiler_core_mask;
+	u8 max_compute_cores;
+	u8 max_fragment_cores;
+	u8 max_tiler_cores;
+	u8 priority;
+	u32 blocked_streams;
+	u32 idle_streams;
+	spinlock_t fatal_lock;
+	u32 fatal_streams;
+	u32 stream_count;
+	struct pancsf_queue *streams[MAX_CS_PER_CSG];
+	int as_id;
+	int csg_id;
+	bool destroyed;
+	bool timedout;
+	bool in_tick_ctx;
+
+	struct {
+		struct pancsf_gem_object *bo;
+		u64 gpu_va;
+		void *kmap;
+	} syncobjs;
+
+	enum pancsf_group_state state;
+	struct pancsf_fw_mem *suspend_buf;
+	struct pancsf_fw_mem *protm_suspend_buf;
+
+	struct work_struct sync_upd_work;
+	struct work_struct job_pending_work;
+	struct work_struct term_work;
+
+	struct list_head run_node;
+	struct list_head wait_node;
+};
+
+#define MAX_GROUPS_PER_POOL MAX_CSGS
+
+struct pancsf_group_pool {
+	struct mutex lock;
+	struct xarray xa;
+};
+
+struct pancsf_job {
+	struct kref refcount;
+	struct pancsf_group *group;
+	u32 stream_idx;
+
+	struct pancsf_call_info call_info;
+	struct {
+		u64 start, end;
+	} ringbuf;
+	struct delayed_work timeout_work;
+	struct dma_fence_cb dep_cb;
+	struct dma_fence *cur_dep;
+	struct xarray dependencies;
+	unsigned long last_dependency;
+	struct list_head node;
+	struct dma_fence *done_fence;
+};
+
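+/* Convert a timeout expressed in microseconds into the GLB timer encoding:
+ * a counter value in units of 1024 timer cycles, plus a flag selecting the
+ * GPU cycle counter when no architected timer frequency is available.
+ */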
+static u32 pancsf_conv_timeout(struct pancsf_device *pfdev, u32 timeout_us)
+{
+	bool use_cycle_counter = false;
+	u32 timer_rate = 0;
+	u64 cycles;
+
+#ifdef CONFIG_ARM_ARCH_TIMER
+	timer_rate = arch_timer_get_cntfrq();
+#endif
+
+	if (!timer_rate) {
+		use_cycle_counter = true;
+		timer_rate = clk_get_rate(pfdev->clock);
+	}
+
+	if (WARN_ON(!timer_rate)) {
+		/* We couldn't get a valid clock rate, let's just pick the
+		 * maximum value so the FW still handles the core
+		 * power on/off requests.
+		 */
+		return GLB_TIMER_VAL(0x7fffffff) |
+		       GLB_TIMER_SOURCE_GPU_COUNTER;
+	}
+
+	cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate, 1000000);
+	return GLB_TIMER_VAL(cycles >> 10) |
+	       (use_cycle_counter ? GLB_TIMER_SOURCE_GPU_COUNTER : 0);
+}
+
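+/* One-time configuration of the global FW interface: core mask, power-off and
+ * progress timers, idle detection, and the interrupt sources we want to be
+ * notified about, followed by a doorbell ring to apply the new config.
+ */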
+static void pancsf_global_init(struct pancsf_device *pfdev)
+{
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	u32 req_mask = GLB_CFG_ALLOC_EN |
+		       GLB_CFG_POWEROFF_TIMER |
+		       GLB_CFG_PROGRESS_TIMER |
+		       GLB_IDLE_EN;
+	u32 req_val;
+
+	/* Enable all cores. */
+	glb_iface->input->core_en_mask = pfdev->gpu_info.shader_present;
+
+	/* 800us power off hysteresis. */
+	glb_iface->input->poweroff_timer = pancsf_conv_timeout(pfdev, 800);
+
+	/* Progress timeout set to 2500 * 1024 cycles. */
+	glb_iface->input->progress_timer = 2500;
+
+	/* 10ms idle hysteresis */
+	glb_iface->input->idle_timer = pancsf_conv_timeout(pfdev, 10000);
+
+	req_val = pancsf_toggle_reqs(glb_iface->input->req,
+				     glb_iface->output->ack,
+				     GLB_CFG_ALLOC_EN |
+				     GLB_CFG_POWEROFF_TIMER |
+				     GLB_CFG_PROGRESS_TIMER) |
+		  GLB_IDLE_EN;
+
+	/* Update the request reg */
+	glb_iface->input->req = pancsf_update_reqs(glb_iface->input->req,
+						   req_val | GLB_IDLE_EN,
+						   req_mask);
+
+	/* Enable interrupts we care about. */
+	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
+					 GLB_PING |
+					 GLB_CFG_PROGRESS_TIMER |
+					 GLB_CFG_POWEROFF_TIMER |
+					 GLB_IDLE_EN |
+					 GLB_IDLE;
+
+	gpu_write(pfdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+
+	/* Kick the FW watchdog. */
+	mod_delayed_work(sched->wq,
+			 &sched->ping_work,
+			 msecs_to_jiffies(12000));
+}
+
+static void
+pancsf_queue_release_syncwait_obj(struct pancsf_group *group,
+				  struct pancsf_queue *stream)
+{
+	pancsf_gem_unmap_and_put(group->vm, stream->syncwait.bo,
+				 stream->syncwait.gpu_va, stream->syncwait.kmap);
+}
+
+static void pancsf_queue_fence_ctx_release(struct kref *kref)
+{
+	struct pancsf_queue_fence_ctx *ctx = container_of(kref,
+							  struct pancsf_queue_fence_ctx,
+							  refcount);
+
+	kfree(ctx);
+	module_put(THIS_MODULE);
+}
+
+static void pancsf_free_queue(struct pancsf_group *group,
+			      struct pancsf_queue *stream)
+{
+	if (IS_ERR_OR_NULL(stream))
+		return;
+
+	if (stream->syncwait.bo)
+		pancsf_queue_release_syncwait_obj(group, stream);
+
+	if (stream->fence_ctx)
+		kref_put(&stream->fence_ctx->refcount, pancsf_queue_fence_ctx_release);
+
+	if (!IS_ERR_OR_NULL(stream->ringbuf.bo)) {
+		pancsf_gem_unmap_and_put(group->vm, stream->ringbuf.bo,
+					 stream->ringbuf.gpu_va, stream->ringbuf.kmap);
+	}
+
+	pancsf_fw_mem_free(group->pfdev, stream->iface.mem);
+}
+
+static void pancsf_release_group(struct kref *kref)
+{
+	struct pancsf_group *group = container_of(kref,
+						     struct pancsf_group,
+						     refcount);
+	u32 i;
+
+	WARN_ON(group->csg_id >= 0);
+	WARN_ON(!list_empty(&group->run_node));
+	WARN_ON(!list_empty(&group->wait_node));
+
+	for (i = 0; i < group->stream_count; i++)
+		pancsf_free_queue(group, group->streams[i]);
+
+	if (group->suspend_buf)
+		pancsf_fw_mem_free(group->pfdev, group->suspend_buf);
+
+	if (group->protm_suspend_buf)
+		pancsf_fw_mem_free(group->pfdev, group->protm_suspend_buf);
+
+	if (!IS_ERR_OR_NULL(group->syncobjs.bo)) {
+		pancsf_gem_unmap_and_put(group->vm, group->syncobjs.bo,
+					 group->syncobjs.gpu_va, group->syncobjs.kmap);
+	}
+
+	if (group->vm)
+		pancsf_vm_put(group->vm);
+
+	kfree(group);
+}
+
+static void pancsf_group_put(struct pancsf_group *group)
+{
+	if (group)
+		kref_put(&group->refcount, pancsf_release_group);
+}
+
+static struct pancsf_group *
+pancsf_group_get(struct pancsf_group *group)
+{
+	if (group)
+		kref_get(&group->refcount);
+
+	return group;
+}
+
+static int
+pancsf_bind_group_locked(struct pancsf_group *group,
+			 u32 csg_id)
+{
+	struct pancsf_device *pfdev = group->pfdev;
+	struct pancsf_csg_slot *csg_slot;
+
+	if (WARN_ON(group->csg_id != -1 || csg_id >= MAX_CSGS ||
+		    pfdev->scheduler->csg_slots[csg_id].group))
+		return -EINVAL;
+
+	csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	pancsf_group_get(group);
+	group->csg_id = csg_id;
+	group->as_id = pancsf_vm_as_get(group->vm);
+	csg_slot->group = group;
+
+	return 0;
+}
+
+static int
+pancsf_unbind_group_locked(struct pancsf_group *group)
+{
+	struct pancsf_device *pfdev = group->pfdev;
+	struct pancsf_csg_slot *slot;
+
+	if (WARN_ON(group->csg_id < 0 || group->csg_id >= MAX_CSGS))
+		return -EINVAL;
+
+	if (WARN_ON(group->state == PANCSF_CS_GROUP_ACTIVE))
+		return -EINVAL;
+
+	slot = &pfdev->scheduler->csg_slots[group->csg_id];
+	pancsf_vm_as_put(group->vm);
+	group->as_id = -1;
+	group->csg_id = -1;
+	slot->group = NULL;
+
+	pancsf_group_put(group);
+	return 0;
+}
+
+static int
+pancsf_prog_stream_locked(struct pancsf_group *group, u32 stream_idx)
+{
+	const struct pancsf_fw_cs_iface *cs_iface;
+	struct pancsf_queue *stream;
+
+	if (stream_idx >= group->stream_count)
+		return -EINVAL;
+
+	stream = group->streams[stream_idx];
+	cs_iface = pancsf_get_cs_iface(group->pfdev, group->csg_id, stream_idx);
+	stream->iface.input->extract = stream->iface.output->extract;
+	WARN_ON(stream->iface.input->insert < stream->iface.input->extract);
+
+	cs_iface->input->ringbuf_base = stream->ringbuf.gpu_va;
+	cs_iface->input->ringbuf_size = stream->ringbuf.bo->base.base.size;
+	cs_iface->input->ringbuf_input = pancsf_fw_mem_va(stream->iface.mem);
+	cs_iface->input->ringbuf_output = pancsf_fw_mem_va(stream->iface.mem) + PAGE_SIZE;
+	cs_iface->input->config = CS_CONFIG_PRIORITY(stream->priority) |
+				  CS_CONFIG_DOORBELL(group->csg_id + 1);
+	cs_iface->input->ack_irq_mask = ~0;
+	cs_iface->input->req = pancsf_update_reqs(cs_iface->input->req,
+						  CS_IDLE_SYNC_WAIT |
+						  CS_IDLE_EMPTY |
+						  CS_STATE_START |
+						  CS_EXTRACT_EVENT,
+						  CS_IDLE_SYNC_WAIT |
+						  CS_IDLE_EMPTY |
+						  CS_STATE_MASK |
+						  CS_EXTRACT_EVENT);
+	return 0;
+}
+
+static int
+pancsf_reset_cs_slot_locked(struct pancsf_group *group,
+			    u32 stream_idx)
+{
+	const struct pancsf_fw_cs_iface *cs_iface;
+
+	if (stream_idx >= group->stream_count)
+		return -EINVAL;
+
+	cs_iface = pancsf_get_cs_iface(group->pfdev, group->csg_id, stream_idx);
+	cs_iface->input->req = pancsf_update_reqs(cs_iface->input->req,
+						  CS_STATE_STOP,
+						  CS_STATE_MASK);
+	return 0;
+}
+
+static void
+pancsf_sync_csg_slot_priority_locked(struct pancsf_device *pfdev,
+				     u32 csg_id)
+{
+	struct pancsf_csg_slot *csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	const struct pancsf_fw_csg_iface *csg_iface;
+
+	csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+	csg_slot->priority = (csg_iface->input->endpoint_req & CSG_EP_REQ_PRIORITY_MASK) >> 28;
+}
+
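+/* Sync the SW view of a queue with the state reported by the FW: mark the
+ * queue idle when its ring buffer has been fully consumed, or record the
+ * SYNC_WAIT condition and add the group to the wait queue when the queue is
+ * blocked on a synchronization object.
+ */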
+static void
+pancsf_sync_queue_state_locked(struct pancsf_group *group, u32 cs_id)
+{
+	struct pancsf_queue *stream = group->streams[cs_id];
+	struct pancsf_fw_cs_iface *cs_iface;
+	u32 status_wait_cond;
+
+	cs_iface = pancsf_get_cs_iface(group->pfdev, group->csg_id, cs_id);
+
+	switch (cs_iface->output->status_blocked_reason) {
+	case CS_STATUS_BLOCKED_REASON_UNBLOCKED:
+		if (stream->iface.input->insert == stream->iface.output->extract &&
+		    cs_iface->output->status_scoreboards == 0)
+			group->idle_streams |= BIT(cs_id);
+		break;
+
+	case CS_STATUS_BLOCKED_REASON_SYNC_WAIT:
+		WARN_ON(!list_empty(&group->wait_node));
+		list_move_tail(&group->wait_node, &group->pfdev->scheduler->wait_queue);
+		group->blocked_streams |= BIT(cs_id);
+		stream->syncwait.gpu_va = cs_iface->output->status_wait_sync_ptr;
+		stream->syncwait.ref = cs_iface->output->status_wait_sync_value;
+		status_wait_cond = cs_iface->output->status_wait & CS_STATUS_WAIT_SYNC_COND_MASK;
+		stream->syncwait.gt = status_wait_cond == CS_STATUS_WAIT_SYNC_COND_GT;
+		if (cs_iface->output->status_wait & CS_STATUS_WAIT_SYNC_64B) {
+			u64 sync_val_hi = cs_iface->output->status_wait_sync_value_hi;
+
+			stream->syncwait.sync64 = true;
+			stream->syncwait.ref |= sync_val_hi << 32;
+		} else {
+			stream->syncwait.sync64 = false;
+		}
+		break;
+
+	default:
+		/* Other reasons are not blocking. Consider the stream as runnable
+		 * in those cases.
+		 */
+		break;
+	}
+}
+
+static void
+pancsf_sync_csg_slot_streams_state_locked(struct pancsf_device *pfdev,
+					  u32 csg_id)
+{
+	struct pancsf_csg_slot *csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	struct pancsf_group *group = csg_slot->group;
+	u32 i;
+
+	group->idle_streams = 0;
+	group->blocked_streams = 0;
+
+	for (i = 0; i < group->stream_count; i++) {
+		if (group->streams[i])
+			pancsf_sync_queue_state_locked(group, i);
+	}
+}
+
+static void
+pancsf_sync_csg_slot_state_locked(struct pancsf_device *pfdev, u32 csg_id)
+{
+	struct pancsf_csg_slot *csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	const struct pancsf_fw_csg_iface *csg_iface;
+	struct pancsf_group *group;
+	enum pancsf_group_state new_state, old_state;
+
+	csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+	group = csg_slot->group;
+
+	if (!group)
+		return;
+
+	old_state = group->state;
+	switch (csg_iface->output->ack & CSG_STATE_MASK) {
+	case CSG_STATE_START:
+	case CSG_STATE_RESUME:
+		new_state = PANCSF_CS_GROUP_ACTIVE;
+		break;
+	case CSG_STATE_TERMINATE:
+		new_state = PANCSF_CS_GROUP_TERMINATED;
+		break;
+	case CSG_STATE_SUSPEND:
+		new_state = PANCSF_CS_GROUP_SUSPENDED;
+		break;
+	}
+
+	if (old_state == new_state)
+		return;
+
+	if (new_state == PANCSF_CS_GROUP_SUSPENDED)
+		pancsf_sync_csg_slot_streams_state_locked(pfdev, csg_id);
+
+	if (old_state == PANCSF_CS_GROUP_ACTIVE) {
+		u32 i;
+
+		/* Reset the stream slots so we start from a clean
+		 * state when starting/resuming a new group on this
+		 * CSG slot. No need to wait or ring the doorbell
+		 * here, since the CS slots will only be re-used on
+		 * the next CSG start operation.
+		 */
+		for (i = 0; i < group->stream_count; i++) {
+			if (group->streams[i])
+				pancsf_reset_cs_slot_locked(group, i);
+		}
+	}
+
+	group->state = new_state;
+}
+
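+/* Program a CSG slot for the group bound to it: per-queue CS slots, core
+ * masks, endpoint requests, suspend buffers and doorbell requests.
+ */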
+static int
+pancsf_prog_csg_slot_locked(struct pancsf_device *pfdev, u32 csg_id, u32 priority)
+{
+	const struct pancsf_fw_global_iface *glb_iface;
+	const struct pancsf_fw_csg_iface *csg_iface;
+	struct pancsf_csg_slot *csg_slot;
+	struct pancsf_group *group;
+	u32 stream_mask = 0, i;
+
+	if (priority > MAX_CSG_PRIO)
+		return -EINVAL;
+
+	if (WARN_ON(csg_id >= MAX_CSGS))
+		return -EINVAL;
+
+	csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	group = csg_slot->group;
+	if (!group || group->state == PANCSF_CS_GROUP_ACTIVE)
+		return 0;
+
+	glb_iface = pancsf_get_glb_iface(group->pfdev);
+	csg_iface = pancsf_get_csg_iface(group->pfdev, group->csg_id);
+
+	for (i = 0; i < group->stream_count; i++) {
+		if (group->streams[i]) {
+			pancsf_prog_stream_locked(group, i);
+			stream_mask |= BIT(i);
+		}
+	}
+
+	csg_iface->input->allow_compute = group->compute_core_mask;
+	csg_iface->input->allow_fragment = group->fragment_core_mask;
+	csg_iface->input->allow_other = group->tiler_core_mask;
+	csg_iface->input->endpoint_req = CSG_EP_REQ_COMPUTE(group->max_compute_cores) |
+					 CSG_EP_REQ_FRAGMENT(group->max_fragment_cores) |
+					 CSG_EP_REQ_TILER(group->max_tiler_cores) |
+					 CSG_EP_REQ_PRIORITY(csg_slot->priority);
+	csg_iface->input->config = group->as_id;
+
+	if (group->suspend_buf)
+		csg_iface->input->suspend_buf = pancsf_fw_mem_va(group->suspend_buf);
+	else
+		csg_iface->input->suspend_buf = 0;
+
+	if (group->protm_suspend_buf)
+		csg_iface->input->protm_suspend_buf = pancsf_fw_mem_va(group->protm_suspend_buf);
+	else
+		csg_iface->input->protm_suspend_buf = 0;
+
+	csg_iface->input->ack_irq_mask = ~0;
+	csg_iface->input->doorbell_req = pancsf_toggle_reqs(csg_iface->input->doorbell_req,
+							    csg_iface->output->doorbell_ack,
+							    stream_mask);
+	return 0;
+}
+
+static void pancsf_handle_cs_fatal(struct pancsf_device *pfdev,
+				   unsigned int csg_id, unsigned int cs_id)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct pancsf_group *group = csg_slot->group;
+	const struct pancsf_fw_cs_iface *cs_iface;
+	u32 fatal;
+	u64 info;
+
+	cs_iface = pancsf_get_cs_iface(pfdev, csg_id, cs_id);
+	fatal = cs_iface->output->fatal;
+	info = cs_iface->output->fatal_info;
+	group->fatal_streams |= BIT(cs_id);
+	mod_delayed_work(sched->wq, &sched->tick_work, 0);
+	dev_warn(pfdev->dev,
+		 "CSG slot %d CS slot: %d\n"
+		 "CS_FATAL.EXCEPTION_TYPE: 0x%x (%s)\n"
+		 "CS_FATAL.EXCEPTION_DATA: 0x%x\n"
+		 "CS_FATAL_INFO.EXCEPTION_DATA: 0x%llx\n",
+		 csg_id, cs_id,
+		 (unsigned int)CS_EXCEPTION_TYPE(fatal),
+		 pancsf_exception_name(CS_EXCEPTION_TYPE(fatal)),
+		 (unsigned int)CS_EXCEPTION_DATA(fatal),
+		 info);
+}
+
+static void pancsf_handle_cs_fault(struct pancsf_device *pfdev,
+				   unsigned int csg_id, unsigned int cs_id)
+{
+	const struct pancsf_fw_cs_iface *cs_iface;
+	u32 fault;
+	u64 info;
+
+	cs_iface = pancsf_get_cs_iface(pfdev, csg_id, cs_id);
+	fault = cs_iface->output->fault;
+	info = cs_iface->output->fault_info;
+
+	dev_warn(pfdev->dev,
+		 "CSG slot %d CS slot: %d\n"
+		 "CS_FAULT.EXCEPTION_TYPE: 0x%x (%s)\n"
+		 "CS_FAULT.EXCEPTION_DATA: 0x%x\n"
+		 "CS_FAULT_INFO.EXCEPTION_DATA: 0x%llx\n",
+		 csg_id, cs_id,
+		 (unsigned int)CS_EXCEPTION_TYPE(fault),
+		 pancsf_exception_name(CS_EXCEPTION_TYPE(fault)),
+		 (unsigned int)CS_EXCEPTION_DATA(fault),
+		 info);
+}
+
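+/* TILER_OOM handling: try to grow the tiler heap and pass the new chunk to
+ * the FW. If the heap can't grow right now (-EBUSY), report a NULL chunk;
+ * on any other failure, flag the queue as faulty and force a scheduler tick.
+ */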
+static void pancsf_handle_tiler_oom(struct pancsf_device *pfdev,
+				    unsigned int csg_id,
+				    unsigned int cs_id)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct pancsf_group *group = csg_slot->group;
+	const struct pancsf_fw_cs_iface *cs_iface;
+	struct pancsf_heap_pool *heaps;
+	struct pancsf_queue *stream;
+	u32 fault, vt_start, vt_end, frag_end;
+	u32 renderpasses_in_flight, pending_frag_count;
+	u64 info, heap_address, new_chunk_va;
+	int ret;
+
+	if (WARN_ON(!group))
+		return;
+
+	cs_iface = pancsf_get_cs_iface(pfdev, csg_id, cs_id);
+	stream = group->streams[cs_id];
+	heaps = group->heaps;
+	fault = cs_iface->output->fault;
+	info = cs_iface->output->fault_info;
+	heap_address = cs_iface->output->heap_address;
+	vt_start = cs_iface->output->heap_vt_start;
+	vt_end = cs_iface->output->heap_vt_end;
+	frag_end = cs_iface->output->heap_frag_end;
+	renderpasses_in_flight = vt_start - frag_end;
+	pending_frag_count = vt_end - frag_end;
+
+	if (!heaps || frag_end > vt_end || vt_end >= vt_start) {
+		ret = -EINVAL;
+	} else {
+		ret = pancsf_heap_grow(heaps, heap_address,
+				       renderpasses_in_flight,
+				       pending_frag_count, &new_chunk_va);
+	}
+
+	if (!ret) {
+		cs_iface->input->heap_start = new_chunk_va;
+		cs_iface->input->heap_end = new_chunk_va;
+	} else if (ret == -EBUSY) {
+		cs_iface->input->heap_start = 0;
+		cs_iface->input->heap_end = 0;
+	} else {
+		group->fatal_streams |= BIT(cs_id);
+		mod_delayed_work(sched->wq, &sched->tick_work, 0);
+	}
+}
+
+static bool pancsf_handle_cs_irq(struct pancsf_device *pfdev,
+				 unsigned int csg_id, unsigned int cs_id)
+{
+	const struct pancsf_fw_cs_iface *cs_iface;
+	u32 req, ack, events;
+
+	cs_iface = pancsf_get_cs_iface(pfdev, csg_id, cs_id);
+	req = cs_iface->input->req;
+	ack = cs_iface->output->ack;
+	events = req ^ ack;
+
+	if (events & CS_FATAL)
+		pancsf_handle_cs_fatal(pfdev, csg_id, cs_id);
+
+	if (events & CS_FAULT)
+		pancsf_handle_cs_fault(pfdev, csg_id, cs_id);
+
+	if (events & CS_TILER_OOM)
+		pancsf_handle_tiler_oom(pfdev, csg_id, cs_id);
+
+	cs_iface->input->req = pancsf_update_reqs(req, ack,
+						  CS_FATAL | CS_FAULT | CS_TILER_OOM);
+
+	return (events & (CS_FAULT | CS_TILER_OOM)) != 0;
+}
+
+static void pancsf_handle_csg_state_update(struct pancsf_device *pfdev, unsigned int csg_id)
+{
+	struct pancsf_csg_slot *csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	const struct pancsf_fw_csg_iface *csg_iface;
+
+	csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+	csg_slot->idle = csg_iface->output->status_state & CSG_STATUS_STATE_IS_IDLE;
+}
+
+static void pancsf_handle_csg_idle(struct pancsf_device *pfdev, unsigned int csg_id)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	int prio;
+
+	sched->might_have_idle_groups = true;
+
+	/* Schedule a tick if there are other runnable groups waiting. */
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		if (!list_empty(&sched->run_queues[prio])) {
+			mod_delayed_work(sched->wq, &sched->tick_work, 0);
+			return;
+		}
+	}
+}
+
+/* Macro automating the 'grab a ref, schedule the work and release ref if
+ * already scheduled' sequence.
+ */
+#define pancsf_group_queue_work(sched, group, wname) \
+	do { \
+		pancsf_group_get(group); \
+		if (!queue_work((sched)->wq, &(group)->wname ## _work)) \
+			pancsf_group_put(group); \
+	} while (0)
+
+static void pancsf_queue_csg_sync_update_locked(struct pancsf_device *pfdev,
+						unsigned int csg_id)
+{
+	struct pancsf_csg_slot *csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	struct pancsf_group *group = csg_slot->group;
+
+	pancsf_group_queue_work(pfdev->scheduler, group, sync_upd);
+
+	queue_work(pfdev->scheduler->wq, &pfdev->scheduler->sync_upd_work);
+}
+
+static void
+pancsf_handle_csg_progress_timer_evt(struct pancsf_device *pfdev, unsigned int csg_id)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct pancsf_group *group = csg_slot->group;
+
+	WARN_ON(1);
+	group->timedout = true;
+	mod_delayed_work(sched->wq, &sched->tick_work, 0);
+}
+
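+/* Process a CSG interrupt: complete pending slot requests, handle the
+ * SYNC_UPDATE/IDLE/PROGRESS_TIMER events and dispatch per-CS interrupts.
+ * Returns true if the caller needs to ring the CSG doorbell.
+ */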
+static bool pancsf_sched_handle_csg_irq(struct pancsf_device *pfdev, unsigned int csg_id)
+{
+	struct pancsf_csg_slot *csg_slot = &pfdev->scheduler->csg_slots[csg_id];
+	const struct pancsf_fw_csg_iface *csg_iface;
+	struct pancsf_group *group = csg_slot->group;
+	u32 req, ack, irq_req, irq_ack, cs_events, csg_events;
+	u32 ring_cs_db_mask = 0;
+	u32 acked_reqs;
+
+	lockdep_assert_held(&pfdev->scheduler->lock);
+	if (WARN_ON(csg_id >= pfdev->scheduler->csg_slot_count))
+		return false;
+
+	csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+	req = csg_iface->input->req;
+	ack = csg_iface->output->ack;
+	irq_req = csg_iface->output->irq_req;
+	irq_ack = csg_iface->input->irq_ack;
+	csg_events = req ^ ack;
+	acked_reqs = csg_slot->pending_reqs & ~csg_events;
+
+	if (acked_reqs & CSG_ENDPOINT_CONFIG)
+		pancsf_sync_csg_slot_priority_locked(pfdev, csg_id);
+
+	if (acked_reqs & CSG_STATE_MASK) {
+		acked_reqs |= CSG_STATE_MASK;
+		pancsf_sync_csg_slot_state_locked(pfdev, csg_id);
+	}
+
+	if (acked_reqs & CSG_STATUS_UPDATE)
+		pancsf_handle_csg_state_update(pfdev, csg_id);
+
+	if (acked_reqs) {
+		csg_slot->pending_reqs &= ~acked_reqs;
+		wake_up_all(&csg_slot->reqs_acked);
+	}
+
+	/* There may not be any pending CSG/CS interrupts to process */
+	if (req == ack && irq_req == irq_ack)
+		return false;
+
+	/* Immediately set the IRQ_ACK bits to match the IRQ_REQ bits before
+	 * examining the CS_ACK & CS_REQ bits. This ensures the host doesn't
+	 * miss an interrupt for a CS if the firmware raises another interrupt
+	 * for that CS while the current one is being serviced.
+	 */
+	csg_iface->input->irq_ack = irq_req;
+
+	if (WARN_ON(!group))
+		return false;
+
+	csg_iface->input->req = pancsf_update_reqs(req, ack,
+						   CSG_SYNC_UPDATE |
+						   CSG_IDLE |
+						   CSG_PROGRESS_TIMER_EVENT);
+
+	if (csg_events & CSG_SYNC_UPDATE)
+		pancsf_queue_csg_sync_update_locked(pfdev, csg_id);
+
+	if (csg_events & CSG_IDLE)
+		pancsf_handle_csg_idle(pfdev, csg_id);
+
+	if (csg_events & CSG_PROGRESS_TIMER_EVENT)
+		pancsf_handle_csg_progress_timer_evt(pfdev, csg_id);
+
+	cs_events = irq_req ^ irq_ack;
+	while (cs_events) {
+		u32 cs_id = ffs(cs_events) - 1;
+
+		if (pancsf_handle_cs_irq(pfdev, csg_id, cs_id))
+			ring_cs_db_mask |= BIT(cs_id);
+
+		cs_events &= ~BIT(cs_id);
+	}
+
+	if (ring_cs_db_mask) {
+		csg_iface->input->doorbell_req = pancsf_toggle_reqs(csg_iface->input->doorbell_req,
+								    csg_iface->output->doorbell_ack,
+								    ring_cs_db_mask);
+	}
+
+	return ring_cs_db_mask != 0;
+}
+
+static void pancsf_sched_handle_global_irq(struct pancsf_device *pfdev)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+	u32 req, ack, events, acked_reqs;
+
+	req = glb_iface->input->req;
+	ack = glb_iface->output->ack;
+	events = req ^ ack;
+	acked_reqs = sched->pending_reqs & ~events;
+
+	if (acked_reqs) {
+		sched->pending_reqs &= ~acked_reqs;
+		wake_up_all(&sched->reqs_acked);
+	}
+
+	if (events & GLB_IDLE) {
+		glb_iface->input->req = pancsf_update_reqs(glb_iface->input->req,
+							   glb_iface->output->ack,
+							   GLB_IDLE);
+	}
+}
+
+void pancsf_sched_handle_job_irqs(struct pancsf_device *pfdev, u32 status)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	u32 csg_ints = status & ~JOB_INT_GLOBAL_IF;
+	u32 ring_csg_db = 0;
+
+	mutex_lock(&sched->lock);
+	if (status & JOB_INT_GLOBAL_IF)
+		pancsf_sched_handle_global_irq(pfdev);
+
+	while (csg_ints) {
+		u32 csg_id = ffs(csg_ints) - 1;
+
+		csg_ints &= ~BIT(csg_id);
+		if (pancsf_sched_handle_csg_irq(pfdev, csg_id))
+			ring_csg_db |= BIT(csg_id);
+	}
+
+	if (ring_csg_db) {
+		const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+
+		glb_iface->input->doorbell_req = pancsf_toggle_reqs(glb_iface->input->doorbell_req,
+								    glb_iface->output->doorbell_ack,
+								    ring_csg_db);
+		gpu_write(pfdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+	}
+
+	mutex_unlock(&sched->lock);
+}
+
+static struct pancsf_queue_fence *
+to_pancsf_queue_fence(struct dma_fence *fence)
+{
+	return container_of(fence, struct pancsf_queue_fence, base);
+}
+
+static const char *pancsf_queue_fence_get_driver_name(struct dma_fence *fence)
+{
+	return "pancsf";
+}
+
+static const char *pancsf_queue_fence_get_timeline_name(struct dma_fence *fence)
+{
+	struct pancsf_queue_fence *f = to_pancsf_queue_fence(fence);
+
+	return f->ctx->name;
+}
+
+static int
+pancsf_queue_fence_ctx_create(struct pancsf_queue *queue)
+{
+	struct pancsf_queue_fence_ctx *ctx;
+
+	if (WARN_ON(!try_module_get(THIS_MODULE)))
+		return -ENOENT;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx) {
+		module_put(THIS_MODULE);
+		return -ENOMEM;
+	}
+
+	spin_lock_init(&ctx->lock);
+	ctx->id = dma_fence_context_alloc(1);
+	snprintf(ctx->name, sizeof(ctx->name),
+		 PANCSF_CS_QUEUE_FENCE_CTX_NAME_PREFIX "%llx",
+		 ctx->id);
+	kref_init(&ctx->refcount);
+	queue->fence_ctx = ctx;
+	return 0;
+}
+
+static void pancsf_queue_fence_release(struct dma_fence *fence)
+{
+	struct pancsf_queue_fence *f = to_pancsf_queue_fence(fence);
+
+	kref_put(&f->ctx->refcount, pancsf_queue_fence_ctx_release);
+	dma_fence_free(fence);
+}
+
+static const struct dma_fence_ops pancsf_queue_fence_ops = {
+	.get_driver_name = pancsf_queue_fence_get_driver_name,
+	.get_timeline_name = pancsf_queue_fence_get_timeline_name,
+	.release = pancsf_queue_fence_release,
+};
+
+static struct dma_fence *pancsf_queue_fence_create(struct pancsf_queue *queue)
+{
+	struct pancsf_queue_fence *fence;
+
+	fence = kzalloc(sizeof(*fence), GFP_KERNEL);
+	if (!fence)
+		return ERR_PTR(-ENOMEM);
+
+	kref_get(&queue->fence_ctx->refcount);
+	fence->ctx = queue->fence_ctx;
+	fence->seqno = atomic64_inc_return(&fence->ctx->seqno);
+	dma_fence_init(&fence->base, &pancsf_queue_fence_ops, &fence->ctx->lock,
+		       fence->ctx->id, fence->seqno);
+
+	return &fence->base;
+}
+
+#define CSF_MAX_QUEUE_PRIO	GENMASK(3, 0)
+
+static struct pancsf_queue *
+pancsf_create_queue(struct pancsf_group *group,
+		    const struct drm_pancsf_queue_create *args)
+{
+	struct pancsf_queue *stream;
+	int ret;
+
+	if (args->pad[0] || args->pad[1] || args->pad[2])
+		return ERR_PTR(-EINVAL);
+
+	if (!IS_ALIGNED(args->ringbuf_size, PAGE_SIZE) || args->ringbuf_size > SZ_64K)
+		return ERR_PTR(-EINVAL);
+
+	if (args->priority > CSF_MAX_QUEUE_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	stream = kzalloc(sizeof(*stream), GFP_KERNEL);
+	if (!stream)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&stream->jobs.lock);
+	INIT_LIST_HEAD(&stream->jobs.in_flight);
+	INIT_LIST_HEAD(&stream->jobs.pending);
+	stream->priority = args->priority;
+
+	stream->ringbuf.bo = pancsf_gem_create_and_map(group->pfdev, group->vm,
+						       args->ringbuf_size, 0,
+						       PANCSF_VMA_MAP_NOEXEC |
+						       PANCSF_VMA_MAP_UNCACHED |
+						       PANCSF_VMA_MAP_AUTO_VA,
+						       &stream->ringbuf.gpu_va,
+						       (void **)&stream->ringbuf.kmap);
+	if (IS_ERR(stream->ringbuf.bo)) {
+		ret = PTR_ERR(stream->ringbuf.bo);
+		goto out;
+	}
+
+	stream->iface.mem = pancsf_fw_alloc_queue_iface_mem(group->pfdev);
+	if (IS_ERR(stream->iface.mem)) {
+		ret = PTR_ERR(stream->iface.mem);
+		goto out;
+	}
+
+	stream->iface.input = pancsf_fw_mem_vmap(stream->iface.mem,
+						 pgprot_writecombine(PAGE_KERNEL));
+	if (!stream->iface.input) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memset(stream->iface.input, 0, sizeof(*stream->iface.input));
+	stream->iface.output = (void *)stream->iface.input + PAGE_SIZE;
+	memset((void *)stream->iface.output, 0, sizeof(*stream->iface.output));
+
+	ret = pancsf_queue_fence_ctx_create(stream);
+
+out:
+	if (ret)
+		return ERR_PTR(ret);
+
+	return stream;
+}
+
+static void pancsf_job_dep_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
+{
+	struct pancsf_job *job = container_of(cb, struct pancsf_job, dep_cb);
+	struct pancsf_group *group = job->group;
+	struct pancsf_scheduler *sched = group->pfdev->scheduler;
+
+	pancsf_group_queue_work(sched, group, job_pending);
+}
+
+static bool
+pancsf_job_deps_done(struct pancsf_job *job)
+{
+	if (job->cur_dep && !dma_fence_is_signaled(job->cur_dep))
+		return false;
+
+	dma_fence_put(job->cur_dep);
+	job->cur_dep = NULL;
+
+	while (!xa_empty(&job->dependencies)) {
+		struct dma_fence *next_dep;
+		int ret;
+
+		next_dep = xa_erase(&job->dependencies, job->last_dependency++);
+		ret = dma_fence_add_callback(next_dep, &job->dep_cb,
+					     pancsf_job_dep_cb);
+		if (!ret) {
+			job->cur_dep = next_dep;
+			return false;
+		}
+
+		WARN_ON(ret != -ENOENT);
+		dma_fence_put(next_dep);
+	}
+
+	return true;
+}
+
+#define NUM_INSTRS_PER_SLOT		8
+
+static bool
+pancsf_queue_can_take_new_jobs(struct pancsf_queue *stream)
+{
+	u32 ringbuf_size = stream->ringbuf.bo->base.base.size;
+	u64 used_size = stream->iface.input->insert - stream->iface.output->extract;
+
+	return used_size + (NUM_INSTRS_PER_SLOT * sizeof(u64)) <= ringbuf_size;
+}
+
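+/* Copy the call sequence for a job (a CALL to the user command stream
+ * followed by a syncobj update) into the queue ring buffer, advance the
+ * insert pointer and arm the job timeout. The doorbell is rung by the
+ * caller.
+ */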
+static void pancsf_queue_submit_job(struct pancsf_queue *stream, struct pancsf_job *job)
+{
+	struct pancsf_group *group = job->group;
+	struct pancsf_device *pfdev = group->pfdev;
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	u32 ringbuf_size = stream->ringbuf.bo->base.base.size;
+	u32 ringbuf_insert = stream->iface.input->insert % ringbuf_size;
+	u64 addr_reg = pfdev->csif_info.cs_reg_count -
+		       pfdev->csif_info.unpreserved_cs_reg_count;
+	u64 val_reg = addr_reg + 2;
+	u64 sync_addr = group->syncobjs.gpu_va +
+			job->stream_idx * sizeof(struct pancsf_syncobj_64b);
+	u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
+
+	u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
+		/* MOV48 rX:rX+1, cs.start */
+		(1ull << 56) | (addr_reg << 48) | job->call_info.start,
+
+		/* MOV32 rX+2, cs.size */
+		(2ull << 56) | (val_reg << 48) | job->call_info.size,
+
+		/* CALL rX:rX+1, rX+2 */
+		(32ull << 56) | (addr_reg << 40) | (val_reg << 32),
+
+		/* MOV48 rX:rX+1, sync_addr */
+		(1ull << 56) | (addr_reg << 48) | sync_addr,
+
+		/* MOV48 rX+2, #1 (sync_seqno increment) */
+		(1ull << 56) | (val_reg << 48) | 1,
+
+		/* WAIT(all) */
+		(3ull << 56) | (waitall_mask << 16),
+
+		/* SYNC_ADD64.system_scope.propagate_err.nowait rX:rX+1, rX+2 */
+		(51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
+
+		/* ERROR_BARRIER, so we can recover from faults at job
+		 * boundaries.
+		 */
+		(47ull << 56),
+	};
+
+	/* The call sequence needs to be cacheline-aligned to please the prefetcher. */
+	WARN_ON(sizeof(call_instrs) % 64);
+
+	memcpy((u8 *)stream->ringbuf.kmap + ringbuf_insert,
+	       call_instrs, sizeof(call_instrs));
+
+	spin_lock(&stream->jobs.lock);
+	list_move_tail(&job->node, &stream->jobs.in_flight);
+	spin_unlock(&stream->jobs.lock);
+
+	job->ringbuf.start = stream->iface.input->insert;
+	job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
+
+	/* Make sure the ring buffer is updated before the INSERT
+	 * register.
+	 */
+	wmb();
+	stream->iface.input->extract = stream->iface.output->extract;
+	stream->iface.input->insert = job->ringbuf.end;
+	kref_get(&job->refcount);
+	queue_delayed_work(pfdev->scheduler->wq,
+			   &job->timeout_work,
+			   msecs_to_jiffies(CSF_JOB_TIMEOUT_MS));
+}
+
+struct pancsf_csg_slots_upd_ctx {
+	u32 update_mask;
+	u32 timedout_mask;
+	struct {
+		u32 value;
+		u32 mask;
+	} requests[MAX_CSGS];
+};
+
+static void csgs_upd_ctx_init(struct pancsf_csg_slots_upd_ctx *ctx)
+{
+	memset(ctx, 0, sizeof(*ctx));
+}
+
+static void csgs_upd_ctx_queue_reqs(struct pancsf_device *pfdev,
+				    struct pancsf_csg_slots_upd_ctx *ctx,
+				    u32 csg_id, u32 value, u32 mask)
+{
+	if (WARN_ON(!mask) ||
+	    WARN_ON(csg_id >= pfdev->scheduler->csg_slot_count))
+		return;
+
+	ctx->requests[csg_id].value = (ctx->requests[csg_id].value & ~mask) | (value & mask);
+	ctx->requests[csg_id].mask |= mask;
+	ctx->update_mask |= BIT(csg_id);
+}
+
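+/* Apply all queued CSG slot requests: update the CSG input pages, ring the
+ * global doorbell, then wait for each slot to acknowledge. Slots that fail
+ * to acknowledge in time are reported in ctx->timedout_mask.
+ */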
+static int csgs_upd_ctx_apply_locked(struct pancsf_device *pfdev,
+				     struct pancsf_csg_slots_upd_ctx *ctx)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_fw_global_iface *glb_iface;
+	u32 update_slots = ctx->update_mask;
+
+	lockdep_assert_held(&sched->lock);
+
+	if (!ctx->update_mask)
+		return 0;
+
+	while (update_slots) {
+		const struct pancsf_fw_csg_iface *csg_iface;
+		u32 csg_id = ffs(update_slots) - 1;
+		struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+		update_slots &= ~BIT(csg_id);
+		csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+		csg_slot->pending_reqs |= ctx->requests[csg_id].mask;
+		csg_iface->input->req = pancsf_update_reqs(csg_iface->input->req,
+							   ctx->requests[csg_id].value,
+							   ctx->requests[csg_id].mask);
+	}
+
+	glb_iface = pancsf_get_glb_iface(pfdev);
+	glb_iface->input->doorbell_req = pancsf_toggle_reqs(glb_iface->input->doorbell_req,
+							    glb_iface->output->doorbell_ack,
+							    ctx->update_mask);
+	gpu_write(pfdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+
+	update_slots = ctx->update_mask;
+	while (update_slots) {
+		const struct pancsf_fw_csg_iface *csg_iface;
+		u32 csg_id = ffs(update_slots) - 1;
+		struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+		u32 req_mask = ctx->requests[csg_id].mask;
+		bool timedout = false;
+
+		update_slots &= ~BIT(csg_id);
+		csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+
+		/* Release the lock while we're waiting. */
+		mutex_unlock(&sched->lock);
+		if (!wait_event_timeout(csg_slot->reqs_acked,
+					!(csg_slot->pending_reqs & req_mask),
+					msecs_to_jiffies(100))) {
+			WARN_ON(gpu_read(pfdev, JOB_INT_MASK) == 0);
+			timedout = true;
+		}
+		mutex_lock(&sched->lock);
+
+		if (timedout &&
+		    (csg_slot->pending_reqs & req_mask) != 0 &&
+		    ((csg_iface->input->req ^ csg_iface->output->ack) & req_mask) != 0) {
+			dev_err(pfdev->dev, "CSG %d update request timedout", csg_id);
+			ctx->timedout_mask |= BIT(csg_id);
+		}
+	}
+
+	if (ctx->timedout_mask)
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+struct pancsf_sched_tick_ctx {
+	struct list_head old_groups[PANCSF_CSG_PRIORITY_COUNT];
+	struct list_head groups[PANCSF_CSG_PRIORITY_COUNT];
+	u32 idle_group_count;
+	u32 group_count;
+	enum pancsf_csg_priority min_priority;
+	struct pancsf_vm *vms[MAX_CS_PER_CSG];
+	u32 as_count;
+	bool immediate_tick;
+	u32 csg_upd_failed_mask;
+};
+
+static bool
+tick_ctx_is_full(const struct pancsf_scheduler *sched,
+		 const struct pancsf_sched_tick_ctx *ctx)
+{
+	return ctx->group_count == sched->csg_slot_count;
+}
+
+static bool
+pancsf_group_is_idle(struct pancsf_group *group)
+{
+	struct pancsf_device *pfdev = group->pfdev;
+	u32 inactive_streams;
+
+	if (group->csg_id >= 0)
+		return pfdev->scheduler->csg_slots[group->csg_id].idle;
+
+	inactive_streams = group->idle_streams | group->blocked_streams;
+	return hweight32(inactive_streams) == group->stream_count;
+}
+
+static bool
+pancsf_group_can_run(struct pancsf_group *group)
+{
+	return group->state != PANCSF_CS_GROUP_TERMINATED &&
+	       !group->destroyed && group->fatal_streams == 0 &&
+	       !group->timedout;
+}
+
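+/* Pick runnable groups from a queue and add them to the tick context until
+ * all CSG slots are accounted for. Groups are skipped if they can't run,
+ * if they're idle (when skip_idle_groups is set), or if picking them would
+ * exceed the number of available address spaces.
+ */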
+static void
+tick_ctx_pick_groups_from_queue(const struct pancsf_scheduler *sched,
+				struct pancsf_sched_tick_ctx *ctx,
+				struct list_head *queue,
+				bool skip_idle_groups)
+{
+	struct pancsf_group *group, *tmp;
+
+	if (tick_ctx_is_full(sched, ctx))
+		return;
+
+	list_for_each_entry_safe(group, tmp, queue, run_node) {
+		u32 i;
+
+		if (!pancsf_group_can_run(group))
+			continue;
+
+		if (skip_idle_groups && pancsf_group_is_idle(group))
+			continue;
+
+		for (i = 0; i < ctx->as_count; i++) {
+			if (ctx->vms[i] == group->vm)
+				break;
+		}
+
+		if (i == ctx->as_count && ctx->as_count == sched->as_slot_count)
+			continue;
+
+		if (!group->in_tick_ctx) {
+			pancsf_group_get(group);
+			group->in_tick_ctx = true;
+		}
+
+		list_move_tail(&group->run_node, &ctx->groups[group->priority]);
+		ctx->group_count++;
+		if (pancsf_group_is_idle(group))
+			ctx->idle_group_count++;
+
+		if (i == ctx->as_count)
+			ctx->vms[ctx->as_count++] = group->vm;
+
+		if (ctx->min_priority > group->priority)
+			ctx->min_priority = group->priority;
+
+		if (tick_ctx_is_full(sched, ctx))
+			return;
+	}
+}
+
+static void
+tick_ctx_insert_old_group(struct pancsf_scheduler *sched,
+			  struct pancsf_sched_tick_ctx *ctx,
+			  struct pancsf_group *group,
+			  bool full_tick)
+{
+	struct pancsf_csg_slot *csg_slot = &sched->csg_slots[group->csg_id];
+	struct pancsf_group *other_group;
+
+	if (!full_tick) {
+		list_add_tail(&group->run_node, &ctx->old_groups[group->priority]);
+		return;
+	}
+
+	/* Rotate to make sure groups with lower CSG slot
+	 * priorities have a chance to get a higher CSG slot
+	 * priority next time they get picked. This priority
+	 * has an impact on resource request ordering, so it's
+	 * important to make sure we don't let one group starve
+	 * all other groups with the same group priority.
+	 */
+	list_for_each_entry(other_group,
+			    &ctx->old_groups[csg_slot->group->priority],
+			    run_node) {
+		struct pancsf_csg_slot *other_csg_slot = &sched->csg_slots[other_group->csg_id];
+
+		if (other_csg_slot->priority > csg_slot->priority) {
+			list_add_tail(&csg_slot->group->run_node, &other_group->run_node);
+			return;
+		}
+	}
+
+	list_add_tail(&group->run_node, &ctx->old_groups[group->priority]);
+}
+
+static void pancsf_sched_queue_reset_locked(struct pancsf_device *pfdev)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+
+	if (!sched->reset_pending) {
+		sched->reset_pending = true;
+		queue_work(sched->wq, &sched->reset_work);
+	}
+}
+
+void pancsf_sched_queue_reset(struct pancsf_device *pfdev)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+
+	mutex_lock(&sched->lock);
+	pancsf_sched_queue_reset_locked(pfdev);
+	mutex_unlock(&sched->lock);
+}
+
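+/* Initialize a tick context: collect the groups currently bound to CSG
+ * slots into the old_groups lists and request a FW status update so their
+ * idle state is accurate.
+ */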
+static void
+tick_ctx_init(struct pancsf_scheduler *sched,
+	      struct pancsf_sched_tick_ctx *ctx,
+	      bool full_tick)
+{
+	struct pancsf_fw_global_iface *glb_iface;
+	struct pancsf_device *pfdev = sched->pfdev;
+	struct pancsf_csg_slots_upd_ctx upd_ctx;
+	int ret;
+	u32 i;
+
+	glb_iface = pancsf_get_glb_iface(pfdev);
+	memset(ctx, 0, sizeof(*ctx));
+	csgs_upd_ctx_init(&upd_ctx);
+
+	ctx->min_priority = PANCSF_CSG_PRIORITY_COUNT;
+	for (i = 0; i < ARRAY_SIZE(ctx->groups); i++) {
+		INIT_LIST_HEAD(&ctx->groups[i]);
+		INIT_LIST_HEAD(&ctx->old_groups[i]);
+	}
+
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		struct pancsf_csg_slot *csg_slot = &sched->csg_slots[i];
+		const struct pancsf_fw_csg_iface *csg_iface;
+
+		csg_iface = pancsf_get_csg_iface(pfdev, i);
+		if (csg_slot->group) {
+			pancsf_group_get(csg_slot->group);
+			tick_ctx_insert_old_group(sched, ctx, csg_slot->group, full_tick);
+
+			csg_slot->group->in_tick_ctx = true;
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, i,
+						csg_iface->output->ack ^ CSG_STATUS_UPDATE,
+						CSG_STATUS_UPDATE);
+		}
+	}
+
+	ret = csgs_upd_ctx_apply_locked(pfdev, &upd_ctx);
+	if (ret) {
+		pancsf_sched_queue_reset_locked(pfdev);
+		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
+	}
+}
+
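+/* Group termination post-processing: signal the done fences of all in-flight
+ * and pending jobs with an appropriate error, then release the jobs.
+ */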
+static void
+pancsf_group_term_post_processing(struct pancsf_group *group)
+{
+	bool cookie;
+	u32 i = 0;
+
+	if (WARN_ON(pancsf_group_can_run(group)))
+		return;
+
+	cookie = dma_fence_begin_signalling();
+	for (i = 0; i < group->stream_count; i++) {
+		struct pancsf_queue *stream = group->streams[i];
+		struct pancsf_job *job;
+		int err;
+
+		if (group->fatal_streams & BIT(i))
+			err = -EINVAL;
+		else if (group->timedout)
+			err = -ETIMEDOUT;
+		else
+			err = -ECANCELED;
+
+		if (!stream)
+			continue;
+
+		list_for_each_entry(job, &stream->jobs.in_flight, node) {
+			dma_fence_set_error(job->done_fence, err);
+			dma_fence_signal(job->done_fence);
+		}
+
+		list_for_each_entry(job, &stream->jobs.pending, node) {
+			dma_fence_set_error(job->done_fence, -ECANCELED);
+			dma_fence_signal(job->done_fence);
+		}
+	}
+	dma_fence_end_signalling(cookie);
+
+	for (i = 0; i < group->stream_count; i++) {
+		struct pancsf_queue *stream = group->streams[i];
+		struct pancsf_job *job, *job_tmp;
+
+		if (!stream)
+			continue;
+
+		list_for_each_entry_safe(job, job_tmp, &stream->jobs.in_flight, node) {
+			list_del_init(&job->node);
+			if (cancel_delayed_work(&job->timeout_work))
+				pancsf_put_job(job);
+			pancsf_put_job(job);
+		}
+
+		list_for_each_entry_safe(job, job_tmp, &stream->jobs.pending, node) {
+			list_del_init(&job->node);
+			pancsf_put_job(job);
+		}
+	}
+}
+
+static void pancsf_group_term_work(struct work_struct *work)
+{
+	struct pancsf_group *group = container_of(work,
+						       struct pancsf_group,
+						       term_work);
+	pancsf_group_term_post_processing(group);
+	pancsf_group_put(group);
+}
+
+static void
+tick_ctx_cleanup(struct pancsf_scheduler *sched,
+		 struct pancsf_sched_tick_ctx *ctx)
+{
+	struct pancsf_group *group, *tmp;
+	u32 i;
+
+	for (i = 0; i < ARRAY_SIZE(ctx->old_groups); i++) {
+		list_for_each_entry_safe(group, tmp, &ctx->old_groups[i], run_node) {
+			/* If everything went fine, we should only have groups
+			 * to be terminated in the old_groups lists.
+			 */
+			WARN_ON(!ctx->csg_upd_failed_mask &&
+				pancsf_group_can_run(group));
+
+			if (!pancsf_group_can_run(group)) {
+				list_del_init(&group->run_node);
+				pancsf_group_queue_work(sched, group, term);
+			} else if (group->csg_id >= 0) {
+				list_del_init(&group->run_node);
+			} else {
+				list_move(&group->run_node,
+					  pancsf_group_is_idle(group) ?
+					  &sched->idle_queues[group->priority] :
+					  &sched->run_queues[group->priority]);
+			}
+			group->in_tick_ctx = false;
+			pancsf_group_put(group);
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ctx->groups); i++) {
+		/* If everything went fine, the groups to schedule lists should
+		 * be empty.
+		 */
+		WARN_ON(!ctx->csg_upd_failed_mask && !list_empty(&ctx->groups[i]));
+
+		list_for_each_entry_safe(group, tmp, &ctx->groups[i], run_node) {
+			if (group->csg_id >= 0) {
+				list_del_init(&group->run_node);
+			} else {
+				list_move(&group->run_node,
+					  pancsf_group_is_idle(group) ?
+					  &sched->idle_queues[group->priority] :
+					  &sched->run_queues[group->priority]);
+			}
+			group->in_tick_ctx = false;
+			pancsf_group_put(group);
+		}
+	}
+}
+
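+/* Apply the scheduling decisions taken during a tick: suspend or terminate
+ * evicted groups, adjust the slot priority of groups that stay resident,
+ * then bind and start the newly picked groups on the freed CSG slots.
+ */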
+static void
+tick_ctx_apply(struct pancsf_scheduler *sched, struct pancsf_sched_tick_ctx *ctx)
+{
+	struct pancsf_group *group, *tmp;
+	struct pancsf_device *pfdev = sched->pfdev;
+	struct pancsf_fw_global_iface *glb_iface;
+	struct pancsf_csg_slot *csg_slot;
+	int prio, new_csg_prio = MAX_CSG_PRIO, i;
+	u32 csg_mod_mask = 0, free_csg_slots = 0;
+	struct pancsf_csg_slots_upd_ctx upd_ctx;
+	int ret;
+
+	csgs_upd_ctx_init(&upd_ctx);
+	glb_iface = pancsf_get_glb_iface(pfdev);
+
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		/* Suspend or terminate evicted groups. */
+		list_for_each_entry(group, &ctx->old_groups[prio], run_node) {
+			const struct pancsf_fw_csg_iface *csg_iface;
+			bool term = !pancsf_group_can_run(group);
+			int csg_id = group->csg_id;
+
+			if (WARN_ON(csg_id < 0))
+				continue;
+
+			csg_slot = &sched->csg_slots[csg_id];
+			csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, csg_id,
+						term ? CSG_STATE_TERMINATE : CSG_STATE_SUSPEND,
+						CSG_STATE_MASK);
+		}
+
+		/* Update priorities on already running groups. */
+		list_for_each_entry(group, &ctx->groups[prio], run_node) {
+			const struct pancsf_fw_csg_iface *csg_iface;
+			int csg_id = group->csg_id;
+			u32 ep_req;
+
+			if (csg_id < 0)
+				continue;
+
+			csg_slot = &sched->csg_slots[csg_id];
+			csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+			if (csg_slot->priority == new_csg_prio) {
+				new_csg_prio--;
+				continue;
+			}
+
+			ep_req = pancsf_update_reqs(csg_iface->input->endpoint_req,
+						    CSG_EP_REQ_PRIORITY(new_csg_prio),
+						    CSG_EP_REQ_PRIORITY_MASK);
+			csg_iface->input->endpoint_req = ep_req;
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, csg_id,
+						csg_iface->output->ack ^ CSG_ENDPOINT_CONFIG,
+						CSG_ENDPOINT_CONFIG);
+			new_csg_prio--;
+		}
+	}
+
+	ret = csgs_upd_ctx_apply_locked(pfdev, &upd_ctx);
+	if (ret) {
+		pancsf_sched_queue_reset_locked(pfdev);
+		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
+		return;
+	}
+
+	/* Unbind evicted groups. */
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		list_for_each_entry(group, &ctx->old_groups[prio], run_node) {
+			pancsf_unbind_group_locked(group);
+		}
+	}
+
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		if (!sched->csg_slots[i].group)
+			free_csg_slots |= BIT(i);
+	}
+
+	csgs_upd_ctx_init(&upd_ctx);
+	new_csg_prio = MAX_CSG_PRIO;
+
+	/* Start new groups. */
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		list_for_each_entry(group, &ctx->groups[prio], run_node) {
+			int csg_id = group->csg_id;
+			const struct pancsf_fw_csg_iface *csg_iface;
+
+			if (csg_id >= 0) {
+				new_csg_prio--;
+				continue;
+			}
+
+			csg_id = ffs(free_csg_slots) - 1;
+			if (WARN_ON(csg_id < 0))
+				break;
+
+			csg_iface = pancsf_get_csg_iface(pfdev, csg_id);
+			csg_slot = &sched->csg_slots[csg_id];
+			csg_mod_mask |= BIT(csg_id);
+			pancsf_bind_group_locked(group, csg_id);
+			pancsf_prog_csg_slot_locked(pfdev, csg_id, new_csg_prio--);
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, csg_id,
+						group->state == PANCSF_CS_GROUP_SUSPENDED ?
+						CSG_STATE_RESUME : CSG_STATE_START,
+						CSG_STATE_MASK);
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, csg_id,
+						csg_iface->output->ack ^ CSG_ENDPOINT_CONFIG,
+						CSG_ENDPOINT_CONFIG);
+			free_csg_slots &= ~BIT(csg_id);
+		}
+	}
+
+	ret = csgs_upd_ctx_apply_locked(pfdev, &upd_ctx);
+	if (ret) {
+		pancsf_sched_queue_reset_locked(pfdev);
+		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
+		return;
+	}
+
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		list_for_each_entry_safe(group, tmp, &ctx->groups[prio], run_node) {
+			list_del_init(&group->run_node);
+			group->in_tick_ctx = false;
+
+			/* If the group has been destroyed while we were
+			 * scheduling, ask for an immediate tick to
+			 * re-evaluate as soon as possible and get rid of
+			 * this dangling group.
+			 */
+			if (group->destroyed)
+				ctx->immediate_tick = true;
+			pancsf_group_put(group);
+		}
+
+		/* Return evicted groups to the idle or run queues. Groups
+		 * that can no longer be run (because they've been destroyed
+		 * or experienced an unrecoverable error) will be scheduled
+		 * for destruction in tick_ctx_cleanup().
+		 */
+		list_for_each_entry_safe(group, tmp, &ctx->old_groups[prio], run_node) {
+			if (!pancsf_group_can_run(group))
+				continue;
+
+			if (pancsf_group_is_idle(group))
+				list_move_tail(&group->run_node, &sched->idle_queues[prio]);
+			else
+				list_move_tail(&group->run_node, &sched->run_queues[prio]);
+			group->in_tick_ctx = false;
+			pancsf_group_put(group);
+		}
+	}
+
+	sched->used_csg_slot_count = ctx->group_count;
+	sched->might_have_idle_groups = ctx->idle_group_count > 0;
+}
+
+static u64
+tick_ctx_update_resched_target(struct pancsf_scheduler *sched,
+			       const struct pancsf_sched_tick_ctx *ctx)
+{
+	/* We had space left, no need to reschedule until some external event happens. */
+	if (!tick_ctx_is_full(sched, ctx))
+		goto no_tick;
+
+	/* If idle groups were scheduled, no need to wake up until some external
+	 * event happens (group unblocked, new job submitted, ...).
+	 */
+	if (ctx->idle_group_count)
+		goto no_tick;
+
+	if (WARN_ON(ctx->min_priority >= PANCSF_CSG_PRIORITY_COUNT))
+		goto no_tick;
+
+	/* If there are groups of the same priority waiting, we need to
+	 * keep the scheduler ticking, otherwise, we'll just wait for
+	 * new groups with higher priority to be queued.
+	 */
+	if (!list_empty(&sched->run_queues[ctx->min_priority])) {
+		u64 resched_target = sched->last_tick + sched->tick_period;
+
+		if (time_before64(sched->resched_target, sched->last_tick) ||
+		    time_before64(resched_target, sched->resched_target))
+			sched->resched_target = resched_target;
+
+		return sched->resched_target - sched->last_tick;
+	}
+
+no_tick:
+	sched->resched_target = U64_MAX;
+	return U64_MAX;
+}
+
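+/* Scheduler tick: collect the groups currently bound to CSG slots, pick the
+ * next set of groups to run (non-idle groups first, in priority order),
+ * apply the new mapping and re-arm the tick timer if needed.
+ */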
+static void pancsf_tick_work(struct work_struct *work)
+{
+	struct pancsf_scheduler *sched = container_of(work, struct pancsf_scheduler,
+						      tick_work.work);
+	struct pancsf_sched_tick_ctx ctx;
+	u64 remaining_jiffies = 0, resched_delay;
+	u64 now = get_jiffies_64();
+	int prio;
+
+	if (time_before64(now, sched->resched_target))
+		remaining_jiffies = sched->resched_target - now;
+
+	mutex_lock(&sched->lock);
+	if (sched->reset_pending)
+		goto out_unlock;
+
+	tick_ctx_init(sched, &ctx, remaining_jiffies != 0);
+	if (ctx.csg_upd_failed_mask)
+		goto out_cleanup_ctx;
+
+	if (remaining_jiffies) {
+		/* Scheduling forced in the middle of a tick. Only RT groups
+		 * can preempt non-RT ones. Currently running RT groups can't be
+		 * preempted.
+		 */
+		for (prio = PANCSF_CSG_PRIORITY_COUNT - 1;
+		     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
+		     prio--) {
+			tick_ctx_pick_groups_from_queue(sched, &ctx, &ctx.old_groups[prio], true);
+			if (prio == PANCSF_CSG_PRIORITY_RT) {
+				tick_ctx_pick_groups_from_queue(sched, &ctx,
+								&sched->run_queues[prio], true);
+			}
+		}
+	}
+
+	/* First pick non-idle groups */
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1;
+	     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
+	     prio--) {
+		tick_ctx_pick_groups_from_queue(sched, &ctx, &sched->run_queues[prio], true);
+		tick_ctx_pick_groups_from_queue(sched, &ctx, &ctx.old_groups[prio], true);
+	}
+
+	/* If we have free CSG slots left, pick idle groups */
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1;
+	     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
+	     prio--) {
+		/* Check the old_group queue first to avoid reprogramming the slots */
+		tick_ctx_pick_groups_from_queue(sched, &ctx, &ctx.old_groups[prio], false);
+		tick_ctx_pick_groups_from_queue(sched, &ctx, &sched->idle_queues[prio], false);
+	}
+
+	tick_ctx_apply(sched, &ctx);
+	if (ctx.csg_upd_failed_mask)
+		goto out_cleanup_ctx;
+
+	sched->last_tick = now;
+	resched_delay = tick_ctx_update_resched_target(sched, &ctx);
+	if (ctx.immediate_tick)
+		resched_delay = 0;
+
+	if (resched_delay != U64_MAX)
+		mod_delayed_work(sched->wq, &sched->tick_work, resched_delay);
+
+out_cleanup_ctx:
+	tick_ctx_cleanup(sched, &ctx);
+
+out_unlock:
+	mutex_unlock(&sched->lock);
+}
+
+static void *
+pancsf_queue_get_syncwait_obj(struct pancsf_group *group, struct pancsf_queue *stream)
+{
+	struct iosys_map map;
+	int ret;
+
+	if (stream->syncwait.kmap)
+		return stream->syncwait.kmap + stream->syncwait.offset;
+
+	if (!stream->syncwait.bo) {
+		stream->syncwait.bo = pancsf_vm_get_bo_for_vma(group->vm, stream->syncwait.gpu_va,
+							       &stream->syncwait.offset);
+		if (WARN_ON(IS_ERR_OR_NULL(stream->syncwait.bo)))
+			return NULL;
+	}
+
+	ret = drm_gem_shmem_vmap(&stream->syncwait.bo->base, &map);
+	if (WARN_ON(ret))
+		return NULL;
+
+	stream->syncwait.kmap = map.vaddr;
+	if (WARN_ON(!stream->syncwait.kmap))
+		return NULL;
+
+	return stream->syncwait.kmap + stream->syncwait.offset;
+}
+
+static int pancsf_queue_eval_syncwait(struct pancsf_group *group, u8 stream_idx)
+{
+	struct pancsf_queue *stream = group->streams[stream_idx];
+	union {
+		struct pancsf_syncobj_64b sync64;
+		struct pancsf_syncobj_32b sync32;
+	} *syncobj;
+	bool result;
+	u64 value;
+
+	syncobj = pancsf_queue_get_syncwait_obj(group, stream);
+	if (!syncobj)
+		return -EINVAL;
+
+	value = stream->syncwait.sync64 ?
+		syncobj->sync64.seqno :
+		syncobj->sync32.seqno;
+
+	if (stream->syncwait.gt)
+		result = value > stream->syncwait.ref;
+	else
+		result = value <= stream->syncwait.ref;
+
+	if (result) {
+		pancsf_queue_release_syncwait_obj(group, stream);
+		return 1;
+	}
+
+	return 0;
+}
+
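+/* SYNC_UPDATE handling: re-evaluate the SYNC_WAIT conditions of blocked
+ * queues and move groups whose queues got unblocked back to the run queues,
+ * forcing an immediate tick for RT groups.
+ */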
+static void pancsf_sync_upd_work(struct work_struct *work)
+{
+	struct pancsf_scheduler *sched = container_of(work,
+						      struct pancsf_scheduler,
+						      sync_upd_work);
+	struct pancsf_group *group, *tmp;
+	bool immediate_tick = false;
+
+	mutex_lock(&sched->lock);
+	list_for_each_entry_safe(group, tmp, &sched->wait_queue, wait_node) {
+		u32 tested_streams = group->blocked_streams;
+		u32 unblocked_streams = 0;
+
+		while (tested_streams) {
+			u32 cs_id = ffs(tested_streams) - 1;
+			int ret;
+
+			ret = pancsf_queue_eval_syncwait(group, cs_id);
+			WARN_ON(ret < 0);
+			if (ret)
+				unblocked_streams |= BIT(cs_id);
+
+			tested_streams &= ~BIT(cs_id);
+		}
+
+		if (unblocked_streams) {
+			group->blocked_streams &= ~unblocked_streams;
+
+			if (group->csg_id < 0) {
+				list_move(&group->run_node, &sched->run_queues[group->priority]);
+				if (group->priority == PANCSF_CSG_PRIORITY_RT)
+					immediate_tick = true;
+			}
+		}
+
+		if (!group->blocked_streams)
+			list_del_init(&group->wait_node);
+	}
+	mutex_unlock(&sched->lock);
+
+	if (immediate_tick)
+		mod_delayed_work(sched->wq, &sched->tick_work, 0);
+}
+
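+/* Called when new jobs are queued on a group that's not bound to a CSG slot:
+ * move the group to the run queue if it just became non-idle, and decide
+ * whether the scheduler needs an immediate tick or can wait for the next
+ * tick period.
+ */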
+static void pancsf_group_queue_locked(struct pancsf_group *group, u32 stream_mask)
+{
+	struct pancsf_device *pfdev = group->pfdev;
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct list_head *queue = &sched->run_queues[group->priority];
+	u64 delay_jiffies = 0;
+	bool was_idle;
+	u64 now;
+
+	if (!pancsf_group_can_run(group))
+		return;
+
+	/* All updated streams are blocked, no need to wake up the scheduler. */
+	if ((stream_mask & group->blocked_streams) == stream_mask)
+		return;
+
+	/* Group is being evaluated by the scheduler. */
+	if (group->in_tick_ctx)
+		return;
+
+	was_idle = pancsf_group_is_idle(group);
+	group->idle_streams &= ~stream_mask;
+	if (was_idle && !pancsf_group_is_idle(group))
+		list_move_tail(&group->run_node, queue);
+
+	/* RT groups are preemptive. */
+	if (group->priority == PANCSF_CSG_PRIORITY_RT) {
+		mod_delayed_work(sched->wq, &sched->tick_work, 0);
+		return;
+	}
+
+	/* Some groups might be idle, force an immediate tick to
+	 * re-evaluate.
+	 */
+	if (sched->might_have_idle_groups) {
+		mod_delayed_work(sched->wq, &sched->tick_work, 0);
+		return;
+	}
+
+	/* Scheduler is ticking, nothing to do. */
+	if (sched->resched_target != U64_MAX) {
+		/* If there are free slots, force an immediate tick. */
+		if (sched->used_csg_slot_count < sched->csg_slot_count)
+			mod_delayed_work(sched->wq, &sched->tick_work, 0);
+
+		return;
+	}
+
+	/* Scheduler tick was off, recalculate the resched_target based on the
+	 * last tick event, and queue the scheduler work.
+	 */
+	now = get_jiffies_64();
+	sched->resched_target = sched->last_tick + sched->tick_period;
+	if (sched->used_csg_slot_count == sched->csg_slot_count &&
+	    time_before64(now, sched->resched_target))
+		delay_jiffies = min_t(unsigned long, sched->resched_target - now, ULONG_MAX);
+
+	mod_delayed_work(sched->wq, &sched->tick_work, delay_jiffies);
+}
+
+static void pancsf_group_job_pending_work(struct work_struct *work)
+{
+	struct pancsf_group *group = container_of(work, struct pancsf_group, job_pending_work);
+	struct pancsf_device *pfdev = group->pfdev;
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_job *job;
+	u32 stream_mask = 0;
+	u32 i;
+
+	for (i = 0; i < group->stream_count; i++) {
+		struct pancsf_queue *stream = group->streams[i];
+
+		if (!stream)
+			continue;
+
+		while (true) {
+			spin_lock(&stream->jobs.lock);
+			job = list_first_entry_or_null(&stream->jobs.pending,
+						       struct pancsf_job, node);
+			spin_unlock(&stream->jobs.lock);
+
+			if (!job ||
+			    !pancsf_queue_can_take_new_jobs(stream) ||
+			    !pancsf_job_deps_done(job))
+				break;
+
+			pancsf_queue_submit_job(stream, job);
+			stream_mask |= BIT(i);
+		}
+	}
+
+	if (stream_mask) {
+		mutex_lock(&sched->lock);
+		if (group->csg_id < 0)
+			pancsf_group_queue_locked(group, stream_mask);
+		else
+			gpu_write(pfdev, CSF_DOORBELL(group->csg_id + 1), 1);
+		mutex_unlock(&sched->lock);
+	}
+
+	pancsf_group_put(group);
+}
+
+static void pancsf_ping_work(struct work_struct *work)
+{
+	struct pancsf_scheduler *sched = container_of(work,
+						      struct pancsf_scheduler,
+						      ping_work.work);
+	struct pancsf_device *pfdev = sched->pfdev;
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+	bool reset_pending, timedout = false;
+
+	mutex_lock(&sched->lock);
+	reset_pending = sched->reset_pending;
+	if (!reset_pending) {
+		sched->pending_reqs |= GLB_PING;
+		glb_iface->input->req = pancsf_toggle_reqs(glb_iface->input->req,
+							   glb_iface->output->ack,
+							   GLB_PING);
+		gpu_write(pfdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+	}
+	mutex_unlock(&sched->lock);
+
+	if (reset_pending)
+		goto out;
+
+	if (!wait_event_timeout(sched->reqs_acked,
+				!(sched->pending_reqs & GLB_PING),
+				msecs_to_jiffies(100))) {
+		mutex_lock(&sched->lock);
+		if ((sched->pending_reqs & GLB_PING) != 0 &&
+		    ((glb_iface->input->req ^ glb_iface->output->ack) & GLB_PING) != 0)
+			timedout = true;
+		mutex_unlock(&sched->lock);
+	}
+
+	if (timedout) {
+		dev_err(pfdev->dev, "FW ping timeout, scheduling a reset");
+		pancsf_sched_queue_reset(pfdev);
+	} else {
+		/* Next ping in 12 seconds. */
+		mod_delayed_work(sched->wq,
+				 &sched->ping_work,
+				 msecs_to_jiffies(12000));
+	}
+
+out:
+	return;
+}
+
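+/* Prepare for a GPU reset: suspend all resident groups (escalating to
+ * termination if the suspend times out), flush caches, then unbind the
+ * groups and put them back on the run/idle queues so they can be
+ * rescheduled after the reset.
+ */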
+static int pancsf_sched_pre_reset(struct pancsf_device *pfdev)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_csg_slots_upd_ctx upd_ctx;
+	u64 suspended_slots, faulty_slots;
+	int ret;
+	u32 i;
+
+	mutex_lock(&sched->lock);
+	csgs_upd_ctx_init(&upd_ctx);
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		struct pancsf_csg_slot *csg_slot = &sched->csg_slots[i];
+
+		if (csg_slot->group) {
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, i,
+						CSG_STATE_SUSPEND,
+						CSG_STATE_MASK);
+		}
+	}
+
+	suspended_slots = upd_ctx.update_mask;
+
+	ret = csgs_upd_ctx_apply_locked(pfdev, &upd_ctx);
+	suspended_slots &= ~upd_ctx.timedout_mask;
+	faulty_slots = upd_ctx.timedout_mask;
+
+	if (faulty_slots) {
+		u32 slot_mask = faulty_slots;
+
+		dev_err(pfdev->dev, "CSG suspend failed, escalating to termination");
+		csgs_upd_ctx_init(&upd_ctx);
+		while (slot_mask) {
+			u32 csg_id = ffs(slot_mask) - 1;
+
+			csgs_upd_ctx_queue_reqs(pfdev, &upd_ctx, csg_id,
+						CSG_STATE_TERMINATE,
+						CSG_STATE_MASK);
+			slot_mask &= ~BIT(csg_id);
+		}
+
+		csgs_upd_ctx_apply_locked(pfdev, &upd_ctx);
+
+		slot_mask = faulty_slots;
+		while (slot_mask) {
+			u32 csg_id = ffs(slot_mask) - 1;
+			struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+			/* The termination request timed out, but the soft-reset
+			 * will terminate all active groups anyway, so just force
+			 * the group state to terminated here.
+			 */
+			if (csg_slot->group->state != PANCSF_CS_GROUP_TERMINATED)
+				csg_slot->group->state = PANCSF_CS_GROUP_TERMINATED;
+			slot_mask &= ~BIT(csg_id);
+		}
+	}
+
+	/* Flush the L2 and LSC caches to make sure the suspend state is
+	 * up-to-date. If the flush fails, flag all suspended groups for
+	 * termination.
+	 */
+	if (suspended_slots) {
+		bool flush_caches_failed = false;
+		u32 slot_mask = suspended_slots;
+
+		if (pancsf_gpu_flush_caches(pfdev, CACHE_CLEAN, CACHE_CLEAN, 0))
+			flush_caches_failed = true;
+
+		while (slot_mask) {
+			u32 csg_id = ffs(slot_mask) - 1;
+			struct pancsf_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+			if (flush_caches_failed)
+				csg_slot->group->state = PANCSF_CS_GROUP_TERMINATED;
+			else
+				pancsf_queue_csg_sync_update_locked(pfdev, csg_id);
+
+			slot_mask &= ~BIT(csg_id);
+		}
+
+		if (flush_caches_failed)
+			faulty_slots |= suspended_slots;
+	}
+
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		struct pancsf_csg_slot *csg_slot = &sched->csg_slots[i];
+		struct pancsf_group *group = csg_slot->group;
+
+		if (!group)
+			continue;
+
+		pancsf_group_get(group);
+		pancsf_unbind_group_locked(group);
+
+		if (pancsf_group_can_run(group)) {
+			WARN_ON(!list_empty(&group->run_node));
+			list_add(&group->run_node,
+				 pancsf_group_is_idle(group) ?
+				 &sched->idle_queues[group->priority] :
+				 &sched->run_queues[group->priority]);
+		} else {
+			pancsf_group_queue_work(sched, group, term);
+		}
+		pancsf_group_put(group);
+	}
+
+	mutex_unlock(&sched->lock);
+
+	return 0;
+}
+
+static void pancsf_reset_work(struct work_struct *work)
+{
+	struct pancsf_scheduler *sched = container_of(work,
+						      struct pancsf_scheduler,
+						      reset_work);
+	struct pancsf_device *pfdev = sched->pfdev;
+	bool full_reload = false;
+	int ret;
+
+	pancsf_sched_pre_reset(pfdev);
+
+retry:
+	pancsf_mcu_pre_reset(pfdev);
+	pancsf_mmu_pre_reset(pfdev);
+	pancsf_gpu_soft_reset(pfdev);
+	pancsf_gpu_l2_power_on(pfdev);
+	pancsf_mmu_reset(pfdev);
+	ret = pancsf_mcu_reset(pfdev, full_reload);
+	if (ret && !full_reload) {
+		full_reload = true;
+		goto retry;
+	}
+
+	if (WARN_ON(ret && full_reload))
+		dev_err(pfdev->dev, "Failed to boot MCU after reset");
+
+	pancsf_global_init(pfdev);
+
+	mutex_lock(&sched->lock);
+	sched->reset_pending = false;
+	mutex_unlock(&sched->lock);
+	mod_delayed_work(pfdev->scheduler->wq,
+			 &pfdev->scheduler->tick_work,
+			 0);
+}
+
+static void pancsf_group_sync_upd_work(struct work_struct *work)
+{
+	struct pancsf_group *group = container_of(work,
+						       struct pancsf_group,
+						       sync_upd_work);
+	struct pancsf_job *job, *job_tmp;
+	LIST_HEAD(done_jobs);
+	u32 stream_idx;
+	bool cookie;
+
+	cookie = dma_fence_begin_signalling();
+	for (stream_idx = 0; stream_idx < group->stream_count; stream_idx++) {
+		struct pancsf_queue *stream = group->streams[stream_idx];
+		struct pancsf_syncobj_64b *syncobj;
+
+		if (!stream)
+			continue;
+
+		syncobj = group->syncobjs.kmap + (stream_idx * sizeof(*syncobj));
+
+		spin_lock(&stream->jobs.lock);
+		list_for_each_entry_safe(job, job_tmp, &stream->jobs.in_flight, node) {
+			struct pancsf_queue_fence *fence;
+			u64 job_seqno = job->ringbuf.start / (NUM_INSTRS_PER_SLOT * sizeof(u64));
+
+			if (!job->call_info.size)
+				continue;
+
+			fence = container_of(job->done_fence, struct pancsf_queue_fence, base);
+			if (syncobj->seqno < job_seqno)
+				break;
+
+			list_move_tail(&job->node, &done_jobs);
+		}
+		spin_unlock(&stream->jobs.lock);
+	}
+
+	list_for_each_entry(job, &done_jobs, node) {
+		if (cancel_delayed_work(&job->timeout_work))
+			pancsf_put_job(job);
+
+		dma_fence_signal(job->done_fence);
+	}
+	dma_fence_end_signalling(cookie);
+
+	list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
+		list_del_init(&job->node);
+		pancsf_put_job(job);
+	}
+
+	pancsf_group_put(group);
+}
+
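+/* Create a group on behalf of userspace: allocate the suspend buffers, the
+ * syncobj pool and the queues, register the group in the file group pool
+ * and park it on the idle queue until jobs are submitted.
+ */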
+int pancsf_create_group(struct pancsf_file *pfile,
+			const struct drm_pancsf_group_create *group_args,
+			const struct drm_pancsf_queue_create *queue_args)
+{
+	struct pancsf_device *pfdev = pfile->pfdev;
+	struct pancsf_group_pool *gpool = pfile->groups;
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	const struct pancsf_fw_csg_iface *csg_iface = pancsf_get_csg_iface(pfdev, 0);
+	struct pancsf_group *group = NULL;
+	u32 gid, i, suspend_size;
+	int ret;
+
+	if (group_args->priority > PANCSF_CSG_PRIORITY_HIGH)
+		return -EINVAL;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	spin_lock_init(&group->fatal_lock);
+	kref_init(&group->refcount);
+	group->state = PANCSF_CS_GROUP_CREATED;
+	group->as_id = -1;
+	group->csg_id = -1;
+
+	group->pfdev = pfdev;
+	group->max_compute_cores = group_args->max_compute_cores;
+	group->compute_core_mask = group_args->compute_core_mask & pfdev->gpu_info.shader_present;
+	group->max_fragment_cores = group_args->max_fragment_cores;
+	group->fragment_core_mask = group_args->fragment_core_mask & pfdev->gpu_info.shader_present;
+	group->max_tiler_cores = group_args->max_tiler_cores;
+	group->tiler_core_mask = group_args->tiler_core_mask & pfdev->gpu_info.tiler_present;
+	group->priority = group_args->priority;
+
+	INIT_LIST_HEAD(&group->wait_node);
+	INIT_LIST_HEAD(&group->run_node);
+	INIT_WORK(&group->job_pending_work, pancsf_group_job_pending_work);
+	INIT_WORK(&group->term_work, pancsf_group_term_work);
+	INIT_WORK(&group->sync_upd_work, pancsf_group_sync_upd_work);
+
+	if ((hweight64(group->compute_core_mask) == 0 && group_args->max_compute_cores > 0) ||
+	    (hweight64(group->fragment_core_mask) == 0 && group_args->max_fragment_cores > 0) ||
+	    (hweight64(group->tiler_core_mask) == 0 && group_args->max_tiler_cores > 0) ||
+	    (group->tiler_core_mask == 0 && !group->heaps)) {
+		ret = -EINVAL;
+		goto err_put_group;
+	}
+
+	group->vm = pancsf_vm_pool_get_vm(pfile->vms, group_args->vm_id);
+	if (!group->vm) {
+		ret = -EINVAL;
+		goto err_put_group;
+	}
+
+	/* We need to instantiate the heap pool if the group wants to use the tiler. */
+	if (group->tiler_core_mask) {
+		mutex_lock(&pfile->heaps_lock);
+		if (IS_ERR_OR_NULL(pfile->heaps))
+			pfile->heaps = pancsf_heap_pool_create(pfdev, group->vm);
+		mutex_unlock(&pfile->heaps_lock);
+
+		if (IS_ERR(pfile->heaps)) {
+			ret = PTR_ERR(pfile->heaps);
+			goto err_put_group;
+		}
+
+		group->heaps = pfile->heaps;
+	}
+
+	if (!IS_ERR_OR_NULL(pfile->heaps))
+		group->heaps = pfile->heaps;
+
+	suspend_size = csg_iface->control->suspend_size;
+	group->suspend_buf = pancsf_fw_alloc_suspend_buf_mem(pfdev, suspend_size);
+	if (IS_ERR(group->suspend_buf)) {
+		ret = PTR_ERR(group->suspend_buf);
+		group->suspend_buf = NULL;
+		goto err_put_group;
+	}
+
+	suspend_size = csg_iface->control->protm_suspend_size;
+	group->protm_suspend_buf = pancsf_fw_alloc_suspend_buf_mem(pfdev, suspend_size);
+	if (IS_ERR(group->protm_suspend_buf)) {
+		ret = PTR_ERR(group->protm_suspend_buf);
+		group->protm_suspend_buf = NULL;
+		goto err_put_group;
+	}
+
+	group->syncobjs.bo = pancsf_gem_create_and_map(pfdev, group->vm,
+						       group_args->queues.count *
+						       sizeof(struct pancsf_syncobj_64b),
+						       0,
+						       PANCSF_VMA_MAP_NOEXEC |
+						       PANCSF_VMA_MAP_UNCACHED |
+						       PANCSF_VMA_MAP_AUTO_VA,
+						       &group->syncobjs.gpu_va,
+						       (void **)&group->syncobjs.kmap);
+	if (IS_ERR(group->syncobjs.bo)) {
+		ret = PTR_ERR(group->syncobjs.bo);
+		goto err_put_group;
+	}
+
+	memset(group->syncobjs.kmap, 0, group_args->queues.count * sizeof(struct pancsf_syncobj_64b));
+
+	for (i = 0; i < group_args->queues.count; i++) {
+		group->streams[i] = pancsf_create_queue(group, &queue_args[i]);
+		if (IS_ERR(group->streams[i])) {
+			ret = PTR_ERR(group->streams[i]);
+			group->streams[i] = NULL;
+			goto err_put_group;
+		}
+
+		group->stream_count++;
+	}
+
+	group->idle_streams = GENMASK(group->stream_count - 1, 0);
+
+	mutex_lock(&gpool->lock);
+	ret = xa_alloc(&gpool->xa, &gid, group, XA_LIMIT(1, sched->csg_slot_count), GFP_KERNEL);
+	mutex_unlock(&gpool->lock);
+
+	if (ret)
+		goto err_put_group;
+
+	mutex_lock(&sched->lock);
+	list_add_tail(&group->run_node, &sched->idle_queues[group->priority]);
+	mutex_unlock(&sched->lock);
+	return gid;
+
+err_put_group:
+	pancsf_group_put(group);
+	return ret;
+}
+
+void pancsf_destroy_group(struct pancsf_file *pfile, u32 group_handle)
+{
+	struct pancsf_group_pool *gpool = pfile->groups;
+	struct pancsf_device *pfdev = pfile->pfdev;
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	struct pancsf_group *group;
+
+	mutex_lock(&gpool->lock);
+	group = xa_erase(&gpool->xa, group_handle);
+	mutex_unlock(&gpool->lock);
+
+	if (!group)
+		return;
+
+	mutex_lock(&sched->lock);
+	group->destroyed = true;
+	if (group->in_tick_ctx || group->csg_id >= 0) {
+		mod_delayed_work(sched->wq, &sched->tick_work, 0);
+	} else {
+		/* Remove from the run queues, so the scheduler can't
+		 * pick the group on the next tick.
+		 */
+		list_del_init(&group->run_node);
+		list_del_init(&group->wait_node);
+
+		pancsf_group_queue_work(sched, group, term);
+	}
+	mutex_unlock(&sched->lock);
+
+	pancsf_group_put(group);
+}
+
+int pancsf_group_pool_create(struct pancsf_file *pfile)
+{
+	struct pancsf_group_pool *gpool;
+
+	gpool = kzalloc(sizeof(*gpool), GFP_KERNEL);
+	if (!gpool)
+		return -ENOMEM;
+
+	xa_init_flags(&gpool->xa, XA_FLAGS_ALLOC1);
+	mutex_init(&gpool->lock);
+	pfile->groups = gpool;
+	return 0;
+}
+
+void pancsf_group_pool_destroy(struct pancsf_file *pfile)
+{
+	struct pancsf_group_pool *gpool = pfile->groups;
+	struct pancsf_group *group;
+	unsigned long i;
+
+	if (IS_ERR_OR_NULL(gpool))
+		return;
+
+	mutex_lock(&gpool->lock);
+	xa_for_each(&gpool->xa, i, group)
+		pancsf_group_put(group);
+	mutex_unlock(&gpool->lock);
+
+	mutex_destroy(&gpool->lock);
+	kfree(gpool);
+	pfile->groups = NULL;
+}
+
+static void pancsf_release_job(struct kref *ref)
+{
+	struct pancsf_job *job = container_of(ref, struct pancsf_job, refcount);
+	struct dma_fence *fence;
+	unsigned long index;
+
+	WARN_ON(cancel_delayed_work(&job->timeout_work));
+
+	if (job->cur_dep) {
+		dma_fence_remove_callback(job->cur_dep, &job->dep_cb);
+		dma_fence_put(job->cur_dep);
+	}
+
+	xa_for_each(&job->dependencies, index, fence) {
+		dma_fence_put(fence);
+	}
+	xa_destroy(&job->dependencies);
+
+	dma_fence_put(job->done_fence);
+
+	pancsf_group_put(job->group);
+
+	kfree(job);
+}
+
+void pancsf_put_job(struct pancsf_job *job)
+{
+	if (job)
+		kref_put(&job->refcount, pancsf_release_job);
+}
+
+int pancsf_add_job_dep(struct pancsf_job *job, struct dma_fence *fence)
+{
+	struct dma_fence *entry;
+	unsigned long index;
+	u32 id = 0;
+	int ret;
+
+	if (!fence)
+		return 0;
+
+	/* Deduplicate if we already depend on a fence from the same context.
+	 * This lets the size of the array of deps scale with the number of
+	 * engines involved, rather than the number of BOs.
+	 */
+	xa_for_each(&job->dependencies, index, entry) {
+		if (entry->context != fence->context)
+			continue;
+
+		if (dma_fence_is_later(fence, entry)) {
+			dma_fence_put(entry);
+			xa_store(&job->dependencies, index, fence, GFP_KERNEL);
+		} else {
+			dma_fence_put(fence);
+		}
+		return 0;
+	}
+
+	ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
+	if (ret != 0)
+		dma_fence_put(fence);
+
+	return ret;
+}
+
+struct dma_fence *pancsf_get_job_done_fence(struct pancsf_job *job)
+{
+	return job->done_fence;
+}
+
+int pancsf_push_job(struct pancsf_job *job)
+{
+	struct pancsf_group *group = job->group;
+	struct pancsf_queue *stream = group->streams[job->stream_idx];
+	struct pancsf_device *pfdev = group->pfdev;
+	bool kick_group = false;
+	int ret = 0;
+
+	kref_get(&job->refcount);
+
+	spin_lock(&group->fatal_lock);
+	if (pancsf_group_can_run(group)) {
+		spin_lock(&stream->jobs.lock);
+		kick_group = list_empty(&stream->jobs.pending);
+		list_add_tail(&job->node, &stream->jobs.pending);
+		spin_unlock(&stream->jobs.lock);
+	} else {
+		ret = -EINVAL;
+	}
+	spin_unlock(&group->fatal_lock);
+
+	if (kick_group)
+		pancsf_group_queue_work(pfdev->scheduler, group, job_pending);
+
+	if (ret)
+		pancsf_put_job(job);
+
+	return ret;
+}
+
+static void pancsf_job_timeout_work(struct work_struct *work)
+{
+	struct pancsf_job *job = container_of(work, struct pancsf_job, timeout_work.work);
+	struct pancsf_group *group = job->group;
+	struct pancsf_device *pfdev = group->pfdev;
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+
+	mutex_lock(&sched->lock);
+	group->timedout = true;
+	if (group->in_tick_ctx || group->csg_id >= 0) {
+		mod_delayed_work(sched->wq, &sched->tick_work, 0);
+	} else {
+		/* Remove from the run queues, so the scheduler can't
+		 * pick the group on the next tick.
+		 */
+		list_del_init(&group->run_node);
+		list_del_init(&group->wait_node);
+
+		pancsf_group_queue_work(sched, group, term);
+	}
+	mutex_unlock(&sched->lock);
+
+	pancsf_put_job(job);
+}
+
+struct pancsf_job *
+pancsf_create_job(struct pancsf_file *pfile,
+		  u16 group_handle, u8 stream_idx,
+		  const struct pancsf_call_info *cs_call)
+{
+	struct pancsf_group_pool *gpool = pfile->groups;
+	struct pancsf_group *group;
+	struct pancsf_job *job;
+	struct dma_fence *done_fence;
+	int ret;
+
+	if (!cs_call)
+		return ERR_PTR(-EINVAL);
+
+	if (!cs_call->size || !cs_call->start)
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&gpool->lock);
+	group = xa_load(&gpool->xa, group_handle);
+	if (group)
+		pancsf_group_get(group);
+	mutex_unlock(&gpool->lock);
+
+	if (!group)
+		return ERR_PTR(-EINVAL);
+
+	if (stream_idx >= group->stream_count || !group->streams[stream_idx]) {
+		ret = -EINVAL;
+		goto err_put_group;
+	}
+
+	job = kzalloc(sizeof(*job), GFP_KERNEL);
+	if (!job) {
+		ret = -ENOMEM;
+		goto err_put_group;
+	}
+
+	INIT_DELAYED_WORK(&job->timeout_work, pancsf_job_timeout_work);
+	job->group = group;
+	job->stream_idx = stream_idx;
+	job->call_info = *cs_call;
+
+	done_fence = pancsf_queue_fence_create(group->streams[stream_idx]);
+	if (IS_ERR(done_fence)) {
+		ret = PTR_ERR(done_fence);
+		goto err_free_job;
+	}
+
+	job->done_fence = done_fence;
+
+	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
+	kref_init(&job->refcount);
+
+	return job;
+
+err_free_job:
+	kfree(job);
+
+err_put_group:
+	pancsf_group_put(group);
+	return ERR_PTR(ret);
+}
+
+int pancsf_sched_init(struct pancsf_device *pfdev)
+{
+	const struct pancsf_fw_global_iface *glb_iface = pancsf_get_glb_iface(pfdev);
+	const struct pancsf_fw_csg_iface *csg_iface = pancsf_get_csg_iface(pfdev, 0);
+	const struct pancsf_fw_cs_iface *cs_iface = pancsf_get_cs_iface(pfdev, 0, 0);
+	u32 gpu_as_count, num_groups, i;
+	struct pancsf_scheduler *sched;
+	int prio;
+
+	sched = devm_kzalloc(pfdev->dev, sizeof(*sched), GFP_KERNEL);
+	if (!sched)
+		return -ENOMEM;
+
+	/* The highest bit in JOB_INT_* is reserved for global IRQs. That
+	 * leaves 31 bits for CSG IRQs, hence the MAX_CSGS clamp here.
+	 */
+	num_groups = min_t(u32, MAX_CSGS, glb_iface->control->group_num);
+
+	/* The FW-side scheduler might deadlock if two groups with the same
+	 * priority try to access a set of resources that overlaps, with part
+	 * of the resources being allocated to one group and the other part to
+	 * the other group, both groups waiting for the remaining resources to
+	 * be allocated. To avoid that, it is recommended to assign each CSG a
+	 * different priority. In theory we could allow several groups to have
+	 * the same CSG priority if they don't request the same resources, but
+	 * that makes the scheduling logic more complicated, so let's clamp
+	 * the number of CSG slots to MAX_CSG_PRIO + 1 for now.
+	 */
+	num_groups = min_t(u32, MAX_CSG_PRIO + 1, num_groups);
+
+	/* We need at least one AS for the MCU and one for the GPU contexts. */
+	gpu_as_count = hweight32(pfdev->gpu_info.as_present & GENMASK(31, 1));
+	if (!gpu_as_count) {
+		dev_err(pfdev->dev, "Not enough AS (%d, expected at least 2)",
+			gpu_as_count + 1);
+		return -EINVAL;
+	}
+
+	sched->pfdev = pfdev;
+	sched->sb_slot_count = CS_FEATURES_SCOREBOARDS(cs_iface->control->features);
+	sched->csg_slot_count = num_groups;
+	sched->cs_slot_count = csg_iface->control->stream_num;
+	sched->as_slot_count = gpu_as_count;
+	pfdev->csif_info.csg_slot_count = sched->csg_slot_count;
+	pfdev->csif_info.cs_slot_count = sched->cs_slot_count;
+	pfdev->csif_info.scoreboard_slot_count = sched->sb_slot_count;
+
+	sched->last_tick = 0;
+	sched->resched_target = U64_MAX;
+	sched->tick_period = msecs_to_jiffies(10);
+	INIT_DELAYED_WORK(&sched->tick_work, pancsf_tick_work);
+	INIT_DELAYED_WORK(&sched->ping_work, pancsf_ping_work);
+	INIT_WORK(&sched->sync_upd_work, pancsf_sync_upd_work);
+	INIT_WORK(&sched->reset_work, pancsf_reset_work);
+
+	mutex_init(&sched->lock);
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		INIT_LIST_HEAD(&sched->run_queues[prio]);
+		INIT_LIST_HEAD(&sched->idle_queues[prio]);
+	}
+	INIT_LIST_HEAD(&sched->wait_queue);
+
+	init_waitqueue_head(&sched->reqs_acked);
+	for (i = 0; i < num_groups; i++)
+		init_waitqueue_head(&sched->csg_slots[i].reqs_acked);
+
+	pfdev->scheduler = sched;
+
+	sched->wq = alloc_ordered_workqueue("panfrost-csf-sched", 0);
+	if (!sched->wq) {
+		dev_err(pfdev->dev, "Failed to allocate the scheduler workqueue");
+		return -ENOMEM;
+	}
+
+	pancsf_global_init(pfdev);
+	return 0;
+}
+
+void pancsf_sched_fini(struct pancsf_device *pfdev)
+{
+	struct pancsf_scheduler *sched = pfdev->scheduler;
+	int prio;
+
+	if (!sched || !sched->csg_slot_count)
+		return;
+
+	cancel_work_sync(&sched->reset_work);
+	cancel_work_sync(&sched->sync_upd_work);
+	cancel_delayed_work_sync(&sched->tick_work);
+	cancel_delayed_work_sync(&sched->ping_work);
+
+	if (sched->wq)
+		destroy_workqueue(sched->wq);
+
+	for (prio = PANCSF_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		WARN_ON(!list_empty(&sched->run_queues[prio]));
+		WARN_ON(!list_empty(&sched->idle_queues[prio]));
+	}
+
+	WARN_ON(!list_empty(&sched->wait_queue));
+	mutex_destroy(&sched->lock);
+
+	pfdev->scheduler = NULL;
+}
diff --git a/drivers/gpu/drm/pancsf/pancsf_sched.h b/drivers/gpu/drm/pancsf/pancsf_sched.h
new file mode 100644
index 000000000000..4c11463375e0
--- /dev/null
+++ b/drivers/gpu/drm/pancsf/pancsf_sched.h
@@ -0,0 +1,68 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANCSF_SCHED_H__
+#define __PANCSF_SCHED_H__
+
+#include <drm/pancsf_drm.h>
+
+struct dma_fence;
+struct drm_file;
+struct drm_gem_object;
+struct pancsf_device;
+struct pancsf_file;
+struct pancsf_job;
+
+struct pancsf_group_pool;
+
+int pancsf_create_group(struct pancsf_file *pfile,
+			const struct drm_pancsf_group_create *group_args,
+			const struct drm_pancsf_queue_create *queue_args);
+void pancsf_destroy_group(struct pancsf_file *pfile,
+			  u32 group_handle);
+
+struct pancsf_call_info {
+	u64 start;
+	u32 size;
+};
+
+struct pancsf_job *
+pancsf_create_job(struct pancsf_file *pfile,
+		  u16 group_handle, u8 stream_idx,
+		  const struct pancsf_call_info *cs_call);
+int pancsf_set_job_bos(struct pancsf_file *pfile,
+		       struct pancsf_job *job,
+		       u32 bo_count, struct drm_gem_object **bos);
+void pancsf_put_job(struct pancsf_job *job);
+int pancsf_push_job(struct pancsf_job *job);
+int pancsf_add_job_dep(struct pancsf_job *job,
+		       struct dma_fence *fence);
+struct dma_fence *pancsf_get_job_done_fence(struct pancsf_job *job);
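+
+/*
+ * Rough sketch of the expected in-kernel usage of the job API above, based
+ * on the current implementation. This is only an illustration, not a
+ * normative contract, and may change as the scheduler is reworked:
+ *
+ *	job = pancsf_create_job(pfile, group_handle, queue_idx, &cs_call);
+ *	pancsf_add_job_dep(job, in_fence);	// optional, once per dependency
+ *	done = pancsf_get_job_done_fence(job);	// out-fence exposed to userspace
+ *	ret = pancsf_push_job(job);		// grabs its own job reference
+ *	pancsf_put_job(job);			// drop the creation reference
+ */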
+
+int pancsf_group_pool_create(struct pancsf_file *pfile);
+void pancsf_group_pool_destroy(struct pancsf_file *pfile);
+
+void pancsf_sched_handle_job_irqs(struct pancsf_device *pfdev, u32 status);
+
+int pancsf_sched_init(struct pancsf_device *pfdev);
+void pancsf_sched_fini(struct pancsf_device *pfdev);
+int pancsf_mcu_init(struct pancsf_device *pfdev);
+void pancsf_mcu_fini(struct pancsf_device *pfdev);
+int pancsf_destroy_tiler_heap(struct pancsf_file *pfile,
+			      u32 handle);
+int pancsf_create_tiler_heap(struct pancsf_file *pfile,
+			     u32 initial_chunk_count,
+			     u32 chunk_size,
+			     u32 max_chunks,
+			     u32 target_in_flight,
+			     u64 *heap_ctx_gpu_va,
+			     u64 *first_chunk_gpu_va);
+
+#ifdef CONFIG_PM
+void pancsf_sched_resume(struct pancsf_device *pfdev);
+int pancsf_sched_suspend(struct pancsf_device *pfdev);
+#endif
+
+void pancsf_sched_queue_reset(struct pancsf_device *pfdev);
+
+#endif
diff --git a/include/uapi/drm/pancsf_drm.h b/include/uapi/drm/pancsf_drm.h
new file mode 100644
index 000000000000..d770d1d1037c
--- /dev/null
+++ b/include/uapi/drm/pancsf_drm.h
@@ -0,0 +1,414 @@ 
+/* SPDX-License-Identifier: MIT */
+/* Copyright (C) 2023 Collabora ltd. */
+#ifndef _PANCSF_DRM_H_
+#define _PANCSF_DRM_H_
+
+#include "drm.h"
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/*
+ * The userspace driver controls GPU cache flushing through CS instructions, but
+ * the flush reduction mechanism requires a flush_id. This flush_id could be
+ * queried with an ioctl, but Arm provides a well-isolated register page
+ * containing only this read-only register, so let's expose this page through
+ * a static mmap offset and allow direct mapping of this MMIO region so we
+ * can avoid the user <-> kernel round-trip.
+ */
+#define DRM_PANCSF_USER_MMIO_OFFSET		(0xffffull << 48)
+#define DRM_PANCSF_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANCSF_USER_MMIO_OFFSET | 0)
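+
+/*
+ * Illustrative userspace sketch (not part of this patch): mapping the
+ * read-only flush_id register page, assuming fd is an open pancsf render
+ * node, a 4K page and a 64-bit off_t (the mmap offset doesn't fit in
+ * 32 bits):
+ *
+ *	uint32_t *flush_id = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd,
+ *				  DRM_PANCSF_USER_FLUSH_ID_MMIO_OFFSET);
+ *	if (flush_id != MAP_FAILED)
+ *		latest_flush_id = *flush_id;
+ */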
+
+/* Place new ioctls at the end, don't re-order. */
+enum drm_pancsf_ioctl_id {
+	DRM_PANCSF_DEV_QUERY = 0,
+	DRM_PANCSF_VM_CREATE,
+	DRM_PANCSF_VM_DESTROY,
+	DRM_PANCSF_BO_CREATE,
+	DRM_PANCSF_BO_MMAP_OFFSET,
+	DRM_PANCSF_VM_MAP,
+	DRM_PANCSF_VM_UNMAP,
+	DRM_PANCSF_GROUP_CREATE,
+	DRM_PANCSF_GROUP_DESTROY,
+	DRM_PANCSF_TILER_HEAP_CREATE,
+	DRM_PANCSF_TILER_HEAP_DESTROY,
+	DRM_PANCSF_GROUP_SUBMIT,
+};
+
+#define DRM_IOCTL_PANCSF_DEV_QUERY		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_DEV_QUERY, struct drm_pancsf_dev_query)
+#define DRM_IOCTL_PANCSF_VM_CREATE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_CREATE, struct drm_pancsf_vm_create)
+#define DRM_IOCTL_PANCSF_VM_DESTROY		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_DESTROY, struct drm_pancsf_vm_destroy)
+#define DRM_IOCTL_PANCSF_BO_CREATE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_BO_CREATE, struct drm_pancsf_bo_create)
+#define DRM_IOCTL_PANCSF_BO_MMAP_OFFSET		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_BO_MMAP_OFFSET, struct drm_pancsf_bo_mmap_offset)
+#define DRM_IOCTL_PANCSF_VM_MAP			DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_MAP, struct drm_pancsf_vm_map)
+#define DRM_IOCTL_PANCSF_VM_UNMAP		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_VM_UNMAP, struct drm_pancsf_vm_unmap)
+#define DRM_IOCTL_PANCSF_GROUP_CREATE		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_GROUP_CREATE, struct drm_pancsf_group_create)
+#define DRM_IOCTL_PANCSF_GROUP_DESTROY		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_GROUP_DESTROY, struct drm_pancsf_group_destroy)
+#define DRM_IOCTL_PANCSF_TILER_HEAP_CREATE	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_TILER_HEAP_CREATE, struct drm_pancsf_tiler_heap_create)
+#define DRM_IOCTL_PANCSF_TILER_HEAP_DESTROY	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_TILER_HEAP_DESTROY, struct drm_pancsf_tiler_heap_destroy)
+#define DRM_IOCTL_PANCSF_GROUP_SUBMIT		DRM_IOWR(DRM_COMMAND_BASE + DRM_PANCSF_GROUP_SUBMIT, struct drm_pancsf_group_submit)
+
+/* Place new types at the end, don't re-order. */
+enum drm_pancsf_dev_query_type {
+	DRM_PANCSF_DEV_QUERY_GPU_INFO = 0,
+	DRM_PANCSF_DEV_QUERY_CSIF_INFO,
+};
+
+struct drm_pancsf_gpu_info {
+#define DRM_PANCSF_ARCH_MAJOR(x)		((x) >> 28)
+#define DRM_PANCSF_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
+#define DRM_PANCSF_ARCH_REV(x)			(((x) >> 20) & 0xf)
+#define DRM_PANCSF_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
+#define DRM_PANCSF_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
+#define DRM_PANCSF_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
+#define DRM_PANCSF_VERSION_STATUS(x)		((x) & 0xf)
+	__u32 gpu_id;
+	__u32 gpu_rev;
+#define DRM_PANCSF_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
+#define DRM_PANCSF_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
+#define DRM_PANCSF_CSHW_REV(x)			(((x) >> 16) & 0xf)
+#define DRM_PANCSF_MCU_MAJOR(x)			(((x) >> 10) & 0x3f)
+#define DRM_PANCSF_MCU_MINOR(x)			(((x) >> 4) & 0x3f)
+#define DRM_PANCSF_MCU_REV(x)			((x) & 0xf)
+	__u32 csf_id;
+	__u32 l2_features;
+	__u32 tiler_features;
+	__u32 mem_features;
+	__u32 mmu_features;
+	__u32 thread_features;
+	__u32 max_threads;
+	__u32 thread_max_workgroup_size;
+	__u32 thread_max_barrier_size;
+	__u32 coherency_features;
+	__u32 texture_features[4];
+	__u32 as_present;
+	__u32 core_group_count;
+	__u64 shader_present;
+	__u64 l2_present;
+	__u64 tiler_present;
+};
+
+struct drm_pancsf_csif_info {
+	__u32 csg_slot_count;
+	__u32 cs_slot_count;
+	__u32 cs_reg_count;
+	__u32 scoreboard_slot_count;
+	__u32 unpreserved_cs_reg_count;
+};
+
+struct drm_pancsf_dev_query {
+	/** @type: the query type (see enum drm_pancsf_dev_query_type). */
+	__u32 type;
+
+	/**
+	 * @size: size of the type being queried.
+	 *
+	 * If pointer is NULL, size is updated by the driver to provide the
+	 * output structure size. If pointer is not NULL, the driver will
+	 * only copy min(size, actual_structure_size) bytes to the pointer,
+	 * and update the size accordingly. This allows us to extend query
+	 * types without breaking userspace.
+	 */
+	__u32 size;
+
+	/**
+	 * @pointer: user pointer to a query type struct.
+	 *
+	 * Pointer can be NULL, in which case, nothing is copied, but the
+	 * actual structure size is returned. If not NULL, it must point to
+	 * a location that's large enough to hold size bytes.
+	 */
+	__u64 pointer;
+};
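+
+/*
+ * Illustrative two-step query sketch (not part of this patch), assuming fd
+ * is an open pancsf device node and error checking is omitted:
+ *
+ *	struct drm_pancsf_dev_query q = { .type = DRM_PANCSF_DEV_QUERY_GPU_INFO };
+ *	struct drm_pancsf_gpu_info info = { 0 };
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_DEV_QUERY, &q);	// q.size is now filled in
+ *	q.size = sizeof(info);
+ *	q.pointer = (__u64)(uintptr_t)&info;
+ *	ioctl(fd, DRM_IOCTL_PANCSF_DEV_QUERY, &q);	// info is now populated
+ */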
+
+struct drm_pancsf_vm_create {
+	/** @flags: VM flags, MBZ. */
+	__u32 flags;
+
+	/** @id: Returned VM ID */
+	__u32 id;
+};
+
+struct drm_pancsf_vm_destroy {
+	/** @id: ID of the VM to destroy */
+	__u32 id;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+};
+
+struct drm_pancsf_bo_create {
+	/**
+	 * @size: Requested size for the object
+	 *
+	 * The (page-aligned) allocated size for the object will be returned.
+	 */
+	__u64 size;
+
+	/**
+	 * @flags: Flags, currently unused, MBZ.
+	 */
+	__u32 flags;
+
+	/**
+	 * @vm_id: Attached VM, if any
+	 *
+	 * If a VM is specified, this BO must:
+	 *
+	 *  1. Only ever be bound to that VM.
+	 *
+	 *  2. Cannot be exported as a PRIME fd.
+	 *  2. Not be exported as a PRIME fd.
+	__u32 vm_id;
+
+	/**
+	 * @handle: Returned handle for the object.
+	 *
+	 * Object handles are nonzero.
+	 */
+	__u32 handle;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+};
+
+struct drm_pancsf_bo_mmap_offset {
+	/** @handle: Handle for the object being mapped. */
+	__u32 handle;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+
+	/** @offset: The fake offset to use for a subsequent mmap() call. */
+	__u64 offset;
+};
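+
+/*
+ * Illustrative sketch (not part of this patch): allocating a BO and
+ * CPU-mapping it through the fake mmap offset. fd is assumed to be an open
+ * pancsf render node, and error checking is omitted:
+ *
+ *	struct drm_pancsf_bo_create bo = { .size = 4096 };
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_BO_CREATE, &bo);
+ *
+ *	struct drm_pancsf_bo_mmap_offset off = { .handle = bo.handle };
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_BO_MMAP_OFFSET, &off);
+ *	void *cpu = mmap(NULL, bo.size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ *			 fd, off.offset);
+ */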
+
+#define PANCSF_VMA_MAP_READONLY		0x1
+#define PANCSF_VMA_MAP_NOEXEC		0x2
+#define PANCSF_VMA_MAP_UNCACHED		0x4
+#define PANCSF_VMA_MAP_FRAG_SHADER	0x8
+#define PANCSF_VMA_MAP_ON_FAULT		0x10
+#define PANCSF_VMA_MAP_AUTO_VA		0x20
+
+struct drm_pancsf_vm_map {
+	/** @vm_id: VM to map BO range to */
+	__u32 vm_id;
+
+	/** @flags: Combination of PANCSF_VMA_MAP_ flags */
+	__u32 flags;
+
+	/** @pad: padding field, MBZ. */
+	__u32 pad;
+
+	/** @bo_handle: Buffer object to map. */
+	__u32 bo_handle;
+
+	/** @bo_offset: Buffer object offset. */
+	__u64 bo_offset;
+
+	/**
+	 * @va: Virtual address to map the BO to. The mapping address is
+	 *	returned here when PANCSF_VMA_MAP_AUTO_VA is set.
+	 */
+	__u64 va;
+
+	/** @size: Size to map. */
+	__u64 size;
+};
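+
+/*
+ * Illustrative sketch (not part of this patch): mapping a whole BO at a
+ * kernel-chosen GPU VA. vm_id and bo come from previous VM_CREATE/BO_CREATE
+ * calls, and error checking is omitted:
+ *
+ *	struct drm_pancsf_vm_map map = {
+ *		.vm_id = vm_id,
+ *		.flags = PANCSF_VMA_MAP_NOEXEC | PANCSF_VMA_MAP_AUTO_VA,
+ *		.bo_handle = bo.handle,
+ *		.bo_offset = 0,
+ *		.size = bo.size,
+ *	};
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_VM_MAP, &map);
+ *	// map.va now holds the GPU VA picked by the kernel
+ */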
+
+struct drm_pancsf_vm_unmap {
+	/** @vm_id: VM to unmap the range from. */
+	__u32 vm_id;
+
+	/** @flags: MBZ. */
+	__u32 flags;
+
+	/** @va: Virtual address to unmap. */
+	__u64 va;
+
+	/** @size: Size to unmap. */
+	__u64 size;
+};
+
+enum drm_pancsf_sync_op_type {
+	DRM_PANCSF_SYNC_OP_WAIT = 0,
+	DRM_PANCSF_SYNC_OP_SIGNAL,
+};
+
+enum drm_pancsf_sync_handle_type {
+	DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ = 0,
+	DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ,
+};
+
+struct drm_pancsf_sync_op {
+	/** @op_type: Sync operation type. */
+	__u32 op_type;
+
+	/** @handle_type: Sync handle type. */
+	__u32 handle_type;
+
+	/** @handle: Sync handle. */
+	__u32 handle;
+
+	/** @flags: MBZ. */
+	__u32 flags;
+
+	/** @timeline_value: MBZ if handle_type != DRM_PANCSF_SYNC_HANDLE_TYPE_TIMELINE_SYNCOBJ. */
+	__u64 timeline_value;
+};
+
+struct drm_pancsf_obj_array {
+	/** @stride: Stride of object struct. Used for versioning. */
+	__u32 stride;
+
+	/** @count: Number of objects in the array. */
+	__u32 count;
+
+	/** @array: User pointer to an array of objects. */
+	__u64 array;
+};
+
+#define DRM_PANCSF_OBJ_ARRAY(cnt, ptr) \
+	{ .stride = sizeof(ptr[0]), .count = cnt, .array = (__u64)(uintptr_t)ptr }
+
+struct drm_pancsf_queue_submit {
+	/** @queue_index: Index of the queue inside a group. */
+	__u32 queue_index;
+
+	/** @stream_size: Size of the command stream to execute. */
+	__u32 stream_size;
+
+	/** @stream_addr: GPU address of the command stream to execute. */
+	__u64 stream_addr;
+
+	/** @syncs: Array of sync operations. */
+	struct drm_pancsf_obj_array syncs;
+};
+
+struct drm_pancsf_group_submit {
+	/** @group_handle: Handle of the group to queue jobs to. */
+	__u32 group_handle;
+
+	/** @queue_submits: Array of drm_pancsf_queue_submit objects. */
+	struct drm_pancsf_obj_array queue_submits;
+};
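+
+/*
+ * Illustrative sketch (not part of this patch): submitting one command
+ * stream to queue 0 of a group, signaling a syncobj on completion.
+ * group_handle, cs_va/cs_size and out_syncobj are assumed to exist, and
+ * error checking is omitted:
+ *
+ *	struct drm_pancsf_sync_op sync = {
+ *		.op_type = DRM_PANCSF_SYNC_OP_SIGNAL,
+ *		.handle_type = DRM_PANCSF_SYNC_HANDLE_TYPE_SYNCOBJ,
+ *		.handle = out_syncobj,
+ *	};
+ *	struct drm_pancsf_queue_submit qsubmit = {
+ *		.queue_index = 0,
+ *		.stream_size = cs_size,
+ *		.stream_addr = cs_va,
+ *		.syncs = DRM_PANCSF_OBJ_ARRAY(1, &sync),
+ *	};
+ *	struct drm_pancsf_group_submit gsubmit = {
+ *		.group_handle = group_handle,
+ *		.queue_submits = DRM_PANCSF_OBJ_ARRAY(1, &qsubmit),
+ *	};
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_GROUP_SUBMIT, &gsubmit);
+ */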
+
+struct drm_pancsf_queue_create {
+	/**
+	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
+	 *	      15 being the highest priority.
+	 */
+	__u8 priority;
+
+	/** @pad: Padding fields, MBZ. */
+	__u8 pad[3];
+
+	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
+	__u32 ringbuf_size;
+};
+
+enum drm_pancsf_group_priority {
+	PANCSF_GROUP_PRIORITY_LOW = 0,
+	PANCSF_GROUP_PRIORITY_MEDIUM,
+	PANCSF_GROUP_PRIORITY_HIGH,
+};
+
+struct drm_pancsf_group_create {
+	/** @queues: Array of drm_pancsf_queue_create elements. */
+	struct drm_pancsf_obj_array queues;
+
+	/**
+	 * @max_compute_cores: Maximum number of cores that can be
+	 *		       used by compute jobs across CS queues
+	 *		       bound to this group.
+	 */
+	__u8 max_compute_cores;
+
+	/**
+	 * @max_fragment_cores: Maximum number of cores that can be
+	 *			used by fragment jobs across CS queues
+	 *			bound to this group.
+	 */
+	__u8 max_fragment_cores;
+
+	/**
+	 * @max_tiler_cores: Maximum number of tilers that can be
+	 *		     used by tiler jobs across CS queues
+	 *		     bound to this group.
+	 */
+	__u8 max_tiler_cores;
+
+	/** @priority: Group priority (see enum drm_pancsf_group_priority). */
+	__u8 priority;
+
+	/** @compute_core_mask: Mask encoding cores that can be used for compute jobs. */
+	__u64 compute_core_mask;
+
+	/** @fragment_core_mask: Mask encoding cores that can be used for fragment jobs. */
+	__u64 fragment_core_mask;
+
+	/** @tiler_core_mask: Mask encoding cores that can be used for tiler jobs. */
+	__u64 tiler_core_mask;
+
+	/**
+	 * @vm_id: VM ID to bind this group to. All submission to queues bound to this
+	 *	   group will use this VM.
+	 */
+	__u32 vm_id;
+
+	/**
+	 * @group_handle: Returned group handle. Passed back when submitting jobs or
+	 *		  destroying a group.
+	 */
+	__u32 group_handle;
+};
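+
+/*
+ * Illustrative sketch (not part of this patch): creating a group with a
+ * single queue. fd and vm_id come from previous calls; the core counts and
+ * masks below are arbitrary example values and would normally be derived
+ * from DRM_PANCSF_DEV_QUERY_GPU_INFO:
+ *
+ *	struct drm_pancsf_queue_create qc = {
+ *		.priority = 1,
+ *		.ringbuf_size = 64 * 1024,
+ *	};
+ *	struct drm_pancsf_group_create gc = {
+ *		.queues = DRM_PANCSF_OBJ_ARRAY(1, &qc),
+ *		.max_compute_cores = 8,
+ *		.max_fragment_cores = 8,
+ *		.max_tiler_cores = 1,
+ *		.priority = PANCSF_GROUP_PRIORITY_MEDIUM,
+ *		.compute_core_mask = 0xff,
+ *		.fragment_core_mask = 0xff,
+ *		.tiler_core_mask = 0x1,
+ *		.vm_id = vm_id,
+ *	};
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_GROUP_CREATE, &gc);
+ *	// gc.group_handle is then passed to GROUP_SUBMIT/GROUP_DESTROY
+ */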
+
+struct drm_pancsf_group_destroy {
+	/** @group_handle: Group to destroy */
+	__u32 group_handle;
+
+	/** @pad: Padding field, MBZ. */
+	__u32 pad;
+};
+
+struct drm_pancsf_tiler_heap_create {
+	/** @vm_id: VM ID the tiler heap should be mapped to */
+	__u32 vm_id;
+
+	/** @initial_chunk_count: Initial number of chunks to allocate. */
+	__u32 initial_chunk_count;
+
+	/** @chunk_size: Chunk size. Must be a power of two, and at least 256KB. */
+	__u32 chunk_size;
+
+	/** @max_chunks: Maximum number of chunks that can be allocated. */
+	__u32 max_chunks;
+
+	/**
+	 * @target_in_flight: Maximum number of in-flight render passes.
+	 *
+	 * If exceeded, the FW will wait for render passes to finish before
+	 * queuing new tiler jobs.
+	 */
+	__u32 target_in_flight;
+
+	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
+	__u32 handle;
+
+	/** @tiler_heap_ctx_gpu_va: Returned GPU virtual address of the heap context. */
+	__u64 tiler_heap_ctx_gpu_va;
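+
+	/** @first_heap_chunk_gpu_va: Returned GPU virtual address of the first heap chunk. */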
+	__u64 first_heap_chunk_gpu_va;
+};
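+
+/*
+ * Illustrative sketch (not part of this patch): creating a tiler heap, with
+ * arbitrary example values. fd and vm_id come from previous calls, and
+ * error checking is omitted:
+ *
+ *	struct drm_pancsf_tiler_heap_create heap = {
+ *		.vm_id = vm_id,
+ *		.initial_chunk_count = 4,
+ *		.chunk_size = 2 * 1024 * 1024,
+ *		.max_chunks = 64,
+ *		.target_in_flight = 0xffff,
+ *	};
+ *
+ *	ioctl(fd, DRM_IOCTL_PANCSF_TILER_HEAP_CREATE, &heap);
+ *	// heap.handle, heap.tiler_heap_ctx_gpu_va and
+ *	// heap.first_heap_chunk_gpu_va are filled on return
+ */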
+
+struct drm_pancsf_tiler_heap_destroy {
+	/** @handle: Handle of the tiler heap to destroy */
+	__u32 handle;
+
+	/** @pad: Padding field, MBZ. */
+	__u32 pad;
+};
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif /* _PANCSF_DRM_H_ */