[7/9] drm/i915/gt: Pipelined page migration

Message ID	20210608092846.64198-8-thomas.hellstrom@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=F+1B=LC=lists.freedesktop.org=intel-gfx-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 044806127A IronPort-SDR: ygWIbymT9awPFdITUiAw+JzMpRGXMA9AxIsSEk4pfUJln54egLCH14LPsVK4eYlJ/do7enZ01f Wyv/bSY39p9Q== IronPort-SDR: FUVfUZBy/YMyQAWhfRn+6Xz06MrO4O5kwIPuJU4esCJCrsBdf+boFhDGPU8qT/tCcaRWZTYj1I wgn1D1oZ9LWw== From: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com> To: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org Date: Tue, 8 Jun 2021 11:28:44 +0200 Message-Id: <20210608092846.64198-8-thomas.hellstrom@linux.intel.com> In-Reply-To: <20210608092846.64198-1-thomas.hellstrom@linux.intel.com> References: <20210608092846.64198-1-thomas.hellstrom@linux.intel.com> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 7/9] drm/i915/gt: Pipelined page migration Precedence: list Cc: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>, matthew.auld@intel.com, Chris Wilson <chris@chris-wilson.co.uk> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	Prereqs for TTM accelerated migration \| expand [0/9] Prereqs for TTM accelerated migration [1/9] drm/i915: Reference objects on the ww object list [2/9] drm/i915: Break out dma_resv ww locking utilities to separate files [3/9] drm/i915: Introduce a ww transaction helper [4/9] drm/i915/gt: Add an insert_entry for gen8_ppgtt [5/9] drm/i915/gt: Add a routine to iterate over the pagetables of a GTT [6/9] drm/i915/gt: Export the pinned context constructor [7/9] drm/i915/gt: Pipelined page migration [8/9] drm/i915/gt: Pipelined clear [9/9] drm/i915/gt: Setup a default migration context on the GT

Message ID

20210608092846.64198-8-thomas.hellstrom@linux.intel.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 044806127A
IronPort-SDR: 
 ygWIbymT9awPFdITUiAw+JzMpRGXMA9AxIsSEk4pfUJln54egLCH14LPsVK4eYlJ/do7enZ01f
 Wyv/bSY39p9Q==
IronPort-SDR: 
 FUVfUZBy/YMyQAWhfRn+6Xz06MrO4O5kwIPuJU4esCJCrsBdf+boFhDGPU8qT/tCcaRWZTYj1I
 wgn1D1oZ9LWw==
From: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>
To: intel-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org
Date: Tue,  8 Jun 2021 11:28:44 +0200
Message-Id: <20210608092846.64198-8-thomas.hellstrom@linux.intel.com>
In-Reply-To: <20210608092846.64198-1-thomas.hellstrom@linux.intel.com>
References: <20210608092846.64198-1-thomas.hellstrom@linux.intel.com>
MIME-Version: 1.0
Subject: [Intel-gfx] [PATCH 7/9] drm/i915/gt: Pipelined page migration
Precedence: list
Cc: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>,
 matthew.auld@intel.com, Chris Wilson <chris@chris-wilson.co.uk>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

Prereqs for TTM accelerated migration | expand

Commit Message

Thomas Hellström June 8, 2021, 9:28 a.m. UTC

From: Chris Wilson <chris@chris-wilson.co.uk>

If we pipeline the PTE updates and then do the copy of those pages
within a single unpreemptible command packet, we can submit the copies
and leave them to be scheduled without having to synchronously wait
under a global lock. In order to manage migration, we need to
preallocate the page tables (and keep them pinned and available for use
at any time), causing a bottleneck for migrations as all clients must
contend on the limited resources. By inlining the ppGTT updates and
performing the blit atomically, each client only owns the PTE while in
use, and so we can reschedule individual operations however we see fit.
And most importantly, we do not need to take a global lock on the shared
vm, and wait until the operation is complete before releasing the lock
for others to claim the PTE for themselves.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/i915/Makefile                 |   1 +
 drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
 drivers/gpu/drm/i915/gt/intel_gpu_commands.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_migrate.c       | 543 ++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_migrate.h       |  45 ++
 drivers/gpu/drm/i915/gt/intel_migrate_types.h |  15 +
 drivers/gpu/drm/i915/gt/intel_ring.h          |   1 +
 drivers/gpu/drm/i915/gt/selftest_migrate.c    | 291 ++++++++++
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 9 files changed, 900 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
 create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate_types.h
 create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c

Comments

Matthew Auld June 8, 2021, 4:18 p.m. UTC | #1

On Tue, 8 Jun 2021 at 10:29, Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> From: Chris Wilson <chris@chris-wilson.co.uk>
>
> If we pipeline the PTE updates and then do the copy of those pages
> within a single unpreemptible command packet, we can submit the copies
> and leave them to be scheduled without having to synchronously wait
> under a global lock. In order to manage migration, we need to
> preallocate the page tables (and keep them pinned and available for use
> at any time), causing a bottleneck for migrations as all clients must
> contend on the limited resources. By inlining the ppGTT updates and
> performing the blit atomically, each client only owns the PTE while in
> use, and so we can reschedule individual operations however we see fit.
> And most importantly, we do not need to take a global lock on the shared
> vm, and wait until the operation is complete before releasing the lock
> for others to claim the PTE for themselves.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                 |   1 +
>  drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
>  drivers/gpu/drm/i915/gt/intel_gpu_commands.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_migrate.c       | 543 ++++++++++++++++++
>  drivers/gpu/drm/i915/gt/intel_migrate.h       |  45 ++
>  drivers/gpu/drm/i915/gt/intel_migrate_types.h |  15 +
>  drivers/gpu/drm/i915/gt/intel_ring.h          |   1 +
>  drivers/gpu/drm/i915/gt/selftest_migrate.c    | 291 ++++++++++
>  .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>  9 files changed, 900 insertions(+)
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate_types.h
>  create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
>
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index ea8ee4b3e018..9f18902be626 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -109,6 +109,7 @@ gt-y += \
>         gt/intel_gtt.o \
>         gt/intel_llc.o \
>         gt/intel_lrc.o \
> +       gt/intel_migrate.o \
>         gt/intel_mocs.o \
>         gt/intel_ppgtt.o \
>         gt/intel_rc6.o \
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 0862c42b4cac..949965680c37 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
>  #define I915_GEM_HWS_PREEMPT_ADDR      (I915_GEM_HWS_PREEMPT * sizeof(u32))
>  #define I915_GEM_HWS_SEQNO             0x40
>  #define I915_GEM_HWS_SEQNO_ADDR                (I915_GEM_HWS_SEQNO * sizeof(u32))
> +#define I915_GEM_HWS_MIGRATE           (0x42 * sizeof(u32))
>  #define I915_GEM_HWS_SCRATCH           0x80
>
>  #define I915_HWS_CSB_BUF0_INDEX                0x10
> diff --git a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> index 2694dbb9967e..1c3af0fc0456 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> @@ -123,8 +123,10 @@
>  #define   MI_SEMAPHORE_SAD_NEQ_SDD     (5 << 12)
>  #define   MI_SEMAPHORE_TOKEN_MASK      REG_GENMASK(9, 5)
>  #define   MI_SEMAPHORE_TOKEN_SHIFT     5
> +#define MI_STORE_DATA_IMM      MI_INSTR(0x20, 0)
>  #define MI_STORE_DWORD_IMM     MI_INSTR(0x20, 1)
>  #define MI_STORE_DWORD_IMM_GEN4        MI_INSTR(0x20, 2)
> +#define MI_STORE_QWORD_IMM_GEN8 (MI_INSTR(0x20, 3) | REG_BIT(21))
>  #define   MI_MEM_VIRTUAL       (1 << 22) /* 945,g33,965 */
>  #define   MI_USE_GGTT          (1 << 22) /* g4x+ */
>  #define MI_STORE_DWORD_INDEX   MI_INSTR(0x21, 1)
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> new file mode 100644
> index 000000000000..1f60f8ee36f8
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -0,0 +1,543 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2020 Intel Corporation
> + */
> +
> +#include "i915_drv.h"
> +#include "intel_context.h"
> +#include "intel_gpu_commands.h"
> +#include "intel_gt.h"
> +#include "intel_gtt.h"
> +#include "intel_migrate.h"
> +#include "intel_ring.h"
> +
> +struct insert_pte_data {
> +       u64 offset;
> +       bool is_lmem;
> +};
> +
> +#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
> +
> +static bool engine_supports_migration(struct intel_engine_cs *engine)
> +{
> +       if (!engine)
> +               return false;
> +
> +       /*
> +        * We need the ability to prevent aribtration (MI_ARB_ON_OFF),
> +        * the ability to write PTE using inline data (MI_STORE_DATA)
> +        * and of course the ability to do the block transfer (blits).
> +        */
> +       GEM_BUG_ON(engine->class != COPY_ENGINE_CLASS);
> +
> +       return true;
> +}
> +
> +static void insert_pte(struct i915_address_space *vm,
> +                      struct i915_page_table *pt,
> +                      void *data)
> +{
> +       struct insert_pte_data *d = data;
> +
> +       vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE,
> +                       d->is_lmem ? PTE_LM : 0);
> +       d->offset += PAGE_SIZE;
> +}
> +
> +static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> +{
> +       struct i915_vm_pt_stash stash = {};
> +       struct i915_ppgtt *vm;
> +       int err;
> +       int i;
> +
> +       /*
> +        * We construct a very special VM for use by all migration contexts,
> +        * it is kept pinned so that it can be used at any time. As we need
> +        * to pre-allocate the page directories for the migration VM, this
> +        * limits us to only using a small number of prepared vma.
> +        *
> +        * To be able to pipeline and reschedule migration operations while
> +        * avoiding unnecessary contention on the vm itself, the PTE updates
> +        * are inline with the blits. All the blits use the same fixed
> +        * addresses, with the backing store redirection being updated on the
> +        * fly. Only 2 implicit vma are used for all migration operations.
> +        *
> +        * We lay the ppGTT out as:
> +        *
> +        *      [0, CHUNK_SZ) -> first object
> +        *      [CHUNK_SZ, 2 * CHUNK_SZ) -> second object
> +        *      [2 * CHUNK_SZ, 2 * CHUNK_SZ + 2 * CHUNK_SZ >> 9] -> PTE
> +        *
> +        * By exposing the dma addresses of the page directories themselves
> +        * within the ppGTT, we are then able to rewrite the PTE prior to use.
> +        * But the PTE update and subsequent migration operation must be atomic,
> +        * i.e. within the same non-preemptible window so that we do not switch
> +        * to another migration context that overwrites the PTE.
> +        */
> +
> +       vm = i915_ppgtt_create(gt);
> +       if (IS_ERR(vm))
> +               return ERR_CAST(vm);
> +
> +       if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
> +               err = -ENODEV;
> +               goto err_vm;
> +       }
> +
> +       /*
> +        * Each engine instance is assigned its own chunk in the VM, so
> +        * that we can run multiple instances concurrently
> +        */
> +       for (i = 0; i < ARRAY_SIZE(gt->engine_class[COPY_ENGINE_CLASS]); i++) {
> +               struct intel_engine_cs *engine;
> +               u64 base = (u64)i << 32;
> +               struct insert_pte_data d = {};
> +               struct i915_gem_ww_ctx ww;
> +               u64 sz;
> +
> +               engine = gt->engine_class[COPY_ENGINE_CLASS][i];
> +               if (!engine_supports_migration(engine))
> +                       continue;
> +
> +               /*
> +                * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
> +                * 4x2 page directories for source/destination.
> +                */
> +               sz = 2 * CHUNK_SZ;
> +               d.offset = base + sz;
> +
> +               /*
> +                * We need another page directory setup so that we can write
> +                * the 8x512 PTE in each chunk.
> +                */
> +               sz += (sz >> 12) * sizeof(u64);
> +
> +               err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
> +               if (err)
> +                       goto err_vm;
> +
> +               for_i915_gem_ww(&ww, err, true) {
> +                       err = i915_vm_lock_objects(&vm->vm, &ww);
> +                       if (err)
> +                               continue;
> +                       err = i915_vm_map_pt_stash(&vm->vm, &stash);
> +                       if (err)
> +                               continue;
> +
> +                       vm->vm.allocate_va_range(&vm->vm, &stash, base, base + sz);
> +               }
> +               i915_vm_free_pt_stash(&vm->vm, &stash);
> +               if (err)
> +                       goto err_vm;
> +
> +               /* Now allow the GPU to rewrite the PTE via its own ppGTT */
> +               d.is_lmem = i915_gem_object_is_lmem(vm->vm.scratch[0]);
> +               vm->vm.foreach(&vm->vm, base, base + sz, insert_pte, &d);

Will this play nice with Xe HP where we have min page size
restrictions for the GTT + lmem? The page-table allocations sidestep
such restrictions since they were previously never inserted into the
GTT. Maybe we need the flat-ppGTT at some point? Perhaps add a TODO or
something?

Thomas Hellström June 8, 2021, 7:05 p.m. UTC | #2

On Tue, 2021-06-08 at 17:18 +0100, Matthew Auld wrote:
> On Tue, 8 Jun 2021 at 10:29, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote:
> > 
> > From: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > If we pipeline the PTE updates and then do the copy of those pages
> > within a single unpreemptible command packet, we can submit the
> > copies
> > and leave them to be scheduled without having to synchronously wait
> > under a global lock. In order to manage migration, we need to
> > preallocate the page tables (and keep them pinned and available for
> > use
> > at any time), causing a bottleneck for migrations as all clients
> > must
> > contend on the limited resources. By inlining the ppGTT updates and
> > performing the blit atomically, each client only owns the PTE while
> > in
> > use, and so we can reschedule individual operations however we see
> > fit.
> > And most importantly, we do not need to take a global lock on the
> > shared
> > vm, and wait until the operation is complete before releasing the
> > lock
> > for others to claim the PTE for themselves.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Co-developed-by: Thomas Hellström
> > <thomas.hellstrom@linux.intel.com>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/Makefile                 |   1 +
> >  drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
> >  drivers/gpu/drm/i915/gt/intel_gpu_commands.h  |   2 +
> >  drivers/gpu/drm/i915/gt/intel_migrate.c       | 543
> > ++++++++++++++++++
> >  drivers/gpu/drm/i915/gt/intel_migrate.h       |  45 ++
> >  drivers/gpu/drm/i915/gt/intel_migrate_types.h |  15 +
> >  drivers/gpu/drm/i915/gt/intel_ring.h          |   1 +
> >  drivers/gpu/drm/i915/gt/selftest_migrate.c    | 291 ++++++++++
> >  .../drm/i915/selftests/i915_live_selftests.h  |   1 +
> >  9 files changed, 900 insertions(+)
> >  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
> >  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
> >  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate_types.h
> >  create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
> > 
> > diff --git a/drivers/gpu/drm/i915/Makefile
> > b/drivers/gpu/drm/i915/Makefile
> > index ea8ee4b3e018..9f18902be626 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -109,6 +109,7 @@ gt-y += \
> >         gt/intel_gtt.o \
> >         gt/intel_llc.o \
> >         gt/intel_lrc.o \
> > +       gt/intel_migrate.o \
> >         gt/intel_mocs.o \
> >         gt/intel_ppgtt.o \
> >         gt/intel_rc6.o \
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h
> > b/drivers/gpu/drm/i915/gt/intel_engine.h
> > index 0862c42b4cac..949965680c37 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> > @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs
> > *engine, int reg, u32 value)
> >  #define I915_GEM_HWS_PREEMPT_ADDR      (I915_GEM_HWS_PREEMPT *
> > sizeof(u32))
> >  #define I915_GEM_HWS_SEQNO             0x40
> >  #define I915_GEM_HWS_SEQNO_ADDR                (I915_GEM_HWS_SEQNO
> > * sizeof(u32))
> > +#define I915_GEM_HWS_MIGRATE           (0x42 * sizeof(u32))
> >  #define I915_GEM_HWS_SCRATCH           0x80
> > 
> >  #define I915_HWS_CSB_BUF0_INDEX                0x10
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> > b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> > index 2694dbb9967e..1c3af0fc0456 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> > @@ -123,8 +123,10 @@
> >  #define   MI_SEMAPHORE_SAD_NEQ_SDD     (5 << 12)
> >  #define   MI_SEMAPHORE_TOKEN_MASK      REG_GENMASK(9, 5)
> >  #define   MI_SEMAPHORE_TOKEN_SHIFT     5
> > +#define MI_STORE_DATA_IMM      MI_INSTR(0x20, 0)
> >  #define MI_STORE_DWORD_IMM     MI_INSTR(0x20, 1)
> >  #define MI_STORE_DWORD_IMM_GEN4        MI_INSTR(0x20, 2)
> > +#define MI_STORE_QWORD_IMM_GEN8 (MI_INSTR(0x20, 3) | REG_BIT(21))
> >  #define   MI_MEM_VIRTUAL       (1 << 22) /* 945,g33,965 */
> >  #define   MI_USE_GGTT          (1 << 22) /* g4x+ */
> >  #define MI_STORE_DWORD_INDEX   MI_INSTR(0x21, 1)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > new file mode 100644
> > index 000000000000..1f60f8ee36f8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > @@ -0,0 +1,543 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2020 Intel Corporation
> > + */
> > +
> > +#include "i915_drv.h"
> > +#include "intel_context.h"
> > +#include "intel_gpu_commands.h"
> > +#include "intel_gt.h"
> > +#include "intel_gtt.h"
> > +#include "intel_migrate.h"
> > +#include "intel_ring.h"
> > +
> > +struct insert_pte_data {
> > +       u64 offset;
> > +       bool is_lmem;
> > +};
> > +
> > +#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
> > +
> > +static bool engine_supports_migration(struct intel_engine_cs
> > *engine)
> > +{
> > +       if (!engine)
> > +               return false;
> > +
> > +       /*
> > +        * We need the ability to prevent aribtration
> > (MI_ARB_ON_OFF),
> > +        * the ability to write PTE using inline data
> > (MI_STORE_DATA)
> > +        * and of course the ability to do the block transfer
> > (blits).
> > +        */
> > +       GEM_BUG_ON(engine->class != COPY_ENGINE_CLASS);
> > +
> > +       return true;
> > +}
> > +
> > +static void insert_pte(struct i915_address_space *vm,
> > +                      struct i915_page_table *pt,
> > +                      void *data)
> > +{
> > +       struct insert_pte_data *d = data;
> > +
> > +       vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE,
> > +                       d->is_lmem ? PTE_LM : 0);
> > +       d->offset += PAGE_SIZE;
> > +}
> > +
> > +static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> > +{
> > +       struct i915_vm_pt_stash stash = {};
> > +       struct i915_ppgtt *vm;
> > +       int err;
> > +       int i;
> > +
> > +       /*
> > +        * We construct a very special VM for use by all migration
> > contexts,
> > +        * it is kept pinned so that it can be used at any time. As
> > we need
> > +        * to pre-allocate the page directories for the migration
> > VM, this
> > +        * limits us to only using a small number of prepared vma.
> > +        *
> > +        * To be able to pipeline and reschedule migration
> > operations while
> > +        * avoiding unnecessary contention on the vm itself, the
> > PTE updates
> > +        * are inline with the blits. All the blits use the same
> > fixed
> > +        * addresses, with the backing store redirection being
> > updated on the
> > +        * fly. Only 2 implicit vma are used for all migration
> > operations.
> > +        *
> > +        * We lay the ppGTT out as:
> > +        *
> > +        *      [0, CHUNK_SZ) -> first object
> > +        *      [CHUNK_SZ, 2 * CHUNK_SZ) -> second object
> > +        *      [2 * CHUNK_SZ, 2 * CHUNK_SZ + 2 * CHUNK_SZ >> 9] ->
> > PTE
> > +        *
> > +        * By exposing the dma addresses of the page directories
> > themselves
> > +        * within the ppGTT, we are then able to rewrite the PTE
> > prior to use.
> > +        * But the PTE update and subsequent migration operation
> > must be atomic,
> > +        * i.e. within the same non-preemptible window so that we
> > do not switch
> > +        * to another migration context that overwrites the PTE.
> > +        */
> > +
> > +       vm = i915_ppgtt_create(gt);
> > +       if (IS_ERR(vm))
> > +               return ERR_CAST(vm);
> > +
> > +       if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
> > +               err = -ENODEV;
> > +               goto err_vm;
> > +       }
> > +
> > +       /*
> > +        * Each engine instance is assigned its own chunk in the
> > VM, so
> > +        * that we can run multiple instances concurrently
> > +        */
> > +       for (i = 0; i < ARRAY_SIZE(gt-
> > >engine_class[COPY_ENGINE_CLASS]); i++) {
> > +               struct intel_engine_cs *engine;
> > +               u64 base = (u64)i << 32;
> > +               struct insert_pte_data d = {};
> > +               struct i915_gem_ww_ctx ww;
> > +               u64 sz;
> > +
> > +               engine = gt->engine_class[COPY_ENGINE_CLASS][i];
> > +               if (!engine_supports_migration(engine))
> > +                       continue;
> > +
> > +               /*
> > +                * We copy in 8MiB chunks. Each PDE covers 2MiB, so
> > we need
> > +                * 4x2 page directories for source/destination.
> > +                */
> > +               sz = 2 * CHUNK_SZ;
> > +               d.offset = base + sz;
> > +
> > +               /*
> > +                * We need another page directory setup so that we
> > can write
> > +                * the 8x512 PTE in each chunk.
> > +                */
> > +               sz += (sz >> 12) * sizeof(u64);
> > +
> > +               err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
> > +               if (err)
> > +                       goto err_vm;
> > +
> > +               for_i915_gem_ww(&ww, err, true) {
> > +                       err = i915_vm_lock_objects(&vm->vm, &ww);
> > +                       if (err)
> > +                               continue;
> > +                       err = i915_vm_map_pt_stash(&vm->vm,
> > &stash);
> > +                       if (err)
> > +                               continue;
> > +
> > +                       vm->vm.allocate_va_range(&vm->vm, &stash,
> > base, base + sz);
> > +               }
> > +               i915_vm_free_pt_stash(&vm->vm, &stash);
> > +               if (err)
> > +                       goto err_vm;
> > +
> > +               /* Now allow the GPU to rewrite the PTE via its own
> > ppGTT */
> > +               d.is_lmem = i915_gem_object_is_lmem(vm-
> > >vm.scratch[0]);
> > +               vm->vm.foreach(&vm->vm, base, base + sz,
> > insert_pte, &d);
> 
> Will this play nice with Xe HP where we have min page size
> restrictions for the GTT + lmem? The page-table allocations sidestep
> such restrictions since they were previously never inserted into the
> GTT. Maybe we need the flat-ppGTT at some point? Perhaps add a TODO
> or
> something?

I've previously discussed that with Chris, and apparently it should be
as easy as adjusting the sz >> 12 above. But yeah, Chris also brought
up the flat-ppGTT idea as something we should look into.

Mieanwhile, we should add a TODO.

/Thomas

Thomas Hellström June 8, 2021, 7:09 p.m. UTC | #3

On Tue, 2021-06-08 at 11:28 +0200, Thomas Hellström wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> If we pipeline the PTE updates and then do the copy of those pages
> within a single unpreemptible command packet, we can submit the
> copies
> and leave them to be scheduled without having to synchronously wait
> under a global lock. In order to manage migration, we need to
> preallocate the page tables (and keep them pinned and available for
> use
> at any time), causing a bottleneck for migrations as all clients must
> contend on the limited resources. By inlining the ppGTT updates and
> performing the blit atomically, each client only owns the PTE while
> in
> use, and so we can reschedule individual operations however we see
> fit.
> And most importantly, we do not need to take a global lock on the
> shared
> vm, and wait until the operation is complete before releasing the
> lock
> for others to claim the PTE for themselves.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/Makefile                 |   1 +
>  drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
>  drivers/gpu/drm/i915/gt/intel_gpu_commands.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_migrate.c       | 543
> ++++++++++++++++++
>  drivers/gpu/drm/i915/gt/intel_migrate.h       |  45 ++
>  drivers/gpu/drm/i915/gt/intel_migrate_types.h |  15 +
>  drivers/gpu/drm/i915/gt/intel_ring.h          |   1 +
>  drivers/gpu/drm/i915/gt/selftest_migrate.c    | 291 ++++++++++
>  .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>  9 files changed, 900 insertions(+)
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
>  create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate_types.h
>  create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
> 
> diff --git a/drivers/gpu/drm/i915/Makefile
> b/drivers/gpu/drm/i915/Makefile
> index ea8ee4b3e018..9f18902be626 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -109,6 +109,7 @@ gt-y += \
>         gt/intel_gtt.o \
>         gt/intel_llc.o \
>         gt/intel_lrc.o \
> +       gt/intel_migrate.o \
>         gt/intel_mocs.o \
>         gt/intel_ppgtt.o \
>         gt/intel_rc6.o \
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h
> b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 0862c42b4cac..949965680c37 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs
> *engine, int reg, u32 value)
>  #define I915_GEM_HWS_PREEMPT_ADDR      (I915_GEM_HWS_PREEMPT *
> sizeof(u32))
>  #define I915_GEM_HWS_SEQNO             0x40
>  #define I915_GEM_HWS_SEQNO_ADDR                (I915_GEM_HWS_SEQNO *
> sizeof(u32))
> +#define I915_GEM_HWS_MIGRATE           (0x42 * sizeof(u32))
>  #define I915_GEM_HWS_SCRATCH           0x80
>  
>  #define I915_HWS_CSB_BUF0_INDEX                0x10
> diff --git a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> index 2694dbb9967e..1c3af0fc0456 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
> @@ -123,8 +123,10 @@
>  #define   MI_SEMAPHORE_SAD_NEQ_SDD     (5 << 12)
>  #define   MI_SEMAPHORE_TOKEN_MASK      REG_GENMASK(9, 5)
>  #define   MI_SEMAPHORE_TOKEN_SHIFT     5
> +#define MI_STORE_DATA_IMM      MI_INSTR(0x20, 0)
>  #define MI_STORE_DWORD_IMM     MI_INSTR(0x20, 1)
>  #define MI_STORE_DWORD_IMM_GEN4        MI_INSTR(0x20, 2)
> +#define MI_STORE_QWORD_IMM_GEN8 (MI_INSTR(0x20, 3) | REG_BIT(21))
>  #define   MI_MEM_VIRTUAL       (1 << 22) /* 945,g33,965 */
>  #define   MI_USE_GGTT          (1 << 22) /* g4x+ */
>  #define MI_STORE_DWORD_INDEX   MI_INSTR(0x21, 1)
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> b/drivers/gpu/drm/i915/gt/intel_migrate.c
> new file mode 100644
> index 000000000000..1f60f8ee36f8
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -0,0 +1,543 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2020 Intel Corporation
> + */
> +
> +#include "i915_drv.h"
> +#include "intel_context.h"
> +#include "intel_gpu_commands.h"
> +#include "intel_gt.h"
> +#include "intel_gtt.h"
> +#include "intel_migrate.h"
> +#include "intel_ring.h"
> +
> 
...
> +
> +void intel_migrate_fini(struct intel_migrate *m)
> +{
> +       struct intel_context *ce;
> +
> +       ce = fetch_and_zero(&m->context);
> +       if (!ce)
> +               return;
> +
> +       intel_context_unpin(ce);
> +       intel_context_put(ce);
> +}

Hmm, CI hints at we should be exporting and using an
intel_engine_destroy_pinned_context() here...

/Thomas

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index ea8ee4b3e018..9f18902be626 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -109,6 +109,7 @@  gt-y += \
 	gt/intel_gtt.o \
 	gt/intel_llc.o \
 	gt/intel_lrc.o \
+	gt/intel_migrate.o \
 	gt/intel_mocs.o \
 	gt/intel_ppgtt.o \
 	gt/intel_rc6.o \
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 0862c42b4cac..949965680c37 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -188,6 +188,7 @@  intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
 #define I915_GEM_HWS_PREEMPT_ADDR	(I915_GEM_HWS_PREEMPT * sizeof(u32))
 #define I915_GEM_HWS_SEQNO		0x40
 #define I915_GEM_HWS_SEQNO_ADDR		(I915_GEM_HWS_SEQNO * sizeof(u32))
+#define I915_GEM_HWS_MIGRATE		(0x42 * sizeof(u32))
 #define I915_GEM_HWS_SCRATCH		0x80
 
 #define I915_HWS_CSB_BUF0_INDEX		0x10
diff --git a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
index 2694dbb9967e..1c3af0fc0456 100644
--- a/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
+++ b/drivers/gpu/drm/i915/gt/intel_gpu_commands.h
@@ -123,8 +123,10 @@ 
 #define   MI_SEMAPHORE_SAD_NEQ_SDD	(5 << 12)
 #define   MI_SEMAPHORE_TOKEN_MASK	REG_GENMASK(9, 5)
 #define   MI_SEMAPHORE_TOKEN_SHIFT	5
+#define MI_STORE_DATA_IMM	MI_INSTR(0x20, 0)
 #define MI_STORE_DWORD_IMM	MI_INSTR(0x20, 1)
 #define MI_STORE_DWORD_IMM_GEN4	MI_INSTR(0x20, 2)
+#define MI_STORE_QWORD_IMM_GEN8 (MI_INSTR(0x20, 3) | REG_BIT(21))
 #define   MI_MEM_VIRTUAL	(1 << 22) /* 945,g33,965 */
 #define   MI_USE_GGTT		(1 << 22) /* g4x+ */
 #define MI_STORE_DWORD_INDEX	MI_INSTR(0x21, 1)
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
new file mode 100644
index 000000000000..1f60f8ee36f8
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -0,0 +1,543 @@ 
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include "i915_drv.h"
+#include "intel_context.h"
+#include "intel_gpu_commands.h"
+#include "intel_gt.h"
+#include "intel_gtt.h"
+#include "intel_migrate.h"
+#include "intel_ring.h"
+
+struct insert_pte_data {
+	u64 offset;
+	bool is_lmem;
+};
+
+#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
+
+static bool engine_supports_migration(struct intel_engine_cs *engine)
+{
+	if (!engine)
+		return false;
+
+	/*
+	 * We need the ability to prevent aribtration (MI_ARB_ON_OFF),
+	 * the ability to write PTE using inline data (MI_STORE_DATA)
+	 * and of course the ability to do the block transfer (blits).
+	 */
+	GEM_BUG_ON(engine->class != COPY_ENGINE_CLASS);
+
+	return true;
+}
+
+static void insert_pte(struct i915_address_space *vm,
+		       struct i915_page_table *pt,
+		       void *data)
+{
+	struct insert_pte_data *d = data;
+
+	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE,
+			d->is_lmem ? PTE_LM : 0);
+	d->offset += PAGE_SIZE;
+}
+
+static struct i915_address_space *migrate_vm(struct intel_gt *gt)
+{
+	struct i915_vm_pt_stash stash = {};
+	struct i915_ppgtt *vm;
+	int err;
+	int i;
+
+	/*
+	 * We construct a very special VM for use by all migration contexts,
+	 * it is kept pinned so that it can be used at any time. As we need
+	 * to pre-allocate the page directories for the migration VM, this
+	 * limits us to only using a small number of prepared vma.
+	 *
+	 * To be able to pipeline and reschedule migration operations while
+	 * avoiding unnecessary contention on the vm itself, the PTE updates
+	 * are inline with the blits. All the blits use the same fixed
+	 * addresses, with the backing store redirection being updated on the
+	 * fly. Only 2 implicit vma are used for all migration operations.
+	 *
+	 * We lay the ppGTT out as:
+	 *
+	 *	[0, CHUNK_SZ) -> first object
+	 *	[CHUNK_SZ, 2 * CHUNK_SZ) -> second object
+	 *	[2 * CHUNK_SZ, 2 * CHUNK_SZ + 2 * CHUNK_SZ >> 9] -> PTE
+	 *
+	 * By exposing the dma addresses of the page directories themselves
+	 * within the ppGTT, we are then able to rewrite the PTE prior to use.
+	 * But the PTE update and subsequent migration operation must be atomic,
+	 * i.e. within the same non-preemptible window so that we do not switch
+	 * to another migration context that overwrites the PTE.
+	 */
+
+	vm = i915_ppgtt_create(gt);
+	if (IS_ERR(vm))
+		return ERR_CAST(vm);
+
+	if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
+		err = -ENODEV;
+		goto err_vm;
+	}
+
+	/*
+	 * Each engine instance is assigned its own chunk in the VM, so
+	 * that we can run multiple instances concurrently
+	 */
+	for (i = 0; i < ARRAY_SIZE(gt->engine_class[COPY_ENGINE_CLASS]); i++) {
+		struct intel_engine_cs *engine;
+		u64 base = (u64)i << 32;
+		struct insert_pte_data d = {};
+		struct i915_gem_ww_ctx ww;
+		u64 sz;
+
+		engine = gt->engine_class[COPY_ENGINE_CLASS][i];
+		if (!engine_supports_migration(engine))
+			continue;
+
+		/*
+		 * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
+		 * 4x2 page directories for source/destination.
+		 */
+		sz = 2 * CHUNK_SZ;
+		d.offset = base + sz;
+
+		/*
+		 * We need another page directory setup so that we can write
+		 * the 8x512 PTE in each chunk.
+		 */
+		sz += (sz >> 12) * sizeof(u64);
+
+		err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
+		if (err)
+			goto err_vm;
+
+		for_i915_gem_ww(&ww, err, true) {
+			err = i915_vm_lock_objects(&vm->vm, &ww);
+			if (err)
+				continue;
+			err = i915_vm_map_pt_stash(&vm->vm, &stash);
+			if (err)
+				continue;
+
+			vm->vm.allocate_va_range(&vm->vm, &stash, base, base + sz);
+		}
+		i915_vm_free_pt_stash(&vm->vm, &stash);
+		if (err)
+			goto err_vm;
+
+		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
+		d.is_lmem = i915_gem_object_is_lmem(vm->vm.scratch[0]);
+		vm->vm.foreach(&vm->vm, base, base + sz, insert_pte, &d);
+	}
+
+	return &vm->vm;
+
+err_vm:
+	i915_vm_put(&vm->vm);
+	return ERR_PTR(err);
+}
+
+static struct intel_engine_cs *first_copy_engine(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engine;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(gt->engine_class[COPY_ENGINE_CLASS]); i++) {
+		engine = gt->engine_class[COPY_ENGINE_CLASS][i];
+		if (engine_supports_migration(engine))
+			return engine;
+	}
+
+	return NULL;
+}
+
+static struct intel_context *pinned_context(struct intel_gt *gt)
+{
+	static struct lock_class_key key;
+	struct intel_engine_cs *engine;
+	struct i915_address_space *vm;
+	struct intel_context *ce;
+
+	engine = first_copy_engine(gt);
+	if (!engine)
+		return ERR_PTR(-ENODEV);
+
+	vm = migrate_vm(gt);
+	if (IS_ERR(vm))
+		return ERR_CAST(vm);
+
+	ce = intel_engine_create_pinned_context(engine, vm, SZ_512K,
+						I915_GEM_HWS_MIGRATE,
+						&key, "migrate");
+	i915_vm_put(ce->vm);
+	return ce;
+}
+
+int intel_migrate_init(struct intel_migrate *m, struct intel_gt *gt)
+{
+	struct intel_context *ce;
+
+	memset(m, 0, sizeof(*m));
+
+	ce = pinned_context(gt);
+	if (IS_ERR(ce))
+		return PTR_ERR(ce);
+
+	m->context = ce;
+	return 0;
+}
+
+static int random_index(unsigned int max)
+{
+	return upper_32_bits(mul_u32_u32(get_random_u32(), max));
+}
+
+static struct intel_context *__migrate_engines(struct intel_gt *gt)
+{
+	struct intel_engine_cs *engines[MAX_ENGINE_INSTANCE];
+	struct intel_engine_cs *engine;
+	unsigned int count, i;
+
+	count = 0;
+	for (i = 0; i < ARRAY_SIZE(gt->engine_class[COPY_ENGINE_CLASS]); i++) {
+		engine = gt->engine_class[COPY_ENGINE_CLASS][i];
+		if (engine_supports_migration(engine))
+			engines[count++] = engine;
+	}
+
+	return intel_context_create(engines[random_index(count)]);
+}
+
+struct intel_context *intel_migrate_create_context(struct intel_migrate *m)
+{
+	struct intel_context *ce;
+
+	/*
+	 * We randomly distribute contexts across the engines upon constrction,
+	 * as they all share the same pinned vm, and so in order to allow
+	 * multiple blits to run in parallel, we must construct each blit
+	 * to use a different range of the vm for its GTT. This has to be
+	 * known at construction, so we can not use the late greedy load
+	 * balancing of the virtual-engine.
+	 */
+	ce = __migrate_engines(m->context->engine->gt);
+	if (IS_ERR(ce))
+		return ce;
+
+	ce->ring = __intel_context_ring_size(SZ_256K);
+
+	i915_vm_put(ce->vm);
+	ce->vm = i915_vm_get(m->context->vm);
+
+	return ce;
+}
+
+static inline struct sgt_dma sg_sgt(struct scatterlist *sg)
+{
+	dma_addr_t addr = sg_dma_address(sg);
+
+	return (struct sgt_dma){ sg, addr, addr + sg_dma_len(sg) };
+}
+
+static int emit_no_arbitration(struct i915_request *rq)
+{
+	u32 *cs;
+
+	cs = intel_ring_begin(rq, 2);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Explicitly disable preemption for this request. */
+	*cs++ = MI_ARB_ON_OFF;
+	*cs++ = MI_NOOP;
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
+static int emit_pte(struct i915_request *rq,
+		    struct sgt_dma *it,
+		    enum i915_cache_level cache_level,
+		    bool is_lmem,
+		    u64 offset,
+		    int length)
+{
+	const u64 encode = rq->context->vm->pte_encode(0, cache_level,
+						       is_lmem ? PTE_LM : 0);
+	struct intel_ring *ring = rq->ring;
+	int total = 0;
+	u32 *hdr, *cs;
+	int pkt;
+
+	GEM_BUG_ON(INTEL_GEN(rq->engine->i915) < 8);
+
+	/* Compute the page directory offset for the target address range */
+	offset += (u64)rq->engine->instance << 32;
+	offset >>= 12;
+	offset *= sizeof(u64);
+	offset += 2 * CHUNK_SZ;
+
+	cs = intel_ring_begin(rq, 6);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/* Pack as many PTE updates as possible into a single MI command */
+	pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
+	pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
+
+	hdr = cs;
+	*cs++ = MI_STORE_DATA_IMM | REG_BIT(21); /* as qword elements */
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+
+	do {
+		if (cs - hdr >= pkt) {
+			*hdr += cs - hdr - 2;
+			*cs++ = MI_NOOP;
+
+			ring->emit = (void *)cs - ring->vaddr;
+			intel_ring_advance(rq, cs);
+			intel_ring_update_space(ring);
+
+			cs = intel_ring_begin(rq, 6);
+			if (IS_ERR(cs))
+				return PTR_ERR(cs);
+
+			pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
+			pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
+
+			hdr = cs;
+			*cs++ = MI_STORE_DATA_IMM | REG_BIT(21);
+			*cs++ = lower_32_bits(offset);
+			*cs++ = upper_32_bits(offset);
+		}
+
+		*cs++ = lower_32_bits(encode | it->dma);
+		*cs++ = upper_32_bits(encode | it->dma);
+
+		offset += 8;
+		total += I915_GTT_PAGE_SIZE;
+
+		it->dma += I915_GTT_PAGE_SIZE;
+		if (it->dma >= it->max) {
+			it->sg = __sg_next(it->sg);
+			if (!it->sg || sg_dma_len(it->sg) == 0)
+				break;
+
+			it->dma = sg_dma_address(it->sg);
+			it->max = it->dma + sg_dma_len(it->sg);
+		}
+	} while (total < length);
+
+	*hdr += cs - hdr - 2;
+	*cs++ = MI_NOOP;
+
+	ring->emit = (void *)cs - ring->vaddr;
+	intel_ring_advance(rq, cs);
+	intel_ring_update_space(ring);
+
+	return total;
+}
+
+static bool wa_1209644611_applies(int gen, u32 size)
+{
+	u32 height = size >> PAGE_SHIFT;
+
+	if (gen != 11)
+		return false;
+
+	return height % 4 == 3 && height <= 8;
+}
+
+static int emit_copy(struct i915_request *rq, int size)
+{
+	const int gen = INTEL_GEN(rq->engine->i915);
+	u32 instance = rq->engine->instance;
+	u32 *cs;
+
+	cs = intel_ring_begin(rq, gen >= 8 ? 10 : 6);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	if (gen >= 9 && !wa_1209644611_applies(gen, size)) {
+		*cs++ = GEN9_XY_FAST_COPY_BLT_CMD | (10 - 2);
+		*cs++ = BLT_DEPTH_32 | PAGE_SIZE;
+		*cs++ = 0;
+		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
+		*cs++ = CHUNK_SZ; /* dst offset */
+		*cs++ = instance;
+		*cs++ = 0;
+		*cs++ = PAGE_SIZE;
+		*cs++ = 0; /* src offset */
+		*cs++ = instance;
+	} else if (gen >= 8) {
+		*cs++ = XY_SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (10 - 2);
+		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
+		*cs++ = 0;
+		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
+		*cs++ = CHUNK_SZ; /* dst offset */
+		*cs++ = instance;
+		*cs++ = 0;
+		*cs++ = PAGE_SIZE;
+		*cs++ = 0; /* src offset */
+		*cs++ = instance;
+	} else {
+		GEM_BUG_ON(instance);
+		*cs++ = SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
+		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
+		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE;
+		*cs++ = CHUNK_SZ; /* dst offset */
+		*cs++ = PAGE_SIZE;
+		*cs++ = 0; /* src offset */
+	}
+
+	intel_ring_advance(rq, cs);
+	return 0;
+}
+
+int
+intel_context_migrate_copy(struct intel_context *ce,
+			   struct dma_fence *await,
+			   struct scatterlist *src,
+			   enum i915_cache_level src_cache_level,
+			   bool src_is_lmem,
+			   struct scatterlist *dst,
+			   enum i915_cache_level dst_cache_level,
+			   bool dst_is_lmem,
+			   struct i915_request **out)
+{
+	struct sgt_dma it_src = sg_sgt(src), it_dst = sg_sgt(dst);
+	struct i915_request *rq;
+	int err;
+
+	*out = NULL;
+
+	/* GEM_BUG_ON(ce->vm != migrate_vm); */
+
+	GEM_BUG_ON(ce->ring->size < SZ_64K);
+
+	do {
+		int len;
+
+		rq = i915_request_create(ce);
+		if (IS_ERR(rq)) {
+			err = PTR_ERR(rq);
+			goto out_ce;
+		}
+
+		if (await) {
+			err = i915_request_await_dma_fence(rq, await);
+			if (err)
+				goto out_rq;
+
+			if (rq->engine->emit_init_breadcrumb) {
+				err = rq->engine->emit_init_breadcrumb(rq);
+				if (err)
+					goto out_rq;
+			}
+
+			await = NULL;
+		}
+
+		/* The PTE updates + copy must not be interrupted. */
+		err = emit_no_arbitration(rq);
+		if (err)
+			goto out_rq;
+
+		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem, 0,
+			       CHUNK_SZ);
+		if (len <= 0) {
+			err = len;
+			goto out_rq;
+		}
+
+		err = emit_pte(rq, &it_dst, dst_cache_level, dst_is_lmem,
+			       CHUNK_SZ, len);
+		if (err < 0)
+			goto out_rq;
+		if (err < len) {
+			err = -EINVAL;
+			goto out_rq;
+		}
+
+		err = rq->engine->emit_flush(rq, EMIT_INVALIDATE);
+		if (err)
+			goto out_rq;
+
+		err = emit_copy(rq, len);
+
+		/* Arbitration is re-enabled between requests. */
+out_rq:
+		if (*out)
+			i915_request_put(*out);
+		*out = i915_request_get(rq);
+		i915_request_add(rq);
+		if (err || !it_src.sg || !sg_dma_len(it_src.sg))
+			break;
+
+		cond_resched();
+	} while (1);
+
+out_ce:
+	return err;
+}
+
+int intel_migrate_copy(struct intel_migrate *m,
+		       struct i915_gem_ww_ctx *ww,
+		       struct dma_fence *await,
+		       struct scatterlist *src,
+		       enum i915_cache_level src_cache_level,
+		       bool src_is_lmem,
+		       struct scatterlist *dst,
+		       enum i915_cache_level dst_cache_level,
+		       bool dst_is_lmem,
+		       struct i915_request **out)
+{
+	struct intel_context *ce;
+	int err;
+
+	*out = NULL;
+	if (!m->context)
+		return -ENODEV;
+
+	ce = intel_migrate_create_context(m);
+	if (IS_ERR(ce))
+		ce = intel_context_get(m->context);
+	GEM_BUG_ON(IS_ERR(ce));
+
+	err = intel_context_pin_ww(ce, ww);
+	if (err)
+		goto out;
+
+	err = intel_context_migrate_copy(ce, await,
+					 src, src_cache_level, src_is_lmem,
+					 dst, dst_cache_level, dst_is_lmem,
+					 out);
+
+	intel_context_unpin(ce);
+out:
+	intel_context_put(ce);
+	return err;
+}
+
+void intel_migrate_fini(struct intel_migrate *m)
+{
+	struct intel_context *ce;
+
+	ce = fetch_and_zero(&m->context);
+	if (!ce)
+		return;
+
+	intel_context_unpin(ce);
+	intel_context_put(ce);
+}
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+#include "selftest_migrate.c"
+#endif
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.h b/drivers/gpu/drm/i915/gt/intel_migrate.h
new file mode 100644
index 000000000000..32c61190ed73
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.h
@@ -0,0 +1,45 @@ 
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __INTEL_MIGRATE__
+#define __INTEL_MIGRATE__
+
+#include "intel_migrate_types.h"
+
+struct dma_fence;
+struct i915_request;
+struct i915_gem_ww_ctx;
+struct intel_gt;
+struct scatterlist;
+enum i915_cache_level;
+
+int intel_migrate_init(struct intel_migrate *m, struct intel_gt *gt);
+
+struct intel_context *intel_migrate_create_context(struct intel_migrate *m);
+
+int intel_migrate_copy(struct intel_migrate *m,
+		       struct i915_gem_ww_ctx *ww,
+		       struct dma_fence *await,
+		       struct scatterlist *src,
+		       enum i915_cache_level src_cache_level,
+		       bool src_is_lmem,
+		       struct scatterlist *dst,
+		       enum i915_cache_level dst_cache_level,
+		       bool dst_is_lmem,
+		       struct i915_request **out);
+
+int intel_context_migrate_copy(struct intel_context *ce,
+			       struct dma_fence *await,
+			       struct scatterlist *src,
+			       enum i915_cache_level src_cache_level,
+			       bool src_is_lmem,
+			       struct scatterlist *dst,
+			       enum i915_cache_level dst_cache_level,
+			       bool dst_is_lmem,
+			       struct i915_request **out);
+
+void intel_migrate_fini(struct intel_migrate *m);
+
+#endif /* __INTEL_MIGRATE__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate_types.h b/drivers/gpu/drm/i915/gt/intel_migrate_types.h
new file mode 100644
index 000000000000..d98230597f42
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_migrate_types.h
@@ -0,0 +1,15 @@ 
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __INTEL_MIGRATE_TYPES__
+#define __INTEL_MIGRATE_TYPES__
+
+struct intel_context;
+
+struct intel_migrate {
+	struct intel_context *context;
+};
+
+#endif /* __INTEL_MIGRATE_TYPES__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_ring.h b/drivers/gpu/drm/i915/gt/intel_ring.h
index dbf5f14a136f..1b32dadfb8c3 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.h
+++ b/drivers/gpu/drm/i915/gt/intel_ring.h
@@ -49,6 +49,7 @@  static inline void intel_ring_advance(struct i915_request *rq, u32 *cs)
 	 * intel_ring_begin()).
 	 */
 	GEM_BUG_ON((rq->ring->vaddr + rq->ring->emit) != cs);
+	GEM_BUG_ON(!IS_ALIGNED(rq->ring->emit, 8)); /* RING_TAIL qword align */
 }
 
 static inline u32 intel_ring_wrap(const struct intel_ring *ring, u32 pos)
diff --git a/drivers/gpu/drm/i915/gt/selftest_migrate.c b/drivers/gpu/drm/i915/gt/selftest_migrate.c
new file mode 100644
index 000000000000..9784d149ebf1
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/selftest_migrate.c
@@ -0,0 +1,291 @@ 
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include "selftests/i915_random.h"
+
+static const unsigned int sizes[] = {
+	SZ_4K,
+	SZ_64K,
+	SZ_2M,
+	CHUNK_SZ - SZ_4K,
+	CHUNK_SZ,
+	CHUNK_SZ + SZ_4K,
+	SZ_64M,
+};
+
+static struct drm_i915_gem_object *
+create_lmem_or_internal(struct drm_i915_private *i915, size_t size)
+{
+	if (HAS_LMEM(i915)) {
+		struct drm_i915_gem_object *obj;
+
+		obj = i915_gem_object_create_lmem(i915, size, 0);
+		if (!IS_ERR(obj))
+			return obj;
+	}
+
+	return i915_gem_object_create_internal(i915, size);
+}
+
+static int copy(struct intel_migrate *migrate,
+		int (*fn)(struct intel_migrate *migrate,
+			  struct i915_gem_ww_ctx *ww,
+			  struct drm_i915_gem_object *src,
+			  struct drm_i915_gem_object *dst,
+			  struct i915_request **out),
+		u32 sz, struct rnd_state *prng)
+{
+	struct drm_i915_private *i915 = migrate->context->engine->i915;
+	struct drm_i915_gem_object *src, *dst;
+	struct i915_request *rq;
+	struct i915_gem_ww_ctx ww;
+	u32 *vaddr;
+	int err = 0;
+	int i;
+
+	src = create_lmem_or_internal(i915, sz);
+	if (IS_ERR(src))
+		return 0;
+
+	dst = i915_gem_object_create_internal(i915, sz);
+	if (IS_ERR(dst))
+		goto err_free_src;
+
+	for_i915_gem_ww(&ww, err, true) {
+		err = i915_gem_object_lock(src, &ww);
+		if (err)
+			continue;
+
+		err = i915_gem_object_lock(dst, &ww);
+		if (err)
+			continue;
+
+		vaddr = i915_gem_object_pin_map(src, I915_MAP_WC);
+		if (IS_ERR(vaddr)) {
+			err = PTR_ERR(vaddr);
+			continue;
+		}
+
+		for (i = 0; i < sz / sizeof(u32); i++)
+			vaddr[i] = i;
+		i915_gem_object_flush_map(src);
+
+		vaddr = i915_gem_object_pin_map(dst, I915_MAP_WC);
+		if (IS_ERR(vaddr)) {
+			err = PTR_ERR(vaddr);
+			goto unpin_src;
+		}
+
+		for (i = 0; i < sz / sizeof(u32); i++)
+			vaddr[i] = ~i;
+		i915_gem_object_flush_map(dst);
+
+		err = fn(migrate, &ww, src, dst, &rq);
+		if (!err)
+			continue;
+
+		if (err != -EDEADLK && err != -EINTR && err != -ERESTARTSYS)
+			pr_err("%ps failed, size: %u\n", fn, sz);
+		if (rq) {
+			i915_request_wait(rq, 0, HZ);
+			i915_request_put(rq);
+		}
+		i915_gem_object_unpin_map(dst);
+unpin_src:
+		i915_gem_object_unpin_map(src);
+	}
+	if (err)
+		goto err_out;
+
+	if (rq) {
+		if (i915_request_wait(rq, 0, HZ) < 0) {
+			pr_err("%ps timed out, size: %u\n", fn, sz);
+			err = -ETIME;
+		}
+		i915_request_put(rq);
+	}
+
+	for (i = 0; !err && i < sz / PAGE_SIZE; i++) {
+		int x = i * 1024 + i915_prandom_u32_max_state(1024, prng);
+
+		if (vaddr[x] != x) {
+			pr_err("%ps failed, size: %u, offset: %zu\n",
+			       fn, sz, x * sizeof(u32));
+			igt_hexdump(vaddr + i * 1024, 4096);
+			err = -EINVAL;
+		}
+	}
+
+	i915_gem_object_unpin_map(dst);
+	i915_gem_object_unpin_map(src);
+
+err_out:
+	i915_gem_object_put(dst);
+err_free_src:
+	i915_gem_object_put(src);
+
+	return err;
+}
+
+static int __migrate_copy(struct intel_migrate *migrate,
+			  struct i915_gem_ww_ctx *ww,
+			  struct drm_i915_gem_object *src,
+			  struct drm_i915_gem_object *dst,
+			  struct i915_request **out)
+{
+	return intel_migrate_copy(migrate, ww, NULL,
+				  src->mm.pages->sgl, src->cache_level,
+				  i915_gem_object_is_lmem(src),
+				  dst->mm.pages->sgl, dst->cache_level,
+				  i915_gem_object_is_lmem(dst),
+				  out);
+}
+
+static int __global_copy(struct intel_migrate *migrate,
+			 struct i915_gem_ww_ctx *ww,
+			 struct drm_i915_gem_object *src,
+			 struct drm_i915_gem_object *dst,
+			 struct i915_request **out)
+{
+	return intel_context_migrate_copy(migrate->context, NULL,
+					  src->mm.pages->sgl, src->cache_level,
+					  i915_gem_object_is_lmem(src),
+					  dst->mm.pages->sgl, dst->cache_level,
+					  i915_gem_object_is_lmem(dst),
+					  out);
+}
+
+static int
+migrate_copy(struct intel_migrate *migrate, u32 sz, struct rnd_state *prng)
+{
+	return copy(migrate, __migrate_copy, sz, prng);
+}
+
+static int
+global_copy(struct intel_migrate *migrate, u32 sz, struct rnd_state *prng)
+{
+	return copy(migrate, __global_copy, sz, prng);
+}
+
+static int live_migrate_copy(void *arg)
+{
+	struct intel_migrate *migrate = arg;
+	struct drm_i915_private *i915 = migrate->context->engine->i915;
+	I915_RND_STATE(prng);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
+		int err;
+
+		err = migrate_copy(migrate, sizes[i], &prng);
+		if (err == 0)
+			err = global_copy(migrate, sizes[i], &prng);
+		i915_gem_drain_freed_objects(i915);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+struct threaded_migrate {
+	struct intel_migrate *migrate;
+	struct task_struct *tsk;
+	struct rnd_state prng;
+};
+
+static int threaded_migrate(struct intel_migrate *migrate,
+			    int (*fn)(void *arg),
+			    unsigned int flags)
+{
+	const unsigned int n_cpus = num_online_cpus() + 1;
+	struct threaded_migrate *thread;
+	I915_RND_STATE(prng);
+	unsigned int i;
+	int err = 0;
+
+	thread = kcalloc(n_cpus, sizeof(*thread), GFP_KERNEL);
+	if (!thread)
+		return 0;
+
+	for (i = 0; i < n_cpus; ++i) {
+		struct task_struct *tsk;
+
+		thread[i].migrate = migrate;
+		thread[i].prng =
+			I915_RND_STATE_INITIALIZER(prandom_u32_state(&prng));
+
+		tsk = kthread_run(fn, &thread[i], "igt-%d", i);
+		if (IS_ERR(tsk)) {
+			err = PTR_ERR(tsk);
+			break;
+		}
+
+		get_task_struct(tsk);
+		thread[i].tsk = tsk;
+	}
+
+	msleep(10); /* start all threads before we kthread_stop() */
+
+	for (i = 0; i < n_cpus; ++i) {
+		struct task_struct *tsk = thread[i].tsk;
+		int status;
+
+		if (IS_ERR_OR_NULL(tsk))
+			continue;
+
+		status = kthread_stop(tsk);
+		if (status && !err)
+			err = status;
+
+		put_task_struct(tsk);
+	}
+
+	kfree(thread);
+	return err;
+}
+
+static int __thread_migrate_copy(void *arg)
+{
+	struct threaded_migrate *tm = arg;
+
+	return migrate_copy(tm->migrate, 2 * CHUNK_SZ, &tm->prng);
+}
+
+static int thread_migrate_copy(void *arg)
+{
+	return threaded_migrate(arg, __thread_migrate_copy, 0);
+}
+
+static int __thread_global_copy(void *arg)
+{
+	struct threaded_migrate *tm = arg;
+
+	return global_copy(tm->migrate, 2 * CHUNK_SZ, &tm->prng);
+}
+
+static int thread_global_copy(void *arg)
+{
+	return threaded_migrate(arg, __thread_global_copy, 0);
+}
+
+int intel_migrate_live_selftests(struct drm_i915_private *i915)
+{
+	static const struct i915_subtest tests[] = {
+		SUBTEST(live_migrate_copy),
+		SUBTEST(thread_migrate_copy),
+		SUBTEST(thread_global_copy),
+	};
+	struct intel_migrate m;
+	int err;
+
+	if (intel_migrate_init(&m, &i915->gt))
+		return 0;
+
+	err = i915_subtests(tests, &m);
+	intel_migrate_fini(&m);
+
+	return err;
+}
diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
index a92c0e9b7e6b..be5e0191eaea 100644
--- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
+++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h
@@ -26,6 +26,7 @@  selftest(gt_mocs, intel_mocs_live_selftests)
 selftest(gt_pm, intel_gt_pm_live_selftests)
 selftest(gt_heartbeat, intel_heartbeat_live_selftests)
 selftest(requests, i915_request_live_selftests)
+selftest(migrate, intel_migrate_live_selftests)
 selftest(active, i915_active_live_selftests)
 selftest(objects, i915_gem_object_live_selftests)
 selftest(mman, i915_gem_mman_live_selftests)

[7/9] drm/i915/gt: Pipelined page migration

Commit Message

Comments

Patch