Message ID | 20240703153813.182001-11-thomas.hellstrom@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | TTM shrinker helpers and xe buffer object shrinker | expand |
On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote: > Use fault-injection to test partial TTM swapout and interrupted swapin. > Return -EINTR for swapin to test the callers ability to handle and > restart the swapin, and on swapout perform a partial swapout to test that > the swapin and release_shrunken functionality. > > Cc: Christian König <christian.koenig@amd.com> > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > Cc: Matthew Brost <matthew.brost@intel.com> > Cc: <dri-devel@lists.freedesktop.org> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > --- > drivers/gpu/drm/Kconfig | 10 ++++++++++ > drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++- > 2 files changed, 26 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig > index fd0749c0c630..9f27271bfab8 100644 > --- a/drivers/gpu/drm/Kconfig > +++ b/drivers/gpu/drm/Kconfig > @@ -272,6 +272,16 @@ config DRM_GPUVM > GPU-VM representation providing helpers to manage a GPUs virtual > address space > > +config DRM_TTM_BACKUP_FAULT_INJECT > + bool "Enable fault injection during TTM backup" > + depends on DRM_TTM > + default n > + help > + Inject recoverable failures during TTM backup and recovery of > + backed-up objects. For DRM driver developers only. > + > + If in doubt, choose N. > + > config DRM_BUDDY > tristate > depends on DRM > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c > index 38e50cf81b0a..d32a1f2e5e50 100644 > --- a/drivers/gpu/drm/ttm/ttm_pool.c > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore, > struct ttm_backup *backup, > struct ttm_operation_ctx *ctx) > { > + static unsigned long __maybe_unused swappedin; > unsigned int i, nr = 1 << restore->order; > int ret = 0; > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore, > if (handle == 0) > continue; > > + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) && > + ctx->interruptible && > + ++swappedin % 100 == 0) { > + ret = -EINTR; > + break; > + } So here this -EINTR would be kicked to the user IOCTL which triggered the BO validate and retry? The restore then should be able to successfully pick up where it left off? > + > ret = backup->ops->copy_backed_up_page > (backup, restore->first_page[i], > handle, ctx->interruptible); > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge, > > alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL; > > - for (i = 0; i < ttm->num_pages; ++i) { > + num_pages = ttm->num_pages; > + > + /* Pretend doing fault injection by shrinking only half of the pages. */ > + > + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT)) > + num_pages = DIV_ROUND_UP(num_pages, 2); So what happens here? Half the pages swapped out, then upon restore half swapped back in? The shrinker continues to walk until enough pages swapped out? Matt > + > + for (i = 0; i < num_pages; ++i) { > page = ttm->pages[i]; > if (unlikely(!page)) > continue; > -- > 2.44.0 >
On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote: > On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote: > > Use fault-injection to test partial TTM swapout and interrupted > > swapin. > > Return -EINTR for swapin to test the callers ability to handle and > > restart the swapin, and on swapout perform a partial swapout to > > test that > > the swapin and release_shrunken functionality. > > > > Cc: Christian König <christian.koenig@amd.com> > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > > Cc: Matthew Brost <matthew.brost@intel.com> > > Cc: <dri-devel@lists.freedesktop.org> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > --- > > drivers/gpu/drm/Kconfig | 10 ++++++++++ > > drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++- > > 2 files changed, 26 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig > > index fd0749c0c630..9f27271bfab8 100644 > > --- a/drivers/gpu/drm/Kconfig > > +++ b/drivers/gpu/drm/Kconfig > > @@ -272,6 +272,16 @@ config DRM_GPUVM > > GPU-VM representation providing helpers to manage a GPUs > > virtual > > address space > > > > +config DRM_TTM_BACKUP_FAULT_INJECT > > + bool "Enable fault injection during TTM backup" > > + depends on DRM_TTM > > + default n > > + help > > + Inject recoverable failures during TTM backup and > > recovery of > > + backed-up objects. For DRM driver developers only. > > + > > + If in doubt, choose N. > > + > > config DRM_BUDDY > > tristate > > depends on DRM > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > b/drivers/gpu/drm/ttm/ttm_pool.c > > index 38e50cf81b0a..d32a1f2e5e50 100644 > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct > > ttm_pool_tt_restore *restore, > > struct ttm_backup *backup, > > struct ttm_operation_ctx *ctx) > > { > > + static unsigned long __maybe_unused swappedin; > > unsigned int i, nr = 1 << restore->order; > > int ret = 0; > > > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct > > ttm_pool_tt_restore *restore, > > if (handle == 0) > > continue; > > > > + if > > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) && > > + ctx->interruptible && > > + ++swappedin % 100 == 0) { > > + ret = -EINTR; > > + break; > > + } > > So here this -EINTR would be kicked to the user IOCTL which triggered > the BO validate and retry? The restore then should be able to > successfully pick up where it left off? Yes, that's the point. For the direct swap-cache backend I initially used (before concluding that the shmem one actually seemed to work fine), we had an interruptible wait here. Supporting interrupts is generally a good thing but for the pool code, this makes the code considerably more complicated. However, this is a good way to ensure drivers actually support -EINTR for the call chain. If not, adding interrupt capability "later" will most likely be a PITA. > > > + > > ret = backup->ops->copy_backed_up_page > > (backup, restore->first_page[i], > > handle, ctx->interruptible); > > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, > > struct ttm_tt *ttm, bool purge, > > > > alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | > > __GFP_RETRY_MAYFAIL; > > > > - for (i = 0; i < ttm->num_pages; ++i) { > > + num_pages = ttm->num_pages; > > + > > + /* Pretend doing fault injection by shrinking only half of > > the pages. */ > > + > > + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT)) > > + num_pages = DIV_ROUND_UP(num_pages, 2); > > So what happens here? Half the pages swapped out, then upon restore > half > swapped back in? The shrinker continues to walk until enough pages > swapped out? Yes, exactly. Ideally we'd want some intermediate state here so that a partially swapped out bo is still eligible for further shrinking. /Thomas > > Matt > > > + > > + for (i = 0; i < num_pages; ++i) { > > page = ttm->pages[i]; > > if (unlikely(!page)) > > continue; > > -- > > 2.44.0 > >
On Fri, Aug 09, 2024 at 03:53:20PM +0200, Thomas Hellström wrote: > On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote: > > On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote: > > > Use fault-injection to test partial TTM swapout and interrupted > > > swapin. > > > Return -EINTR for swapin to test the callers ability to handle and > > > restart the swapin, and on swapout perform a partial swapout to > > > test that > > > the swapin and release_shrunken functionality. > > > > > > Cc: Christian König <christian.koenig@amd.com> > > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> > > > Cc: Matthew Brost <matthew.brost@intel.com> > > > Cc: <dri-devel@lists.freedesktop.org> > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > --- > > > drivers/gpu/drm/Kconfig | 10 ++++++++++ > > > drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++- > > > 2 files changed, 26 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig > > > index fd0749c0c630..9f27271bfab8 100644 > > > --- a/drivers/gpu/drm/Kconfig > > > +++ b/drivers/gpu/drm/Kconfig > > > @@ -272,6 +272,16 @@ config DRM_GPUVM > > > GPU-VM representation providing helpers to manage a GPUs > > > virtual > > > address space > > > > > > +config DRM_TTM_BACKUP_FAULT_INJECT > > > + bool "Enable fault injection during TTM backup" > > > + depends on DRM_TTM > > > + default n > > > + help > > > + Inject recoverable failures during TTM backup and > > > recovery of > > > + backed-up objects. For DRM driver developers only. > > > + > > > + If in doubt, choose N. > > > + > > > config DRM_BUDDY > > > tristate > > > depends on DRM > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c > > > b/drivers/gpu/drm/ttm/ttm_pool.c > > > index 38e50cf81b0a..d32a1f2e5e50 100644 > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > > > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct > > > ttm_pool_tt_restore *restore, > > > struct ttm_backup *backup, > > > struct ttm_operation_ctx *ctx) > > > { > > > + static unsigned long __maybe_unused swappedin; > > > unsigned int i, nr = 1 << restore->order; > > > int ret = 0; > > > > > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct > > > ttm_pool_tt_restore *restore, > > > if (handle == 0) > > > continue; > > > > > > + if > > > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) && > > > + ctx->interruptible && > > > + ++swappedin % 100 == 0) { > > > + ret = -EINTR; > > > + break; > > > + } > > > > So here this -EINTR would be kicked to the user IOCTL which triggered > > the BO validate and retry? The restore then should be able to > > successfully pick up where it left off? > > Yes, that's the point. For the direct swap-cache backend I initially > used (before concluding that the shmem one actually seemed to work > fine), we had an interruptible wait here. > > Supporting interrupts is generally a good thing but for the pool code, > this makes the code considerably more complicated. However, this is a > good way to ensure drivers actually support -EINTR for the call chain. > If not, adding interrupt capability "later" will most likely be a PITA. > > > > > > + > > > ret = backup->ops->copy_backed_up_page > > > (backup, restore->first_page[i], > > > handle, ctx->interruptible); > > > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, > > > struct ttm_tt *ttm, bool purge, > > > > > > alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | > > > __GFP_RETRY_MAYFAIL; > > > > > > - for (i = 0; i < ttm->num_pages; ++i) { > > > + num_pages = ttm->num_pages; > > > + > > > + /* Pretend doing fault injection by shrinking only half of > > > the pages. */ > > > + > > > + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT)) > > > + num_pages = DIV_ROUND_UP(num_pages, 2); > > > > So what happens here? Half the pages swapped out, then upon restore > > half > > swapped back in? The shrinker continues to walk until enough pages > > swapped out? > > Yes, exactly. Ideally we'd want some intermediate state here so that a > partially swapped out bo is still eligible for further shrinking. > Cool, glad my understanding is correct. Having error injection is always a good idea to ensure error paths / corner cases work rather than finding they blow up much later when these cases somehow get triggered. With that: Reviewed-by: Matthew Brost <matthew.brost@intel.com> > /Thomas > > > > > > Matt > > > > > + > > > + for (i = 0; i < num_pages; ++i) { > > > page = ttm->pages[i]; > > > if (unlikely(!page)) > > > continue; > > > -- > > > 2.44.0 > > > >
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index fd0749c0c630..9f27271bfab8 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -272,6 +272,16 @@ config DRM_GPUVM GPU-VM representation providing helpers to manage a GPUs virtual address space +config DRM_TTM_BACKUP_FAULT_INJECT + bool "Enable fault injection during TTM backup" + depends on DRM_TTM + default n + help + Inject recoverable failures during TTM backup and recovery of + backed-up objects. For DRM driver developers only. + + If in doubt, choose N. + config DRM_BUDDY tristate depends on DRM diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index 38e50cf81b0a..d32a1f2e5e50 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore, struct ttm_backup *backup, struct ttm_operation_ctx *ctx) { + static unsigned long __maybe_unused swappedin; unsigned int i, nr = 1 << restore->order; int ret = 0; @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore, if (handle == 0) continue; + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) && + ctx->interruptible && + ++swappedin % 100 == 0) { + ret = -EINTR; + break; + } + ret = backup->ops->copy_backed_up_page (backup, restore->first_page[i], handle, ctx->interruptible); @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge, alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL; - for (i = 0; i < ttm->num_pages; ++i) { + num_pages = ttm->num_pages; + + /* Pretend doing fault injection by shrinking only half of the pages. */ + + if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT)) + num_pages = DIV_ROUND_UP(num_pages, 2); + + for (i = 0; i < num_pages; ++i) { page = ttm->pages[i]; if (unlikely(!page)) continue;
Use fault-injection to test partial TTM swapout and interrupted swapin. Return -EINTR for swapin to test the callers ability to handle and restart the swapin, and on swapout perform a partial swapout to test that the swapin and release_shrunken functionality. Cc: Christian König <christian.koenig@amd.com> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <dri-devel@lists.freedesktop.org> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> --- drivers/gpu/drm/Kconfig | 10 ++++++++++ drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++- 2 files changed, 26 insertions(+), 1 deletion(-)