diff mbox series

[v6,10/12] drm/ttm: Use fault-injection to test error paths

Message ID 20240703153813.182001-11-thomas.hellstrom@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series TTM shrinker helpers and xe buffer object shrinker | expand

Commit Message

Thomas Hellström July 3, 2024, 3:38 p.m. UTC
Use fault-injection to test partial TTM swapout and interrupted swapin.
Return -EINTR for swapin to test the callers ability to handle and
restart the swapin, and on swapout perform a partial swapout to test that
the swapin and release_shrunken functionality.

Cc: Christian König <christian.koenig@amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <dri-devel@lists.freedesktop.org>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/Kconfig        | 10 ++++++++++
 drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

Comments

Matthew Brost Aug. 7, 2024, 11:43 p.m. UTC | #1
On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> Use fault-injection to test partial TTM swapout and interrupted swapin.
> Return -EINTR for swapin to test the callers ability to handle and
> restart the swapin, and on swapout perform a partial swapout to test that
> the swapin and release_shrunken functionality.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  drivers/gpu/drm/Kconfig        | 10 ++++++++++
>  drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index fd0749c0c630..9f27271bfab8 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -272,6 +272,16 @@ config DRM_GPUVM
>  	  GPU-VM representation providing helpers to manage a GPUs virtual
>  	  address space
>  
> +config DRM_TTM_BACKUP_FAULT_INJECT
> +	bool "Enable fault injection during TTM backup"
> +	depends on DRM_TTM
> +	default n
> +	help
> +	  Inject recoverable failures during TTM backup and recovery of
> +	  backed-up objects. For DRM driver developers only.
> +
> +	  If in doubt, choose N.
> +
>  config DRM_BUDDY
>  	tristate
>  	depends on DRM
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 38e50cf81b0a..d32a1f2e5e50 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
>  			       struct ttm_backup *backup,
>  			       struct ttm_operation_ctx *ctx)
>  {
> +	static unsigned long __maybe_unused swappedin;
>  	unsigned int i, nr = 1 << restore->order;
>  	int ret = 0;
>  
> @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
>  			if (handle == 0)
>  				continue;
>  
> +			if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> +			    ctx->interruptible &&
> +			    ++swappedin % 100 == 0) {
> +				ret = -EINTR;
> +				break;
> +			}

So here this -EINTR would be kicked to the user IOCTL which triggered
the BO validate and retry? The restore then should be able to
successfully pick up where it left off?

> +
>  			ret = backup->ops->copy_backed_up_page
>  				(backup, restore->first_page[i],
>  				 handle, ctx->interruptible);
> @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge,
>  
>  	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
>  
> -	for (i = 0; i < ttm->num_pages; ++i) {
> +	num_pages = ttm->num_pages;
> +
> +	/* Pretend doing fault injection by shrinking only half of the pages. */
> +
> +	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> +		num_pages = DIV_ROUND_UP(num_pages, 2);

So what happens here? Half the pages swapped out, then upon restore half
swapped back in? The shrinker continues to walk until enough pages
swapped out?

Matt

> +
> +	for (i = 0; i < num_pages; ++i) {
>  		page = ttm->pages[i];
>  		if (unlikely(!page))
>  			continue;
> -- 
> 2.44.0
>
Thomas Hellström Aug. 9, 2024, 1:53 p.m. UTC | #2
On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote:
> On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> > Use fault-injection to test partial TTM swapout and interrupted
> > swapin.
> > Return -EINTR for swapin to test the callers ability to handle and
> > restart the swapin, and on swapout perform a partial swapout to
> > test that
> > the swapin and release_shrunken functionality.
> > 
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  drivers/gpu/drm/Kconfig        | 10 ++++++++++
> >  drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
> >  2 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > index fd0749c0c630..9f27271bfab8 100644
> > --- a/drivers/gpu/drm/Kconfig
> > +++ b/drivers/gpu/drm/Kconfig
> > @@ -272,6 +272,16 @@ config DRM_GPUVM
> >  	  GPU-VM representation providing helpers to manage a GPUs
> > virtual
> >  	  address space
> >  
> > +config DRM_TTM_BACKUP_FAULT_INJECT
> > +	bool "Enable fault injection during TTM backup"
> > +	depends on DRM_TTM
> > +	default n
> > +	help
> > +	  Inject recoverable failures during TTM backup and
> > recovery of
> > +	  backed-up objects. For DRM driver developers only.
> > +
> > +	  If in doubt, choose N.
> > +
> >  config DRM_BUDDY
> >  	tristate
> >  	depends on DRM
> > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > b/drivers/gpu/drm/ttm/ttm_pool.c
> > index 38e50cf81b0a..d32a1f2e5e50 100644
> > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct
> > ttm_pool_tt_restore *restore,
> >  			       struct ttm_backup *backup,
> >  			       struct ttm_operation_ctx *ctx)
> >  {
> > +	static unsigned long __maybe_unused swappedin;
> >  	unsigned int i, nr = 1 << restore->order;
> >  	int ret = 0;
> >  
> > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct
> > ttm_pool_tt_restore *restore,
> >  			if (handle == 0)
> >  				continue;
> >  
> > +			if
> > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> > +			    ctx->interruptible &&
> > +			    ++swappedin % 100 == 0) {
> > +				ret = -EINTR;
> > +				break;
> > +			}
> 
> So here this -EINTR would be kicked to the user IOCTL which triggered
> the BO validate and retry? The restore then should be able to
> successfully pick up where it left off?

Yes, that's the point. For the direct swap-cache backend I initially
used (before concluding that the shmem one actually seemed to work
fine), we had an interruptible wait here.

Supporting interrupts is generally a good thing but for the pool code,
this makes the code considerably more complicated. However, this is a
good way to ensure drivers actually support -EINTR for the call chain.
If not, adding interrupt capability "later" will most likely be a PITA.

> 
> > +
> >  			ret = backup->ops->copy_backed_up_page
> >  				(backup, restore->first_page[i],
> >  				 handle, ctx->interruptible);
> > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool,
> > struct ttm_tt *ttm, bool purge,
> >  
> >  	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN |
> > __GFP_RETRY_MAYFAIL;
> >  
> > -	for (i = 0; i < ttm->num_pages; ++i) {
> > +	num_pages = ttm->num_pages;
> > +
> > +	/* Pretend doing fault injection by shrinking only half of
> > the pages. */
> > +
> > +	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> > +		num_pages = DIV_ROUND_UP(num_pages, 2);
> 
> So what happens here? Half the pages swapped out, then upon restore
> half
> swapped back in? The shrinker continues to walk until enough pages
> swapped out?

Yes, exactly. Ideally we'd want some intermediate state here so that a
partially swapped out bo is still eligible for further shrinking.

/Thomas


> 
> Matt
> 
> > +
> > +	for (i = 0; i < num_pages; ++i) {
> >  		page = ttm->pages[i];
> >  		if (unlikely(!page))
> >  			continue;
> > -- 
> > 2.44.0
> >
Matthew Brost Aug. 9, 2024, 4:40 p.m. UTC | #3
On Fri, Aug 09, 2024 at 03:53:20PM +0200, Thomas Hellström wrote:
> On Wed, 2024-08-07 at 23:43 +0000, Matthew Brost wrote:
> > On Wed, Jul 03, 2024 at 05:38:11PM +0200, Thomas Hellström wrote:
> > > Use fault-injection to test partial TTM swapout and interrupted
> > > swapin.
> > > Return -EINTR for swapin to test the callers ability to handle and
> > > restart the swapin, and on swapout perform a partial swapout to
> > > test that
> > > the swapin and release_shrunken functionality.
> > > 
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/Kconfig        | 10 ++++++++++
> > >  drivers/gpu/drm/ttm/ttm_pool.c | 17 ++++++++++++++++-
> > >  2 files changed, 26 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > > index fd0749c0c630..9f27271bfab8 100644
> > > --- a/drivers/gpu/drm/Kconfig
> > > +++ b/drivers/gpu/drm/Kconfig
> > > @@ -272,6 +272,16 @@ config DRM_GPUVM
> > >  	  GPU-VM representation providing helpers to manage a GPUs
> > > virtual
> > >  	  address space
> > >  
> > > +config DRM_TTM_BACKUP_FAULT_INJECT
> > > +	bool "Enable fault injection during TTM backup"
> > > +	depends on DRM_TTM
> > > +	default n
> > > +	help
> > > +	  Inject recoverable failures during TTM backup and
> > > recovery of
> > > +	  backed-up objects. For DRM driver developers only.
> > > +
> > > +	  If in doubt, choose N.
> > > +
> > >  config DRM_BUDDY
> > >  	tristate
> > >  	depends on DRM
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> > > b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index 38e50cf81b0a..d32a1f2e5e50 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -431,6 +431,7 @@ static int ttm_pool_restore_tt(struct
> > > ttm_pool_tt_restore *restore,
> > >  			       struct ttm_backup *backup,
> > >  			       struct ttm_operation_ctx *ctx)
> > >  {
> > > +	static unsigned long __maybe_unused swappedin;
> > >  	unsigned int i, nr = 1 << restore->order;
> > >  	int ret = 0;
> > >  
> > > @@ -446,6 +447,13 @@ static int ttm_pool_restore_tt(struct
> > > ttm_pool_tt_restore *restore,
> > >  			if (handle == 0)
> > >  				continue;
> > >  
> > > +			if
> > > (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
> > > +			    ctx->interruptible &&
> > > +			    ++swappedin % 100 == 0) {
> > > +				ret = -EINTR;
> > > +				break;
> > > +			}
> > 
> > So here this -EINTR would be kicked to the user IOCTL which triggered
> > the BO validate and retry? The restore then should be able to
> > successfully pick up where it left off?
> 
> Yes, that's the point. For the direct swap-cache backend I initially
> used (before concluding that the shmem one actually seemed to work
> fine), we had an interruptible wait here.
> 
> Supporting interrupts is generally a good thing but for the pool code,
> this makes the code considerably more complicated. However, this is a
> good way to ensure drivers actually support -EINTR for the call chain.
> If not, adding interrupt capability "later" will most likely be a PITA.
> 
> > 
> > > +
> > >  			ret = backup->ops->copy_backed_up_page
> > >  				(backup, restore->first_page[i],
> > >  				 handle, ctx->interruptible);
> > > @@ -892,7 +900,14 @@ long ttm_pool_backup_tt(struct ttm_pool *pool,
> > > struct ttm_tt *ttm, bool purge,
> > >  
> > >  	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN |
> > > __GFP_RETRY_MAYFAIL;
> > >  
> > > -	for (i = 0; i < ttm->num_pages; ++i) {
> > > +	num_pages = ttm->num_pages;
> > > +
> > > +	/* Pretend doing fault injection by shrinking only half of
> > > the pages. */
> > > +
> > > +	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
> > > +		num_pages = DIV_ROUND_UP(num_pages, 2);
> > 
> > So what happens here? Half the pages swapped out, then upon restore
> > half
> > swapped back in? The shrinker continues to walk until enough pages
> > swapped out?
> 
> Yes, exactly. Ideally we'd want some intermediate state here so that a
> partially swapped out bo is still eligible for further shrinking.
> 

Cool, glad my understanding is correct. Having error injection is always
a good idea to ensure error paths / corner cases work rather than
finding they blow up much later when these cases somehow get triggered.

With that:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> /Thomas
> 
> 
> > 
> > Matt
> > 
> > > +
> > > +	for (i = 0; i < num_pages; ++i) {
> > >  		page = ttm->pages[i];
> > >  		if (unlikely(!page))
> > >  			continue;
> > > -- 
> > > 2.44.0
> > > 
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index fd0749c0c630..9f27271bfab8 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -272,6 +272,16 @@  config DRM_GPUVM
 	  GPU-VM representation providing helpers to manage a GPUs virtual
 	  address space
 
+config DRM_TTM_BACKUP_FAULT_INJECT
+	bool "Enable fault injection during TTM backup"
+	depends on DRM_TTM
+	default n
+	help
+	  Inject recoverable failures during TTM backup and recovery of
+	  backed-up objects. For DRM driver developers only.
+
+	  If in doubt, choose N.
+
 config DRM_BUDDY
 	tristate
 	depends on DRM
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 38e50cf81b0a..d32a1f2e5e50 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -431,6 +431,7 @@  static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
 			       struct ttm_backup *backup,
 			       struct ttm_operation_ctx *ctx)
 {
+	static unsigned long __maybe_unused swappedin;
 	unsigned int i, nr = 1 << restore->order;
 	int ret = 0;
 
@@ -446,6 +447,13 @@  static int ttm_pool_restore_tt(struct ttm_pool_tt_restore *restore,
 			if (handle == 0)
 				continue;
 
+			if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT) &&
+			    ctx->interruptible &&
+			    ++swappedin % 100 == 0) {
+				ret = -EINTR;
+				break;
+			}
+
 			ret = backup->ops->copy_backed_up_page
 				(backup, restore->first_page[i],
 				 handle, ctx->interruptible);
@@ -892,7 +900,14 @@  long ttm_pool_backup_tt(struct ttm_pool *pool, struct ttm_tt *ttm, bool purge,
 
 	alloc_gfp = GFP_KERNEL | __GFP_HIGH | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
 
-	for (i = 0; i < ttm->num_pages; ++i) {
+	num_pages = ttm->num_pages;
+
+	/* Pretend doing fault injection by shrinking only half of the pages. */
+
+	if (IS_ENABLED(CONFIG_DRM_TTM_BACKUP_FAULT_INJECT))
+		num_pages = DIV_ROUND_UP(num_pages, 2);
+
+	for (i = 0; i < num_pages; ++i) {
 		page = ttm->pages[i];
 		if (unlikely(!page))
 			continue;