
drm/i915: Migrate stolen objects before hibernation

Message ID 1435658288-31656-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson June 30, 2015, 9:58 a.m. UTC
Ville reminded us that stolen memory is not preserved across
hibernation, and one result of this was that context objects, now being
allocated from stolen, were being corrupted on S4 and promptly hanging
the GPU on resume.

We want to utilise stolen as much as possible (nothing else will use
that wasted memory otherwise), so we need a strategy for handling
general objects allocated from stolen and hibernation. A simple solution
is to do a CPU copy through the GTT of the stolen object into a fresh
shmemfs backing store and thenceforth treat it as a normal object. This
can be refined in future either to use a GPU copy to avoid the slow
uncached reads (though it's hibernation!) or to recreate stolen objects
upon resume/first-use. For now, a simple approach should suffice for
testing the object migration.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Akash Goel <akash.goel@intel.com>
Cc: Deepak S <deepak.s@linux.intel.com>
Cc: Imre Deak <imre.deak@linux.intel.com>
---
This uses a few constructs that are not upstream yet, but it should be
enough to sanity check my conversion of a stolen object to a shmemfs
object.
---
 drivers/gpu/drm/i915/i915_drv.c         |  17 +++-
 drivers/gpu/drm/i915/i915_drv.h         |   5 +
 drivers/gpu/drm/i915/i915_gem.c         | 156 +++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_gem_context.c |   3 +
 drivers/gpu/drm/i915/i915_gem_stolen.c  |   9 ++
 drivers/gpu/drm/i915/intel_overlay.c    |   3 +
 6 files changed, 180 insertions(+), 13 deletions(-)

Comments

Chris Wilson June 30, 2015, 10:11 a.m. UTC | #1
On Tue, Jun 30, 2015 at 10:58:08AM +0100, Chris Wilson wrote:
> +static int
> +i915_gem_object_migrate_stolen_to_shmemfs(struct drm_i915_gem_object *obj)
> +{
> +	struct drm_i915_private *i915 = to_i915(obj->base.dev);
> +	struct i915_vma *vma, *vn;
> +	struct drm_mm_node node;
> +	struct file *file;
> +	struct address_space *mapping;
> +	int ret, i;
> +
> +	if (WARN_ON(i915_gem_object_needs_bit17_swizzle(obj)))
> +		return -EINVAL;

Missed:

ret = i915_gem_object_set_to_gtt_domain(obj, false);
if (ret)
	return ret;

The unbinding would have been enough to serialise with the GPU, but we
may as well be explicit in our domain management.
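
In context, that reads (a sketch, placed after the swizzle check and
before we start unbinding):

	if (WARN_ON(i915_gem_object_needs_bit17_swizzle(obj)))
		return -EINVAL;

	ret = i915_gem_object_set_to_gtt_domain(obj, false);
	if (ret)
		return ret;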
-Chris
Chris Wilson June 30, 2015, 10:31 a.m. UTC | #2
On Tue, Jun 30, 2015 at 10:58:08AM +0100, Chris Wilson wrote:
> +int
> +i915_gem_freeze(struct drm_device *dev)
> +{
> +	/* Called before freeze when hibernating */
> +	struct drm_i915_private *i915 = to_i915(dev);
> +	struct drm_i915_gem_object *obj, *tmp;
> +	int ret;
> +
> +	/* Across hibernation, the stolen area is not preserved.
> +	 * Anything inside stolen must be copied back to normal
> +	 * memory if we wish to preserve it.
> +	 */

And I forgot to add the double pass + reference/unreference dance here

> +	list_for_each_entry_safe(obj, tmp, &i915->mm.stolen_list, stolen_link) {
> +		if (obj->madv != I915_MADV_WILLNEED) {
> +			/* XXX any point in setting this? The objects for which
> +			 * we preserve never unset it afterwards.
> +			 */
> +			obj->madv = __I915_MADV_PURGED;
> +			continue;
> +		}
> +
> +		ret = i915_gem_object_migrate_stolen_to_shmemfs(obj);
> +		if (ret)
> +			return ret;
> +	}

should be

+	INIT_LIST_HEAD(&migrate);
+	list_for_each_entry_safe(obj, tmp, &i915->mm.stolen_list, stolen_link) {
+		if (obj->madv != I915_MADV_WILLNEED)
+			continue;
+
+		/* In the general case, this object may only be alive due to
+		 * an active reference, and that may disappear when we unbind
+		 * any of the objects (and so wait upon the GPU and retire
+		 * requests). To prevent one of the objects from disappearing
+		 * beneath us, we need to take a reference to each as we
+		 * build the migration list.
+		 *
+		 * This is similar to the strategy required whilst shrinking
+		 * or evicting objects (for the same reason).
+		 */
+		drm_gem_object_reference(&obj->base);
+		list_move(&obj->stolen_link, &migrate);
+	}
+
+	ret = 0;
+	list_for_each_entry_safe(obj, tmp, &migrate, stolen_link) {
+		if (ret == 0)
+			ret = i915_gem_object_migrate_stolen_to_shmemfs(obj);
+		drm_gem_object_unreference(&obj->base);
+	}
+	list_splice(&migrate, &i915->mm.stolen_list);
-Chris
Daniel Vetter June 30, 2015, 10:54 a.m. UTC | #3
On Tue, Jun 30, 2015 at 10:58:08AM +0100, Chris Wilson wrote:
> Ville reminded us that stolen memory is not preserved across
> hibernation, and one result of this was that context objects, now being
> allocated from stolen, were being corrupted on S4 and promptly hanging
> the GPU on resume.
> 
> We want to utilise stolen as much as possible (nothing else will use
> that wasted memory otherwise), so we need a strategy for handling
> general objects allocated from stolen and hibernation. A simple solution
> is to do a CPU copy through the GTT of the stolen object into a fresh
> shmemfs backing store and thenceforth treat it as a normal object. This
> can be refined in future either to use a GPU copy to avoid the slow
> uncached reads (though it's hibernation!) or to recreate stolen objects
> upon resume/first-use. For now, a simple approach should suffice for
> testing the object migration.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
> Cc: Akash Goel <akash.goel@intel.com>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Imre Deak <imre.deak@linux.intel.com>
> ---
> This uses a few constructs that are not upstream yet, but it should be
> enough to sanity check my conversion of a stolen object to a shmemfs
> object.
> ---
>  drivers/gpu/drm/i915/i915_drv.c         |  17 +++-
>  drivers/gpu/drm/i915/i915_drv.h         |   5 +
>  drivers/gpu/drm/i915/i915_gem.c         | 156 +++++++++++++++++++++++++++++---
>  drivers/gpu/drm/i915/i915_gem_context.c |   3 +
>  drivers/gpu/drm/i915/i915_gem_stolen.c  |   9 ++
>  drivers/gpu/drm/i915/intel_overlay.c    |   3 +
>  6 files changed, 180 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c45d5722e987..3ece56a98827 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -972,6 +972,21 @@ static int i915_pm_suspend(struct device *dev)
>  	return i915_drm_suspend(drm_dev);
>  }
>  
> +static int i915_pm_freeze(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = i915_gem_freeze(pci_get_drvdata(to_pci_dev(dev)));
> +	if (ret)
> +		return ret;
> +
> +	ret = i915_pm_suspend(dev);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
>  static int i915_pm_suspend_late(struct device *dev)
>  {
>  	struct drm_device *drm_dev = dev_to_i915(dev)->dev;
> @@ -1618,7 +1633,7 @@ static const struct dev_pm_ops i915_pm_ops = {
>  	 * @restore, @restore_early : called after rebooting and restoring the
>  	 *                            hibernation image [PMSG_RESTORE]
>  	 */
> -	.freeze = i915_pm_suspend,
> +	.freeze = i915_pm_freeze,
>  	.freeze_late = i915_pm_suspend_late,
>  	.thaw_early = i915_pm_resume_early,
>  	.thaw = i915_pm_resume,
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index ea7881367084..ec10f389886e 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1258,6 +1258,9 @@ struct intel_l3_parity {
>  struct i915_gem_mm {
>  	/** Memory allocator for GTT stolen memory */
>  	struct drm_mm stolen;
> +	/** List of all objects allocated from stolen */
> +	struct list_head stolen_list;
> +
>  	/** List of all objects in gtt_space. Used to restore gtt
>  	 * mappings on resume */
>  	struct list_head bound_list;
> @@ -2024,6 +2027,7 @@ struct drm_i915_gem_object {
>  	struct drm_mm_node *stolen;
>  	struct list_head global_list;
>  
> +	struct list_head stolen_link;
>  	struct list_head batch_pool_link;
>  	struct list_head tmp_link;
>  
> @@ -3003,6 +3007,7 @@ int i915_gem_l3_remap(struct drm_i915_gem_request *req, int slice);
>  void i915_gem_init_swizzling(struct drm_device *dev);
>  void i915_gem_cleanup_ringbuffer(struct drm_device *dev);
>  int __must_check i915_gpu_idle(struct drm_device *dev);
> +int __must_check i915_gem_freeze(struct drm_device *dev);
>  int __must_check i915_gem_suspend(struct drm_device *dev);
>  void __i915_add_request(struct drm_i915_gem_request *req,
>  			struct i915_vma *batch,
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 477cd1cb7679..c501e20b7ed0 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -4580,12 +4580,27 @@ static const struct drm_i915_gem_object_ops i915_gem_object_ops = {
>  	.put_pages = i915_gem_object_put_pages_gtt,
>  };
>  
> +static struct address_space *
> +i915_gem_set_inode_gfp(struct drm_device *dev, struct file *file)
> +{
> +	struct address_space *mapping = file_inode(file)->i_mapping;
> +	gfp_t mask;
> +
> +	mask = GFP_HIGHUSER | __GFP_RECLAIMABLE;
> +	if (IS_CRESTLINE(dev) || IS_BROADWATER(dev)) {
> +		/* 965gm cannot relocate objects above 4GiB. */
> +		mask &= ~__GFP_HIGHMEM;
> +		mask |= __GFP_DMA32;
> +	}
> +	mapping_set_gfp_mask(mapping, mask);
> +
> +	return mapping;
> +}
> +
>  struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
>  						  size_t size)
>  {
>  	struct drm_i915_gem_object *obj;
> -	struct address_space *mapping;
> -	gfp_t mask;
>  	int ret;
>  
>  	obj = i915_gem_object_alloc(dev);
> @@ -4597,16 +4612,7 @@ struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
>  		i915_gem_object_free(obj);
>  		return ERR_PTR(ret);
>  	}
> -
> -	mask = GFP_HIGHUSER | __GFP_RECLAIMABLE;
> -	if (IS_CRESTLINE(dev) || IS_BROADWATER(dev)) {
> -		/* 965gm cannot relocate objects above 4GiB. */
> -		mask &= ~__GFP_HIGHMEM;
> -		mask |= __GFP_DMA32;
> -	}
> -
> -	mapping = file_inode(obj->base.filp)->i_mapping;
> -	mapping_set_gfp_mask(mapping, mask);
> +	i915_gem_set_inode_gfp(dev, obj->base.filp);
>  
>  	i915_gem_object_init(obj, &i915_gem_object_ops);
>  
> @@ -4738,6 +4744,132 @@ i915_gem_stop_ringbuffers(struct drm_device *dev)
>  		dev_priv->gt.stop_ring(ring);
>  }
>  
> +static int
> +i915_gem_object_migrate_stolen_to_shmemfs(struct drm_i915_gem_object *obj)
> +{
> +	struct drm_i915_private *i915 = to_i915(obj->base.dev);
> +	struct i915_vma *vma, *vn;
> +	struct drm_mm_node node;
> +	struct file *file;
> +	struct address_space *mapping;
> +	int ret, i;
> +
> +	if (WARN_ON(i915_gem_object_needs_bit17_swizzle(obj)))
> +		return -EINVAL;
> +
> +	list_for_each_entry_safe(vma, vn, &obj->vma_list, obj_link)
> +		if (i915_vma_unbind(vma))
> +			break;
> +
> +	ret = i915_gem_object_pin_pages(obj);
> +	if (ret)
> +		return ret;
> +
> +	memset(&node, 0, sizeof(node));
> +	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
> +						  &node,
> +						  4096, 0, I915_CACHE_NONE,
> +						  0, i915->gtt.mappable_end,
> +						  DRM_MM_SEARCH_DEFAULT,
> +						  DRM_MM_CREATE_DEFAULT);

Hm, I think the plan with stolen is to mostly use it for giant scanout
buffers where we never plan to access them with the gpu. Maybe go with a
per-page loop here instead? You have a low-level pte writing call below
anyway. Would mean we'd need a 1-entry onstack sg_table too, but that
won't hurt.

And performance doesn't matter either because hibernate is seriously slow.

> +	if (ret)
> +		goto err_unpin;
> +
> +	file = shmem_file_setup("drm mm object", obj->base.size, VM_NORESERVE);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto err_unpin;
> +	}
> +
> +	mapping = i915_gem_set_inode_gfp(obj->base.dev, file);
> +	for (i = 0; i < obj->base.size / PAGE_SIZE; i++) {
> +		struct page *page;
> +		void *dst, *src;
> +
> +		wmb();
> +		i915->gtt.base.insert_page(&i915->gtt.base,
> +					   i915_gem_object_get_dma_address(obj, i),
> +					   node.start,
> +					   I915_CACHE_NONE,
> +					   0);
> +		wmb();
> +
> +		page = shmem_read_mapping_page(mapping, i);
> +		if (IS_ERR(page)) {
> +			ret = PTR_ERR(page);
> +			goto err_file;
> +		}
> +
> +		src = io_mapping_map_atomic_wc(i915->gtt.mappable, node.start);
> +		dst = kmap_atomic(page);
> +		memcpy(dst, src, PAGE_SIZE);

memcpy_fromio() to appease sparse, I think.
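
i.e. a minimal sketch, annotating src as __iomem so the copy needs no
casting:

	void *dst;
	void __iomem *src;

	src = io_mapping_map_atomic_wc(i915->gtt.mappable, node.start);
	dst = kmap_atomic(page);
	memcpy_fromio(dst, src, PAGE_SIZE);
	kunmap_atomic(dst);
	io_mapping_unmap_atomic(src);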

> +		kunmap_atomic(dst);
> +		io_mapping_unmap_atomic(src);
> +
> +		page_cache_release(page);
> +	}
> +
> +	wmb();
> +	i915->gtt.base.clear_range(&i915->gtt.base,
> +				   node.start, node.size,
> +				   true);
> +	drm_mm_remove_node(&node);
> +
> +	i915_gem_object_unpin_pages(obj);
> +	ret = i915_gem_object_put_pages(obj);
> +	if (ret) {
> +		fput(file);
> +		return ret;
> +	}
> +
> +	if (obj->ops->release)
> +		obj->ops->release(obj);
> +
> +	obj->base.filp = file;
> +	obj->ops = &i915_gem_object_ops;
> +
> +	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;

maybe wrap up in a shmem_reinit function and reuse in the gem object creation
path for a bit more clarity (and to reduce chances we forget about this).

Wrt igt not sure whether we can do this in general, since it requires a
working swap partition and working initrd support for hibernate. And
hibernate is kinda a disregarded feature anyway, so imo meh. We could do
an igt_system_hibernate_autowake() if we do the test though.

Imo trying to migrate backwards would be serious pain since we'd need to
make sure no shm mmap is around and audit all the other places. Too much
trouble for no gain (userspace can just realloc if needed).

> +
> +	return 0;
> +
> +err_file:
> +	fput(file);
> +err_unpin:
> +	i915_gem_object_unpin_pages(obj);
> +	return ret;
> +}
> +
> +int
> +i915_gem_freeze(struct drm_device *dev)
> +{
> +	/* Called before freeze when hibernating */
> +	struct drm_i915_private *i915 = to_i915(dev);
> +	struct drm_i915_gem_object *obj, *tmp;
> +	int ret;
> +
> +	/* Across hibernation, the stolen area is not preserved.
> +	 * Anything inside stolen must be copied back to normal
> +	 * memory if we wish to preserve it.
> +	 */
> +
> +	list_for_each_entry_safe(obj, tmp, &i915->mm.stolen_list, stolen_link) {
> +		if (obj->madv != I915_MADV_WILLNEED) {
> +			/* XXX any point in setting this? The objects for which
> +			 * we preserve never unset it afterwards.
> +			 */
> +			obj->madv = __I915_MADV_PURGED;
> +			continue;
> +		}
> +
> +		ret = i915_gem_object_migrate_stolen_to_shmemfs(obj);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  int
>  i915_gem_suspend(struct drm_device *dev)
>  {
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 03fa6e87f5ac..01e8fe7ae632 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -244,6 +244,9 @@ i915_gem_alloc_context_obj(struct drm_device *dev, size_t size)
>  	if (IS_ERR_OR_NULL(obj))
>  		return ERR_PTR(-ENOMEM);
>  
> +	/* contexts need to always be preserved across hibernation/swap-out */
> +	obj->madv = I915_MADV_WILLNEED;
> +
>  	/*
>  	 * Try to make the context utilize L3 as well as LLC.
>  	 *
> diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
> index 21ee83bf6307..86e84554e8ae 100644
> --- a/drivers/gpu/drm/i915/i915_gem_stolen.c
> +++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
> @@ -331,6 +331,7 @@ int i915_gem_init_stolen(struct drm_device *dev)
>  	drm_mm_init(&dev_priv->mm.stolen, 0, dev_priv->gtt.stolen_size -
>  		    bios_reserved);
>  
> +	INIT_LIST_HEAD(&dev_priv->mm.stolen_list);
>  	return 0;
>  }
>  
> @@ -387,6 +388,7 @@ static void
>  i915_gem_object_release_stolen(struct drm_i915_gem_object *obj)
>  {
>  	if (obj->stolen) {
> +		list_del(&obj->stolen_link);
>  		drm_mm_remove_node(obj->stolen);
>  		kfree(obj->stolen);
>  		obj->stolen = NULL;
> @@ -419,6 +421,13 @@ _i915_gem_object_create_stolen(struct drm_device *dev,
>  	__i915_gem_object_pin_pages(obj);
>  	obj->has_dma_mapping = true;
>  	obj->stolen = stolen;
> +	list_add(&obj->stolen_link, &to_i915(dev)->mm.stolen_list);
> +
> +	/* By default, treat the contents of stolen as volatile. If the object
> +	 * must be saved across hibernation, then the caller must take
> +	 * action and flag it as WILLNEED.
> +	 */
> +	obj->madv = I915_MADV_DONTNEED;

Won't this interfere with autoreclaim of stolen objects (to make room for
users which really need it like fbc) which are still in use by userspace?
I think we need a new madv flag for "REAP_ON_HIBERNATE" or something
similar. Otherwise this is way too surprising for userspace.

And with the REAP_ON_HIBERNATE we could also try to just purge them if
they're not pinned or similar (otherwise garbage on screen, ugh).

I guess the condition would then be

	if (madv == REAP_ON_HIBERNATE && !pinned)
		purge(obj)
	else
		convert_to_shmem

Also I'd make the REAPING opt-in since hibernate is such a rarely-used
path. And then perhaps only use it with create2.

Thinking about this more maybe igt would be good indeed: Use create2 to
allocate stolen with autoreaping, wrap in framebuffer and use it,
hibernate.

The gpu should be idle by then, so pinned for scanout is the only corner
case I can think of atm.

But overall I think the approach looks good, just a lot of details to get
right as usual.
-Daniel

>  
>  	obj->base.read_domains = I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT;
>  	obj->cache_level = HAS_LLC(dev) ? I915_CACHE_LLC : I915_CACHE_NONE;
> diff --git a/drivers/gpu/drm/i915/intel_overlay.c b/drivers/gpu/drm/i915/intel_overlay.c
> index cebe857f6df2..ec3ab351e9ff 100644
> --- a/drivers/gpu/drm/i915/intel_overlay.c
> +++ b/drivers/gpu/drm/i915/intel_overlay.c
> @@ -1401,6 +1401,9 @@ void intel_setup_overlay(struct drm_device *dev)
>  		goto out_free;
>  	overlay->reg_bo = reg_bo;
>  
> +	/* XXX preserve register values across hibernation */
> +	reg_bo->madv = I915_MADV_WILLNEED;
> +
>  	if (OVERLAY_NEEDS_PHYSICAL(dev)) {
>  		ret = i915_gem_object_attach_phys(reg_bo, PAGE_SIZE);
>  		if (ret) {
> -- 
> 2.1.4
> 
Chris Wilson June 30, 2015, 11:03 a.m. UTC | #4
On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > +	memset(&node, 0, sizeof(node));
> > +	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
> > +						  &node,
> > +						  4096, 0, I915_CACHE_NONE,
> > +						  0, i915->gtt.mappable_end,
> > +						  DRM_MM_SEARCH_DEFAULT,
> > +						  DRM_MM_CREATE_DEFAULT);
> 
> Hm, I think the plan with stolen is to mostly use it for giant scanout
> buffers where we never plan to access them with the gpu. Maybe go with a
> per-page loop here instead? You have a low-level pte writing call below
> anyway. Would mean we'd need a 1-entry onstack sg_table too, but that
> won't hurt.

I'm not understanding. This is a per-page loop (because we don't need to
bind the entire stolen vma into GGTT for copying with the CPU and
thereby increase the risk of failure). Speaking of failure, should
hibernation be interruptible? I guess it is usually called from an
interruptible process context.
-Chris
Chris Wilson June 30, 2015, 11:16 a.m. UTC | #5
On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > +	if (obj->ops->release)
> > +		obj->ops->release(obj);
> > +
> > +	obj->base.filp = file;
> > +	obj->ops = &i915_gem_object_ops;
> > +
> > +	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> > +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> 
> maybe wrap up in a shmem_reinit function and reuse in the gem object creation
> path for a bit more clarity (and to reduce chances we forget about this).

A bit dubious about this. I like the explicit domain handling as it
clearly matches the status of obj->file. The only thing that is then
shared with the normal creation is obj->ops, which is actually set in
i915_gem_object_init(). It's not clear that we actually have semantic
reuse with gem object creation.

> Wrt igt not sure whether we can do this in general, since it requires a
> working swap partition and working initrd support for hibernate. And
> hibernate is kinda a disregarded feature anyway, so imo meh. We could do
> an igt_system_hibernate_autowake() if we do the test though.

I was thinking of just a flush-stolen-to-shmemfs debugfs interface. With
create2 we will be able to create the stolen objects and can use the
existing debugfs files to verify placement.
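
A minimal sketch of such a hook (the name and locking are hypothetical,
not part of this patch; it assumes i915_gem_freeze() is safe to call
under struct_mutex outside the pm path):

	static int i915_stolen_flush_set(void *data, u64 val)
	{
		struct drm_device *dev = data;
		int ret;

		ret = mutex_lock_interruptible(&dev->struct_mutex);
		if (ret)
			return ret;

		/* reuse the hibernate path to push all stolen objects out */
		ret = i915_gem_freeze(dev);
		mutex_unlock(&dev->struct_mutex);
		return ret;
	}
	DEFINE_SIMPLE_ATTRIBUTE(i915_stolen_flush_fops, NULL,
				i915_stolen_flush_set, "%llu\n");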
 
> Imo trying to migrate backwards would be serious pain since we'd need to
> make sure no shm mmap is around and audit all the other places. Too much
> trouble for no gain (userspace can just realloc if needed).

Otoh, that would be a userspace error since part of the stolen ABI is
that CPU mmaps are verboten. The simplest way to do this would be to keep
the obj->base.filp private (so not to change the ABI of initially stolen
objects) and then fix the existing i915_gem_object_get_pages_stolen()
function (which is currently a BUG()) to do the stolen recreation.
However, that introduces the possibility of an error being returned to
userspace, so at that point, to hide the failure, we probably do want to
do a full shmemfs conversion.

Hmm, honestly that seems quite tractable and self-contained.
-Chris
Daniel Vetter June 30, 2015, 11:22 a.m. UTC | #6
On Tue, Jun 30, 2015 at 12:03:44PM +0100, Chris Wilson wrote:
> On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > +	memset(&node, 0, sizeof(node));
> > > +	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
> > > +						  &node,
> > > +						  4096, 0, I915_CACHE_NONE,
> > > +						  0, i915->gtt.mappable_end,
> > > +						  DRM_MM_SEARCH_DEFAULT,
> > > +						  DRM_MM_CREATE_DEFAULT);
> > 
> > Hm, I think the plan with stolen is to mostly use it for giant scanout
> > buffers where we never plan to access them with the gpu. Maybe go with a
> > per-page loop here instead? You have a low-level pte writing call below
> > anyway. Would mean we'd need a 1-entry onstack sg_table too, but that
> > won't hurt.
> 
> I'm not understanding. This is a per-page loop (because we don't need to
> bind the entire stolen vma into GGTT for copying with the CPU and
> thereby increase the risk of failure). Speaking of failure, should
> hibernation be interruptible? I guess it is usually called from an
> interruptible process context.

I was blind and confused by the insert_entries we have in upstream, which
takes a sg_table and hence can only map the full view without some jumping
through hoops. Concern fully addressed already ;-)

Wrt uninterruptible: GPU should be idle already completely (and reset if
something went wrong) so no need for interruptible. Hm, thinking about
this: Do we handle a gpu death only detected in gpu_idle? Nasty igt:
- inject hang, but be very careful to not cause any wait at all
- suspend

BOOM or not?
-Daniel
Chris Wilson June 30, 2015, 11:25 a.m. UTC | #7
On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> The gpu should be idle by then, so pinned for scanout is the only corner
> case I can think of atm.

Hmm. That's a nuisance. But... We can throw away all the VM bindings
that are unpinned, and then rewrite those that are left with the shmemfs
pages.

It's ugly. On migration back we would have to do a similar trick, and we
need to tell a few white lies to get past some of our internal BUG_ON.
However, as the contents are the same and the PTE writes are atomic into
the GGTT, it should be invisible to the user.
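
A sketch of the first half (assuming this tree's vma->pin_count; the
pinned bindings would then have their PTEs rewritten after the copy):

	list_for_each_entry_safe(vma, vn, &obj->vma_list, obj_link) {
		if (vma->pin_count)
			continue;
		WARN_ON(i915_vma_unbind(vma));
	}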

For internally allocated frontbuffers, I was expecting to mark them as
volatile and let them lose their contents across hibernation - because
they will be immediately cleared afterwards (at least so I expect).
-Chris
Chris Wilson June 30, 2015, 11:32 a.m. UTC | #8
On Tue, Jun 30, 2015 at 01:22:59PM +0200, Daniel Vetter wrote:
> On Tue, Jun 30, 2015 at 12:03:44PM +0100, Chris Wilson wrote:
> > On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > > +	memset(&node, 0, sizeof(node));
> > > > +	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
> > > > +						  &node,
> > > > +						  4096, 0, I915_CACHE_NONE,
> > > > +						  0, i915->gtt.mappable_end,
> > > > +						  DRM_MM_SEARCH_DEFAULT,
> > > > +						  DRM_MM_CREATE_DEFAULT);
> > > 
> > > Hm, I think the plan with stolen is to mostly use it for giant scanout
> > > buffers where we never plan to access them with the gpu. Maybe go with a
> > > per-page loop here instead? You have a low-level pte writing call below
> > > anyway. Would mean we'd need a 1-entry onstack sg_table too, but that
> > > won't hurt.
> > 
> > I'm not understanding. This is a per-page loop (because we don't need to
> > bind the entire stolen vma into GGTT for copying with the CPU and
> > thereby increase the risk of failure). Speaking of failure, should
> > hibernation be interruptible? I guess it is usually called from an
> > interruptible process context.
> 
> I was blind and confused by the insert_entries we have in upstream, which
> takes a sg_table and hence can only map the full view without some jumping
> through hoops. Concern fully addressed already ;-)
> 
> Wrt uninterruptible: GPU should be idle already completely (and reset if
> something went wrong) so no need for interruptible.

Note that I put the migration loop before the suspend, i.e. before the
gpu_idle. Partly because I felt the migration has the biggest chance of
failure so should go first, and the gpu idle in suspend is quite
convenient if we do use the GPU for blitting, but mainly because doing
GEM operations after i915_gem_suspend() feels very wrong (there is a
strong possibility that we kick off some work queue or other that must
be idle).

> Hm, thinking about
> this: Do we handle a gpu death only detected in gpu_idle? Nasty igt:
> - inject hang, but be very careful to not cause any wait at all
> - suspend
> 
> BOOM or not?

In my kernels, no boom. GPU hang waiting for idle is business as usual!
In upstream, we have seen suspend/hibernate fail due to an untimely hang
(iirc, usually worked on the second attempt so the bug report in
question was about something else entirely except the logs contained the
hibernate failure).
-Chris
Daniel Vetter June 30, 2015, 11:54 a.m. UTC | #9
On Tue, Jun 30, 2015 at 12:32:38PM +0100, Chris Wilson wrote:
> On Tue, Jun 30, 2015 at 01:22:59PM +0200, Daniel Vetter wrote:
> > On Tue, Jun 30, 2015 at 12:03:44PM +0100, Chris Wilson wrote:
> > > On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > > > +	memset(&node, 0, sizeof(node));
> > > > > +	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
> > > > > +						  &node,
> > > > > +						  4096, 0, I915_CACHE_NONE,
> > > > > +						  0, i915->gtt.mappable_end,
> > > > > +						  DRM_MM_SEARCH_DEFAULT,
> > > > > +						  DRM_MM_CREATE_DEFAULT);
> > > > 
> > > > Hm, I think the plan with stolen is to mostly use it for giant scanout
> > > > buffers where we never plan to access them with the gpu. Maybe go with a
> > > > per-page loop here instead? You have a low-level pte writing call below
> > > > anyway. Would mean we'd need a 1-entry onstack sg_table too, but that
> > > > won't hurt.
> > > 
> > > I'm not understanding. This is a per-page loop (because we don't need to
> > > bind the entire stolen vma into GGTT for copying with the CPU and
> > > thereby increase the risk of failure). Speaking of failure, should
> > > hibernation be interruptible? I guess it is usually called from an
> > > interruptible process context.
> > 
> > I was blind and confused by the insert_entries we have in upstream, which
> > takes a sg_table and hence can only map the full view without some jumping
> > through hoops. Concern fully addressed already ;-)
> > 
> > Wrt uninterruptible: GPU should be idle already completely (and reset if
> > something went wrong) so no need for interruptible.
> 
> Note that I put the migration loop before the suspend, i.e. before the
> gpu_idle. Partly because I felt the migration has the biggest chance of
> failure so should go first, and the gpu idle in suspend is quite
> convenient if we do use the GPU for blitting, but mainly because doing
> GEM operations after i915_gem_suspend() feels very wrong (there is a
> strong possibility that we kick off some work queue or other that must
> be idle).

Hm, conceptually I think this should be part of i915_gem_suspend, but
means we'd get to wire a bool hibernate around a lot. And I also think we
should do this after gpu idle and forget about optimizations like using
the blitter - too much of a slowpath really to matter.

> > Hm, thinking about
> > this: Do we handle a gpu death only detected in gpu_idle? Nasty igt:
> > - inject hang, but be very careful to not cause any wait at all
> > - suspend
> > 
> > BOOM or not?
> 
> In my kernels, no boom. GPU hang waiting for idle is business as usual!
> In upstream, we have seen suspend/hibernate fail due to an untimely hang
> (iirc, usually worked on the second attempt so the bug report in
> question was about something else entirely except the logs contained the
> hibernate failure).

Ok, reality still matches with my expectations then. I'll wish for an igt
and your fix extracted ;-)

Without looking, do you just lock-drop and then retry (and block in the
interruptible acquisition of dev->struct_mutex)?
-Daniel
Daniel Vetter June 30, 2015, noon UTC | #10
On Tue, Jun 30, 2015 at 12:16:53PM +0100, Chris Wilson wrote:
> On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > +	if (obj->ops->release)
> > > +		obj->ops->release(obj);
> > > +
> > > +	obj->base.filp = file;
> > > +	obj->ops = &i915_gem_object_ops;
> > > +
> > > +	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
> > > +	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
> > 
> > maybe wrap up in a shmem_reinit function and reuse in the gem object creation
> > path for a bit more clarity (and to reduce chances we forget about this).
> 
> A bit dubious about this. I like the explicit domain handling as it
> clearly matches the status of obj->file. The only thing that is then
> shared with the normal creation is obj->ops, which is actually set in
> i915_gem_object_init(). It's not clear that we actually have semantic
> reuse with gem object creation.

Yeah was just wondering really, maybe just as a marker. Doesn't sound like
it'd be useful at all.

> > Wrt igt not sure whether we can do this in general, since it requires a
> > working swap partition and working initrd support for hibernate. And
> > hibernate is kinda a disregarded feature anyway, so imo meh. We could do
> > an igt_system_hibernate_autowake() if we do the test though.
> 
> I was thinking of just a flush-stolen-to-shmemfs debugfs interface. With
> create2 we will be able to create the stolen objects and can use the
> existing debugfs files to verify placement.

Hm yeah manually exercising this code would be great for testing indeed.

> > Imo trying to migrate backwards would be serious pain since we'd need to
> > make sure no shm mmap is around and audit all the other places. Too much
> > trouble for no gain (userspace can just realloc if needed).
> 
> Otoh, that would be a userspace error since part of the stolen ABI is
> that CPU mmaps are verboten. The simplest way to do this would be to keep
> the obj->base.filp private (so not to change the ABI of initially stolen
> objects) and then fix the existing i915_gem_object_get_pages_stolen()
> function (which is currently a BUG()) to do the stolen recreation.
> However, that introduces the possibility of an error being returned to
> userspace, so at that point, to hide the failure, we probably do want to
> do a full shmemfs conversion.
> 
> Hmm, honestly that seems quite tractable and self-contained.

Imo your minimal approach here to convert the bo into a shmem backed one
behind the scenes has a lot of appeal - we don't need to audit any ioctls
at all to make sure userspace isn't able to sneak in something nasty. And hence
also no need for negative igt testcases, only for one that exercises the
actual stolen-to-shmem function in a few corner cases (display pinning is
still the only one I can think of). Downside is that once converted it'll
stay converted, but I really don't see the downside of that.
-Daniel
Daniel Vetter June 30, 2015, 12:07 p.m. UTC | #11
On Tue, Jun 30, 2015 at 12:25:54PM +0100, Chris Wilson wrote:
> On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > The gpu should be idle by then, so pinned for scanout is the only corner
> > case I can think of atm.
> 
> Hmm. That's a nuisance. But... We can throw away all the VM bindings
> that are unpinned, and then rewrite those that are left with the shmemfs
> pages.
> 
> It's ugly. On migration back we would have to do a similar trick, and we
> need to tell a few white lies to get past some of our internal BUG_ON.
> However, as the contents are the same and the PTE writes are atomic into
> the GGTT, it should be invisible to the user.
> 
> For internally allocated frontbuffers, I was expecting to mark them as
> volatile and let them lose their contents across hibernation - because
> they will be immediately cleared afterwards (at least so I expect).

New upcoming complications: On some still super-secret upcoming platform
we need special ggtt vmas for scanout, to pin the backing storage into
suitable memory. And because the hw implementation does replace the ggtt
pte with the physical page entry (i.e. resolving dmar and stuff) it
disallows any writes to the ptes. So we can't replace the ggtt ptes while
the vma is pinned at all.

So I guess the scheme would need to be:
- On hibernate migrate the backing storage to shmem.
- On resume we need to repin (need to do that anyway) with the new vma
  settings.

Cheers, Daniel
Chris Wilson June 30, 2015, 12:37 p.m. UTC | #12
On Tue, Jun 30, 2015 at 01:54:14PM +0200, Daniel Vetter wrote:
> Ok, reality still matches with my expectations then. I'll wish for an igt
> and your fix extracted ;-)
> 
> Without looking, do you just lock-drop and then retry (and block in the
> interruptible acquisition of dev->struct_mutex)?

It will have to change behaviour with TDR and partial resets, but the
observation is that with a total GPU reset pending, the current contents
of the active bo are irrelevant. The reset will clobber state and it
doesn't matter if that happens before this ioctl or the next. Therefore
there is only one place (waiting for ring space) where it matters that a
reset is pending, and moving the special case handling there is much
easier than adding the recovery protocol everywhere.
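
i.e. a single check in the ring-space wait, roughly (a sketch;
i915_reset_in_progress() is the existing helper, dev_priv the usual
local):

	/* With a full GPU reset pending, the wait for ring space can
	 * never succeed; everywhere else the reset clobbers the state
	 * regardless of whether it fires before or after the ioctl.
	 */
	if (i915_reset_in_progress(&dev_priv->gpu_error))
		return -EAGAIN;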
-Chris
Chris Wilson June 30, 2015, 12:47 p.m. UTC | #13
On Tue, Jun 30, 2015 at 02:07:41PM +0200, Daniel Vetter wrote:
> On Tue, Jun 30, 2015 at 12:25:54PM +0100, Chris Wilson wrote:
> > On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > The gpu should be idle by then, so pinned for scanout is the only corner
> > > case I can think of atm.
> > 
> > Hmm. That's a nuisance. But... We can throw away all the VM bindings
> > that are unpinned, and then rewrite those that are left with the shmemfs
> > pages.
> > 
> > It's ugly. On migration back we would have to do a similar trick, and we
> > need to tell a few white lies to get past some of our internal BUG_ON.
> > However, as the contents are the same and the PTE writes are atomic into
> > the GGTT, it should be invisible to the user.
> > 
> > For internally allocated frontbuffers, I was expecting to mark them as
> > volatile and let them lose their contents across hibernation - because
> > they will be immediately cleared afterwards (at least so I expect).
> 
> New upcoming complications: On some still super-secret upcoming platform
> we need special ggtt vmas for scanout, to pin the backing storage into
> suitable memory. And because the hw implementation does replace the ggtt
> pte with the physical page entry (i.e. resolving dmar and stuff) it
> disallows any writes to the ptes. So we can't replace the ggtt ptes while
> the vma is pinned at all.

In that scheme, our backing storage is incoherent with the actual
contents. So we don't need the backing storage at all after setting the
PTE and the entire object is volatile (so not saved across hibernate
anyway). We wouldn't want to reserve stolen space for that, just a
custom allocator and discard the pages asap. (Or we have a phys object
scheme that does a lot of copying...)
-Chris
Daniel Vetter July 1, 2015, 12:47 p.m. UTC | #14
On Tue, Jun 30, 2015 at 01:47:06PM +0100, Chris Wilson wrote:
> On Tue, Jun 30, 2015 at 02:07:41PM +0200, Daniel Vetter wrote:
> > On Tue, Jun 30, 2015 at 12:25:54PM +0100, Chris Wilson wrote:
> > > On Tue, Jun 30, 2015 at 12:54:02PM +0200, Daniel Vetter wrote:
> > > > The gpu should be idle by then, so pinned for scanout is the only corner
> > > > case I can think of atm.
> > > 
> > > Hmm. That's a nuisance. But... We can throw away all the VM bindings
> > > that are unpinned, and then rewrite those that are left with the shmemfs
> > > pages.
> > > 
> > > It's ugly. On migration back we would have to do a similar trick, and we
> > > need to tell a few white lies to get past some of our internal BUG_ON.
> > > However, as the contents are the same and the PTE writes are atomic into
> > > the GGTT, it should be invisible to the user.
> > > 
> > > For internally allocated frontbuffers, I was expecting to mark them as
> > > volatile and let them lose their contents across hibernation - because
> > > they will be immediately cleared afterwards (at least so I expect).
> > 
> > New upcoming complications: On some still super-secret upcoming platform
> > we need special ggtt vmas for scanout, to pin the backing storage into
> > suitable memory. And because the hw implementation does replace the ggtt
> > pte with the physical page entry (i.e. resolving dmar and stuff) it
> > disallows any writes to the ptes. So we can't replace the ggtt ptes while
> > the vma is pinned at all.
> 
> In that scheme, our backing storage is incoherent with the actual
> contents. So we don't need the backing storage at all after setting the
> PTE and the entire object is volatile (so not saved across hibernate
> anyway). We wouldn't want to reserve stolen space for that, just a
> custom allocator and discard the pages asap. (Or we have a phys object
> scheme that does a lot of copying...)

The backing storage will still be coherent, it's just that the specific
view used to scan it out can't be touched at all (i.e. not even ptes
rewritten) while in use. Which means you can't rewrite ptes, that's all.

We can still copy the actual backing storage using a new ggtt view
(they'll have different types) and convert to shmem. It's only that
rewriting the ptes for the scanout view can only happen on resume from
hibernate (we need to restore it all ofc again). Your description above
sounded like you wanted to rewrite all ptes right away, which I think isn't
needed and, at least with that fancy platform, not possible - if we have
concurrent writes while we do the conversion there's a bug anyway.
-Daniel
Chris Wilson July 1, 2015, 12:59 p.m. UTC | #15
On Wed, Jul 01, 2015 at 02:47:15PM +0200, Daniel Vetter wrote:
> The backing storage will still be coherent, it's just that the specific
> view used to scan it out can't be touched at all (i.e. not even ptes
> rewritten) while in use. Which means you can't rewrite ptes, that's all.

Hmm, I thought there was going to be some fenced off memory. I was
hoping to avoid allocating any ourselves :)
 
> We can still copy the actual backing storage using a new ggtt view
> (they'll have different types) and convert to shmem. It's only that
> rewriting the ptes for the scanout view can only happen on resume from
> hibernate (we need to restore it all ofc again). Your description above
> sounded like you wanted to rewrite all ptes right away, which I think isn't
> needed and, at least with that fancy platform, not possible - if we have
> concurrent writes while we do the conversion there's a bug anyway.

It's just the difference between making the function a generic
migrate-stolen-to-shmem, and making it peculiar to hibernate.

But yes, since we need to rebind all the vma after hibernate, we can add
a parameter to tell us to skip it during migrate() and expect everything
to come out in the wash (since the stolen memory won't be reused).
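
i.e. something like (a sketch; the parameter name is invented):

	static int
	i915_gem_object_migrate_stolen_to_shmemfs(struct drm_i915_gem_object *obj,
						  bool rebind)

with rebind=false on the hibernate path, relying on resume to rebuild
the GGTT entries.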
-Chris
Daniel Vetter July 1, 2015, 1:49 p.m. UTC | #16
On Wed, Jul 01, 2015 at 01:59:23PM +0100, Chris Wilson wrote:
> On Wed, Jul 01, 2015 at 02:47:15PM +0200, Daniel Vetter wrote:
> > The backing storage will still be coherent, it's just that the specific
> > view used to scan it out can't be touched at all (i.e. not even ptes
> > rewritten) while in use. Which means you can't rewrite ptes, that's all.
> 
> Hmm, I thought there was going to be some fenced off memory. I was
> hoping to avoid allocating any ourselves :)

It works like a cache, but with some funky pinning which needs a separate
view because that's how the hw works.

> > We can still copy the actual backing storage using a new ggtt view
> > (they'll have different types) and convert to shmem. It's only that
> > rewriting the ptes for the scanout view can only happen on resume from
> > hibernate (we need to restore it all ofc again). Your description above
> > sounded like you wanted to rewrite all ptes right away, which I think isn't
> > needed and, at least with that fancy platform, not possible - if we have
> > concurrent writes while we do the conversion there's a bug anyway.
> 
> It's just the difference between making the function a generic
> migrate-stolen-to-shmem, and making it peculiar to hibernate.
> 
> But yes, since we need to rebind all the vma after hibernate, we can add
> a parameter to tell us to skip it during migrate() and expect everything
> to come out in the wash (since the stolen memory won't be reused).

Yeah generic stolen-to-shmem is probably overshooting for this.
-Daniel

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c45d5722e987..3ece56a98827 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -972,6 +972,21 @@  static int i915_pm_suspend(struct device *dev)
 	return i915_drm_suspend(drm_dev);
 }
 
+static int i915_pm_freeze(struct device *dev)
+{
+	int ret;
+
+	ret = i915_gem_freeze(pci_get_drvdata(to_pci_dev(dev)));
+	if (ret)
+		return ret;
+
+	ret = i915_pm_suspend(dev);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 static int i915_pm_suspend_late(struct device *dev)
 {
 	struct drm_device *drm_dev = dev_to_i915(dev)->dev;
@@ -1618,7 +1633,7 @@  static const struct dev_pm_ops i915_pm_ops = {
 	 * @restore, @restore_early : called after rebooting and restoring the
 	 *                            hibernation image [PMSG_RESTORE]
 	 */
-	.freeze = i915_pm_suspend,
+	.freeze = i915_pm_freeze,
 	.freeze_late = i915_pm_suspend_late,
 	.thaw_early = i915_pm_resume_early,
 	.thaw = i915_pm_resume,
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ea7881367084..ec10f389886e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1258,6 +1258,9 @@  struct intel_l3_parity {
 struct i915_gem_mm {
 	/** Memory allocator for GTT stolen memory */
 	struct drm_mm stolen;
+	/** List of all objects allocated from stolen */
+	struct list_head stolen_list;
+
 	/** List of all objects in gtt_space. Used to restore gtt
 	 * mappings on resume */
 	struct list_head bound_list;
@@ -2024,6 +2027,7 @@  struct drm_i915_gem_object {
 	struct drm_mm_node *stolen;
 	struct list_head global_list;
 
+	struct list_head stolen_link;
 	struct list_head batch_pool_link;
 	struct list_head tmp_link;
 
@@ -3003,6 +3007,7 @@  int i915_gem_l3_remap(struct drm_i915_gem_request *req, int slice);
 void i915_gem_init_swizzling(struct drm_device *dev);
 void i915_gem_cleanup_ringbuffer(struct drm_device *dev);
 int __must_check i915_gpu_idle(struct drm_device *dev);
+int __must_check i915_gem_freeze(struct drm_device *dev);
 int __must_check i915_gem_suspend(struct drm_device *dev);
 void __i915_add_request(struct drm_i915_gem_request *req,
 			struct i915_vma *batch,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 477cd1cb7679..c501e20b7ed0 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4580,12 +4580,27 @@  static const struct drm_i915_gem_object_ops i915_gem_object_ops = {
 	.put_pages = i915_gem_object_put_pages_gtt,
 };
 
+static struct address_space *
+i915_gem_set_inode_gfp(struct drm_device *dev, struct file *file)
+{
+	struct address_space *mapping = file_inode(file)->i_mapping;
+	gfp_t mask;
+
+	mask = GFP_HIGHUSER | __GFP_RECLAIMABLE;
+	if (IS_CRESTLINE(dev) || IS_BROADWATER(dev)) {
+		/* 965gm cannot relocate objects above 4GiB. */
+		mask &= ~__GFP_HIGHMEM;
+		mask |= __GFP_DMA32;
+	}
+	mapping_set_gfp_mask(mapping, mask);
+
+	return mapping;
+}
+
 struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
 						  size_t size)
 {
 	struct drm_i915_gem_object *obj;
-	struct address_space *mapping;
-	gfp_t mask;
 	int ret;
 
 	obj = i915_gem_object_alloc(dev);
@@ -4597,16 +4612,7 @@  struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
 		i915_gem_object_free(obj);
 		return ERR_PTR(ret);
 	}
-
-	mask = GFP_HIGHUSER | __GFP_RECLAIMABLE;
-	if (IS_CRESTLINE(dev) || IS_BROADWATER(dev)) {
-		/* 965gm cannot relocate objects above 4GiB. */
-		mask &= ~__GFP_HIGHMEM;
-		mask |= __GFP_DMA32;
-	}
-
-	mapping = file_inode(obj->base.filp)->i_mapping;
-	mapping_set_gfp_mask(mapping, mask);
+	i915_gem_set_inode_gfp(dev, obj->base.filp);
 
 	i915_gem_object_init(obj, &i915_gem_object_ops);
 
@@ -4738,6 +4744,132 @@  i915_gem_stop_ringbuffers(struct drm_device *dev)
 		dev_priv->gt.stop_ring(ring);
 }
 
+static int
+i915_gem_object_migrate_stolen_to_shmemfs(struct drm_i915_gem_object *obj)
+{
+	struct drm_i915_private *i915 = to_i915(obj->base.dev);
+	struct i915_vma *vma, *vn;
+	struct drm_mm_node node;
+	struct file *file;
+	struct address_space *mapping;
+	int ret, i;
+
+	if (WARN_ON(i915_gem_object_needs_bit17_swizzle(obj)))
+		return -EINVAL;
+
+	list_for_each_entry_safe(vma, vn, &obj->vma_list, obj_link)
+		if (i915_vma_unbind(vma))
+			break;
+
+	ret = i915_gem_object_pin_pages(obj);
+	if (ret)
+		return ret;
+
+	memset(&node, 0, sizeof(node));
+	ret = drm_mm_insert_node_in_range_generic(&i915->gtt.base.mm,
+						  &node,
+						  4096, 0, I915_CACHE_NONE,
+						  0, i915->gtt.mappable_end,
+						  DRM_MM_SEARCH_DEFAULT,
+						  DRM_MM_CREATE_DEFAULT);
+	if (ret)
+		goto err_unpin;
+
+	file = shmem_file_setup("drm mm object", obj->base.size, VM_NORESERVE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto err_unpin;
+	}
+
+	mapping = i915_gem_set_inode_gfp(obj->base.dev, file);
+	for (i = 0; i < obj->base.size / PAGE_SIZE; i++) {
+		struct page *page;
+		void *dst, *src;
+
+		wmb();
+		i915->gtt.base.insert_page(&i915->gtt.base,
+					   i915_gem_object_get_dma_address(obj, i),
+					   node.start,
+					   I915_CACHE_NONE,
+					   0);
+		wmb();
+
+		page = shmem_read_mapping_page(mapping, i);
+		if (IS_ERR(page)) {
+			ret = PTR_ERR(page);
+			goto err_file;
+		}
+
+		src = io_mapping_map_atomic_wc(i915->gtt.mappable, node.start);
+		dst = kmap_atomic(page);
+		memcpy(dst, src, PAGE_SIZE);
+		kunmap_atomic(dst);
+		io_mapping_unmap_atomic(src);
+
+		page_cache_release(page);
+	}
+
+	wmb();
+	i915->gtt.base.clear_range(&i915->gtt.base,
+				   node.start, node.size,
+				   true);
+	drm_mm_remove_node(&node);
+
+	i915_gem_object_unpin_pages(obj);
+	ret = i915_gem_object_put_pages(obj);
+	if (ret) {
+		fput(file);
+		return ret;
+	}
+
+	if (obj->ops->release)
+		obj->ops->release(obj);
+
+	obj->base.filp = file;
+	obj->ops = &i915_gem_object_ops;
+
+	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
+	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
+
+	return 0;
+
+err_file:
+	fput(file);
+err_unpin:
+	i915_gem_object_unpin_pages(obj);
+	return ret;
+}
+
+int
+i915_gem_freeze(struct drm_device *dev)
+{
+	/* Called before freeze when hibernating */
+	struct drm_i915_private *i915 = to_i915(dev);
+	struct drm_i915_gem_object *obj, *tmp;
+	int ret;
+
+	/* Across hibernation, the stolen area is not preserved.
+	 * Anything inside stolen must be copied back to normal
+	 * memory if we wish to preserve it.
+	 */
+
+	list_for_each_entry_safe(obj, tmp, &i915->mm.stolen_list, stolen_link) {
+		if (obj->madv != I915_MADV_WILLNEED) {
+			/* XXX any point in setting this? The objects for which
+			 * we preserve never unset it afterwards.
+			 */
+			obj->madv = __I915_MADV_PURGED;
+			continue;
+		}
+
+		ret = i915_gem_object_migrate_stolen_to_shmemfs(obj);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 int
 i915_gem_suspend(struct drm_device *dev)
 {
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 03fa6e87f5ac..01e8fe7ae632 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -244,6 +244,9 @@  i915_gem_alloc_context_obj(struct drm_device *dev, size_t size)
 	if (IS_ERR_OR_NULL(obj))
 		return ERR_PTR(-ENOMEM);
 
+	/* contexts need to always be preserved across hibernation/swap-out */
+	obj->madv = I915_MADV_WILLNEED;
+
 	/*
 	 * Try to make the context utilize L3 as well as LLC.
 	 *
diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index 21ee83bf6307..86e84554e8ae 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -331,6 +331,7 @@  int i915_gem_init_stolen(struct drm_device *dev)
 	drm_mm_init(&dev_priv->mm.stolen, 0, dev_priv->gtt.stolen_size -
 		    bios_reserved);
 
+	INIT_LIST_HEAD(&dev_priv->mm.stolen_list);
 	return 0;
 }
 
@@ -387,6 +388,7 @@  static void
 i915_gem_object_release_stolen(struct drm_i915_gem_object *obj)
 {
 	if (obj->stolen) {
+		list_del(&obj->stolen_link);
 		drm_mm_remove_node(obj->stolen);
 		kfree(obj->stolen);
 		obj->stolen = NULL;
@@ -419,6 +421,13 @@  _i915_gem_object_create_stolen(struct drm_device *dev,
 	__i915_gem_object_pin_pages(obj);
 	obj->has_dma_mapping = true;
 	obj->stolen = stolen;
+	list_add(&obj->stolen_link, &to_i915(dev)->mm.stolen_list);
+
+	/* By default, treat the contents of stolen as volatile. If the object
+	 * must be saved across hibernation, then the caller must take
+	 * action and flag it as WILLNEED.
+	 */
+	obj->madv = I915_MADV_DONTNEED;
 
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT;
 	obj->cache_level = HAS_LLC(dev) ? I915_CACHE_LLC : I915_CACHE_NONE;
diff --git a/drivers/gpu/drm/i915/intel_overlay.c b/drivers/gpu/drm/i915/intel_overlay.c
index cebe857f6df2..ec3ab351e9ff 100644
--- a/drivers/gpu/drm/i915/intel_overlay.c
+++ b/drivers/gpu/drm/i915/intel_overlay.c
@@ -1401,6 +1401,9 @@  void intel_setup_overlay(struct drm_device *dev)
 		goto out_free;
 	overlay->reg_bo = reg_bo;
 
+	/* XXX preserve register values across hibernation */
+	reg_bo->madv = I915_MADV_WILLNEED;
+
 	if (OVERLAY_NEEDS_PHYSICAL(dev)) {
 		ret = i915_gem_object_attach_phys(reg_bo, PAGE_SIZE);
 		if (ret) {