
[v3,3/8] drm/i915: Add a new "remapped" gtt_view

Message ID 20180925193714.25280-4-ville.syrjala@linux.intel.com (mailing list archive)
State New, archived
Series drm/i915: GTT remapping for display

Commit Message

Ville Syrjälä Sept. 25, 2018, 7:37 p.m. UTC
From: Ville Syrjälä <ville.syrjala@linux.intel.com>

To overcome display engine stride limits we'll want to remap the
pages in the GTT. To that end we need a new gtt_view type which
is just like the "rotated" type except not rotated.

v2: Use intel_remapped_plane_info base type
    s/unused/unused_mbz/ (Chris)
    Separate BUILD_BUG_ON()s (Chris)
    Use I915_GTT_PAGE_SIZE (Chris)

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c       | 12 ++++
 drivers/gpu/drm/i915/i915_gem_gtt.c       | 91 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_gtt.h       | 25 +++++++--
 drivers/gpu/drm/i915/i915_vma.c           |  6 +-
 drivers/gpu/drm/i915/i915_vma.h           |  3 +
 drivers/gpu/drm/i915/intel_display.c      | 11 ++++
 drivers/gpu/drm/i915/intel_drv.h          |  1 +
 drivers/gpu/drm/i915/selftests/i915_vma.c |  6 +-
 8 files changed, 147 insertions(+), 8 deletions(-)
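
To picture what the new view describes (values invented for
illustration; the fields are the ones this patch adds, all in GTT
pages): a view that takes 32 pages from each of 64 rows of an object
whose rows are laid out 48 pages apart, packing them back to back so
the resulting mapping has an effective stride of 32 pages:

/* Hypothetical remapped view, not taken from the patch. */
static const struct i915_ggtt_view view = {
	.type = I915_GGTT_VIEW_REMAPPED,
	.remapped.plane[0] = {
		.width  = 32,	/* pages taken from each source row */
		.height = 64,	/* number of rows */
		.stride = 48,	/* source row pitch, in pages */
		.offset = 0,	/* first source page */
	},
};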

Comments

Tvrtko Ursulin Sept. 26, 2018, 7:50 a.m. UTC | #1
On 25/09/2018 20:37, Ville Syrjala wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> To overcome display engine stride limits we'll want to remap the
> pages in the GTT. To that end we need a new gtt_view type which
> is just like the "rotated" type except not rotated.
> 
> v2: Use intel_remapped_plane_info base type
>      s/unused/unused_mbz/ (Chris)
>      Separate BUILD_BUG_ON()s (Chris)
>      Use I915_GTT_PAGE_SIZE (Chris)
> 
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_debugfs.c       | 12 ++++
>   drivers/gpu/drm/i915/i915_gem_gtt.c       | 91 +++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_gtt.h       | 25 +++++++--
>   drivers/gpu/drm/i915/i915_vma.c           |  6 +-
>   drivers/gpu/drm/i915/i915_vma.h           |  3 +
>   drivers/gpu/drm/i915/intel_display.c      | 11 ++++
>   drivers/gpu/drm/i915/intel_drv.h          |  1 +
>   drivers/gpu/drm/i915/selftests/i915_vma.c |  6 +-
>   8 files changed, 147 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index b4744a68cd88..edb9f6f3c0cf 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -196,6 +196,18 @@ describe_obj(struct seq_file *m, struct drm_i915_gem_object *obj)
>   					   vma->ggtt_view.rotated.plane[1].offset);
>   				break;
>   
> +			case I915_GGTT_VIEW_REMAPPED:
> +				seq_printf(m, ", remapped [(%ux%u, stride=%u, offset=%u), (%ux%u, stride=%u, offset=%u)]",
> +					   vma->ggtt_view.remapped.plane[0].width,
> +					   vma->ggtt_view.remapped.plane[0].height,
> +					   vma->ggtt_view.remapped.plane[0].stride,
> +					   vma->ggtt_view.remapped.plane[0].offset,
> +					   vma->ggtt_view.remapped.plane[1].width,
> +					   vma->ggtt_view.remapped.plane[1].height,
> +					   vma->ggtt_view.remapped.plane[1].stride,
> +					   vma->ggtt_view.remapped.plane[1].offset);
> +				break;
> +
>   			default:
>   				MISSING_CASE(vma->ggtt_view.type);
>   				break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index f6c7ab413081..e4d49c345c81 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -3798,6 +3798,92 @@ intel_rotate_pages(struct intel_rotation_info *rot_info,
>   	return ERR_PTR(ret);
>   }
>   
> +static struct scatterlist *
> +remap_pages(const dma_addr_t *in, unsigned int offset,
> +	    unsigned int width, unsigned int height,
> +	    unsigned int stride,
> +	    struct sg_table *st, struct scatterlist *sg)
> +{
> +	unsigned int column, row;
> +
> +	for (row = 0; row < height; row++) {
> +		for (column = 0; column < width; column++) {
> +			st->nents++;
> +			/* We don't need the pages, but need to initialize
> +			 * the entries so the sg list can be happily traversed.
> +			 * The only thing we need are DMA addresses.
> +			 */
> +			sg_set_page(sg, NULL, I915_GTT_PAGE_SIZE, 0);
> +			sg_dma_address(sg) = in[offset + column];
> +			sg_dma_len(sg) = I915_GTT_PAGE_SIZE;
> +			sg = sg_next(sg);

For this type of remapping I think you could potentially get some 
decreased memory use by coalescing consecutive addresses into single sg 
entries - if I understood correctly what it's doing, at least. 
Framebuffers have a good chance of having been initially allocated in 
larger-than-page-size chunks.
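
A rough, untested sketch of what that coalescing might look like (only
to illustrate the suggestion; it changes the convention so that sg
points at the last entry used rather than the next free one, and the
unused tail of the table would then want trimming, as suggested below):

static struct scatterlist *
remap_pages_coalesced(const dma_addr_t *in, unsigned int offset,
		      unsigned int width, unsigned int height,
		      unsigned int stride,
		      struct sg_table *st, struct scatterlist *sg)
{
	unsigned int column, row;

	for (row = 0; row < height; row++) {
		for (column = 0; column < width; column++) {
			dma_addr_t addr = in[offset + column];

			/* Extend the previous entry if this page is
			 * physically contiguous with it.
			 */
			if (st->nents &&
			    sg_dma_address(sg) + sg_dma_len(sg) == addr) {
				sg->length += I915_GTT_PAGE_SIZE;
				sg_dma_len(sg) += I915_GTT_PAGE_SIZE;
				continue;
			}

			if (st->nents)
				sg = sg_next(sg);
			st->nents++;
			sg_set_page(sg, NULL, I915_GTT_PAGE_SIZE, 0);
			sg_dma_address(sg) = addr;
			sg_dma_len(sg) = I915_GTT_PAGE_SIZE;
		}
		offset += stride;
	}

	return sg;
}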

If you decide to do that, then...

> +		}
> +		offset += stride;
> +	}
> +
> +	return sg;
> +}
> +
> +static noinline struct sg_table *
> +intel_remap_pages(struct intel_remapped_info *rem_info,
> +		  struct drm_i915_gem_object *obj)
> +{
> +	const unsigned long n_pages = obj->base.size / I915_GTT_PAGE_SIZE;
> +	unsigned int size = intel_remapped_info_size(rem_info);
> +	struct sgt_iter sgt_iter;
> +	dma_addr_t dma_addr;
> +	unsigned long i;
> +	dma_addr_t *page_addr_list;
> +	struct sg_table *st;
> +	struct scatterlist *sg;
> +	int ret = -ENOMEM;
> +
> +	/* Allocate a temporary list of source pages for random access. */
> +	page_addr_list = kvmalloc_array(n_pages,
> +					sizeof(dma_addr_t),
> +					GFP_KERNEL);
> +	if (!page_addr_list)
> +		return ERR_PTR(ret);
> +
> +	/* Allocate target SG list. */
> +	st = kmalloc(sizeof(*st), GFP_KERNEL);
> +	if (!st)
> +		goto err_st_alloc;
> +
> +	ret = sg_alloc_table(st, size, GFP_KERNEL);
> +	if (ret)
> +		goto err_sg_alloc;
> +
> +	/* Populate source page list from the object. */
> +	i = 0;
> +	for_each_sgt_dma(dma_addr, sgt_iter, obj->mm.pages)
> +		page_addr_list[i++] = dma_addr;
> +
> +	GEM_BUG_ON(i != n_pages);
> +	st->nents = 0;
> +	sg = st->sgl;
> +
> +	for (i = 0 ; i < ARRAY_SIZE(rem_info->plane); i++) {
> +		sg = remap_pages(page_addr_list, rem_info->plane[i].offset,
> +				 rem_info->plane[i].width, rem_info->plane[i].height,
> +				 rem_info->plane[i].stride, st, sg);
> +	}
> +
> +	kvfree(page_addr_list);
> +

... put i915_sg_trim here to lose the trailing end of unused sg entries.

One more thing: do you really need random access for this 
transformation, or could you walk the sg list as it is? If you hit a 
too-long chunk you would just need to copy a trimmed version over and 
remember where to continue for the next row. If doable, it would be 
better than having to kvmalloc_array.

Regards,

Tvrtko

> +	return st;
> +
> +err_sg_alloc:
> +	kfree(st);
> +err_st_alloc:
> +	kvfree(page_addr_list);
> +
> +	DRM_DEBUG_DRIVER("Failed to create remapped mapping for object size %zu! (%ux%u tiles, %u pages)\n",
> +			 obj->base.size, rem_info->plane[0].width, rem_info->plane[0].height, size);
> +
> +	return ERR_PTR(ret);
> +}
> +
>   static noinline struct sg_table *
>   intel_partial_pages(const struct i915_ggtt_view *view,
>   		    struct drm_i915_gem_object *obj)
> @@ -3874,6 +3960,11 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma)
>   			intel_rotate_pages(&vma->ggtt_view.rotated, vma->obj);
>   		break;
>   
> +	case I915_GGTT_VIEW_REMAPPED:
> +		vma->pages =
> +			intel_remap_pages(&vma->ggtt_view.remapped, vma->obj);
> +		break;
> +
>   	case I915_GGTT_VIEW_PARTIAL:
>   		vma->pages = intel_partial_pages(&vma->ggtt_view, vma->obj);
>   		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
> index 7e2af5f4f39b..69a22f57e6ca 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> @@ -160,11 +160,18 @@ typedef u64 gen8_ppgtt_pml4e_t;
>   
>   struct sg_table;
>   
> +struct intel_remapped_plane_info {
> +	/* in gtt pages */
> +	unsigned int width, height, stride, offset;
> +} __packed;
> +
> +struct intel_remapped_info {
> +	struct intel_remapped_plane_info plane[2];
> +	unsigned int unused_mbz;
> +} __packed;
> +
>   struct intel_rotation_info {
> -	struct intel_rotation_plane_info {
> -		/* tiles */
> -		unsigned int width, height, stride, offset;
> -	} plane[2];
> +	struct intel_remapped_plane_info plane[2];
>   } __packed;
>   
>   struct intel_partial_info {
> @@ -176,12 +183,20 @@ enum i915_ggtt_view_type {
>   	I915_GGTT_VIEW_NORMAL = 0,
>   	I915_GGTT_VIEW_ROTATED = sizeof(struct intel_rotation_info),
>   	I915_GGTT_VIEW_PARTIAL = sizeof(struct intel_partial_info),
> +	I915_GGTT_VIEW_REMAPPED = sizeof(struct intel_remapped_info),
>   };
>   
>   static inline void assert_i915_gem_gtt_types(void)
>   {
>   	BUILD_BUG_ON(sizeof(struct intel_rotation_info) != 8*sizeof(unsigned int));
>   	BUILD_BUG_ON(sizeof(struct intel_partial_info) != sizeof(u64) + sizeof(unsigned int));
> +	BUILD_BUG_ON(sizeof(struct intel_remapped_info) != 9*sizeof(unsigned int));
> +
> +	/* Check that rotation/remapped shares offsets for simplicity */
> +	BUILD_BUG_ON(offsetof(struct intel_remapped_info, plane[0]) !=
> +		     offsetof(struct intel_rotation_info, plane[0]));
> +	BUILD_BUG_ON(offsetofend(struct intel_remapped_info, plane[1]) !=
> +		     offsetofend(struct intel_rotation_info, plane[1]));
>   
>   	/* As we encode the size of each branch inside the union into its type,
>   	 * we have to be careful that each branch has a unique size.
> @@ -190,6 +205,7 @@ static inline void assert_i915_gem_gtt_types(void)
>   	case I915_GGTT_VIEW_NORMAL:
>   	case I915_GGTT_VIEW_PARTIAL:
>   	case I915_GGTT_VIEW_ROTATED:
> +	case I915_GGTT_VIEW_REMAPPED:
>   		/* gcc complains if these are identical cases */
>   		break;
>   	}
> @@ -201,6 +217,7 @@ struct i915_ggtt_view {
>   		/* Members need to contain no holes/padding */
>   		struct intel_partial_info partial;
>   		struct intel_rotation_info rotated;
> +		struct intel_remapped_info remapped;
>   	};
>   };
>   
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 31efc971a3a8..1793c702ab94 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -164,6 +164,9 @@ vma_create(struct drm_i915_gem_object *obj,
>   		} else if (view->type == I915_GGTT_VIEW_ROTATED) {
>   			vma->size = intel_rotation_info_size(&view->rotated);
>   			vma->size <<= PAGE_SHIFT;
> +		} else if (view->type == I915_GGTT_VIEW_REMAPPED) {
> +			vma->size = intel_remapped_info_size(&view->remapped);
> +			vma->size <<= PAGE_SHIFT;
>   		}
>   	}
>   
> @@ -464,7 +467,8 @@ void __i915_vma_set_map_and_fenceable(struct i915_vma *vma)
>   	 * Explicitly disable for rotated VMA since the display does not
>   	 * need the fence and the VMA is not accessible to other users.
>   	 */
> -	if (vma->ggtt_view.type == I915_GGTT_VIEW_ROTATED)
> +	if (vma->ggtt_view.type == I915_GGTT_VIEW_ROTATED ||
> +	    vma->ggtt_view.type == I915_GGTT_VIEW_REMAPPED)
>   		return;
>   
>   	fenceable = (vma->node.size >= vma->fence_size &&
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index 4f7c1c7599f4..64cf029c028a 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -265,8 +265,11 @@ i915_vma_compare(struct i915_vma *vma,
>   	 */
>   	BUILD_BUG_ON(I915_GGTT_VIEW_NORMAL >= I915_GGTT_VIEW_PARTIAL);
>   	BUILD_BUG_ON(I915_GGTT_VIEW_PARTIAL >= I915_GGTT_VIEW_ROTATED);
> +	BUILD_BUG_ON(I915_GGTT_VIEW_ROTATED >= I915_GGTT_VIEW_REMAPPED);
>   	BUILD_BUG_ON(offsetof(typeof(*view), rotated) !=
>   		     offsetof(typeof(*view), partial));
> +	BUILD_BUG_ON(offsetof(typeof(*view), rotated) !=
> +		     offsetof(typeof(*view), remapped));
>   	return memcmp(&vma->ggtt_view.partial, &view->partial, view->type);
>   }
>   
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 1fe5cf3ed062..4f9f519ccd46 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -2006,6 +2006,17 @@ unsigned int intel_rotation_info_size(const struct intel_rotation_info *rot_info
>   	return size;
>   }
>   
> +unsigned int intel_remapped_info_size(const struct intel_remapped_info *rem_info)
> +{
> +	unsigned int size = 0;
> +	int i;
> +
> +	for (i = 0 ; i < ARRAY_SIZE(rem_info->plane); i++)
> +		size += rem_info->plane[i].width * rem_info->plane[i].height;
> +
> +	return size;
> +}
> +
>   static void
>   intel_fill_fb_ggtt_view(struct i915_ggtt_view *view,
>   			const struct drm_framebuffer *fb,
> diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> index 4fcba99d12c0..a88a16b74f24 100644
> --- a/drivers/gpu/drm/i915/intel_drv.h
> +++ b/drivers/gpu/drm/i915/intel_drv.h
> @@ -1503,6 +1503,7 @@ unsigned int intel_fb_xy_to_linear(int x, int y,
>   void intel_add_fb_offsets(int *x, int *y,
>   			  const struct intel_plane_state *state, int plane);
>   unsigned int intel_rotation_info_size(const struct intel_rotation_info *rot_info);
> +unsigned int intel_remapped_info_size(const struct intel_remapped_info *rem_info);
>   bool intel_has_pending_fb_unpin(struct drm_i915_private *dev_priv);
>   void intel_mark_busy(struct drm_i915_private *dev_priv);
>   void intel_mark_idle(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> index ffa74290e054..4fc49c27f13c 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> @@ -395,8 +395,8 @@ assert_rotated(struct drm_i915_gem_object *obj,
>   	return sg;
>   }
>   
> -static unsigned int rotated_size(const struct intel_rotation_plane_info *a,
> -				 const struct intel_rotation_plane_info *b)
> +static unsigned int rotated_size(const struct intel_remapped_plane_info *a,
> +				 const struct intel_remapped_plane_info *b)
>   {
>   	return a->width * a->height + b->width * b->height;
>   }
> @@ -406,7 +406,7 @@ static int igt_vma_rotate(void *arg)
>   	struct drm_i915_private *i915 = arg;
>   	struct i915_address_space *vm = &i915->ggtt.vm;
>   	struct drm_i915_gem_object *obj;
> -	const struct intel_rotation_plane_info planes[] = {
> +	const struct intel_remapped_plane_info planes[] = {
>   		{ .width = 1, .height = 1, .stride = 1 },
>   		{ .width = 2, .height = 2, .stride = 2 },
>   		{ .width = 4, .height = 4, .stride = 4 },
>
Ville Syrjälä Oct. 1, 2018, 3:03 p.m. UTC | #2
On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> 
> On 25/09/2018 20:37, Ville Syrjala wrote:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > To overcome display engine stride limits we'll want to remap the
> > pages in the GTT. To that end we need a new gtt_view type which
> > is just like the "rotated" type except not rotated.
> > 
> > v2: Use intel_remapped_plane_info base type
> >      s/unused/unused_mbz/ (Chris)
> >      Separate BUILD_BUG_ON()s (Chris)
> >      Use I915_GTT_PAGE_SIZE (Chris)
> > 
> > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_debugfs.c       | 12 ++++
> >   drivers/gpu/drm/i915/i915_gem_gtt.c       | 91 +++++++++++++++++++++++++++++++
> >   drivers/gpu/drm/i915/i915_gem_gtt.h       | 25 +++++++--
> >   drivers/gpu/drm/i915/i915_vma.c           |  6 +-
> >   drivers/gpu/drm/i915/i915_vma.h           |  3 +
> >   drivers/gpu/drm/i915/intel_display.c      | 11 ++++
> >   drivers/gpu/drm/i915/intel_drv.h          |  1 +
> >   drivers/gpu/drm/i915/selftests/i915_vma.c |  6 +-
> >   8 files changed, 147 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index b4744a68cd88..edb9f6f3c0cf 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -196,6 +196,18 @@ describe_obj(struct seq_file *m, struct drm_i915_gem_object *obj)
> >   					   vma->ggtt_view.rotated.plane[1].offset);
> >   				break;
> >   
> > +			case I915_GGTT_VIEW_REMAPPED:
> > +				seq_printf(m, ", remapped [(%ux%u, stride=%u, offset=%u), (%ux%u, stride=%u, offset=%u)]",
> > +					   vma->ggtt_view.remapped.plane[0].width,
> > +					   vma->ggtt_view.remapped.plane[0].height,
> > +					   vma->ggtt_view.remapped.plane[0].stride,
> > +					   vma->ggtt_view.remapped.plane[0].offset,
> > +					   vma->ggtt_view.remapped.plane[1].width,
> > +					   vma->ggtt_view.remapped.plane[1].height,
> > +					   vma->ggtt_view.remapped.plane[1].stride,
> > +					   vma->ggtt_view.remapped.plane[1].offset);
> > +				break;
> > +
> >   			default:
> >   				MISSING_CASE(vma->ggtt_view.type);
> >   				break;
> > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > index f6c7ab413081..e4d49c345c81 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > @@ -3798,6 +3798,92 @@ intel_rotate_pages(struct intel_rotation_info *rot_info,
> >   	return ERR_PTR(ret);
> >   }
> >   
> > +static struct scatterlist *
> > +remap_pages(const dma_addr_t *in, unsigned int offset,
> > +	    unsigned int width, unsigned int height,
> > +	    unsigned int stride,
> > +	    struct sg_table *st, struct scatterlist *sg)
> > +{
> > +	unsigned int column, row;
> > +
> > +	for (row = 0; row < height; row++) {
> > +		for (column = 0; column < width; column++) {
> > +			st->nents++;
> > +			/* We don't need the pages, but need to initialize
> > +			 * the entries so the sg list can be happily traversed.
> > +			 * The only thing we need are DMA addresses.
> > +			 */
> > +			sg_set_page(sg, NULL, I915_GTT_PAGE_SIZE, 0);
> > +			sg_dma_address(sg) = in[offset + column];
> > +			sg_dma_len(sg) = I915_GTT_PAGE_SIZE;
> > +			sg = sg_next(sg);
> 
> For this type of remapping you could I think potentially get some 
> decreased memory use benefits by coalescing the consecutive addresses 
> into single sg entries. If I understood correctly what it's doing at 
> least.. Since framebuffers could have a good chance to have been 
> initially allocated in larger than page size chunks.
> 
> If you decide to do that, then...
> 
> > +		}
> > +		offset += stride;
> > +	}
> > +
> > +	return sg;
> > +}
> > +
> > +static noinline struct sg_table *
> > +intel_remap_pages(struct intel_remapped_info *rem_info,
> > +		  struct drm_i915_gem_object *obj)
> > +{
> > +	const unsigned long n_pages = obj->base.size / I915_GTT_PAGE_SIZE;
> > +	unsigned int size = intel_remapped_info_size(rem_info);
> > +	struct sgt_iter sgt_iter;
> > +	dma_addr_t dma_addr;
> > +	unsigned long i;
> > +	dma_addr_t *page_addr_list;
> > +	struct sg_table *st;
> > +	struct scatterlist *sg;
> > +	int ret = -ENOMEM;
> > +
> > +	/* Allocate a temporary list of source pages for random access. */
> > +	page_addr_list = kvmalloc_array(n_pages,
> > +					sizeof(dma_addr_t),
> > +					GFP_KERNEL);
> > +	if (!page_addr_list)
> > +		return ERR_PTR(ret);
> > +
> > +	/* Allocate target SG list. */
> > +	st = kmalloc(sizeof(*st), GFP_KERNEL);
> > +	if (!st)
> > +		goto err_st_alloc;
> > +
> > +	ret = sg_alloc_table(st, size, GFP_KERNEL);
> > +	if (ret)
> > +		goto err_sg_alloc;
> > +
> > +	/* Populate source page list from the object. */
> > +	i = 0;
> > +	for_each_sgt_dma(dma_addr, sgt_iter, obj->mm.pages)
> > +		page_addr_list[i++] = dma_addr;
> > +
> > +	GEM_BUG_ON(i != n_pages);
> > +	st->nents = 0;
> > +	sg = st->sgl;
> > +
> > +	for (i = 0 ; i < ARRAY_SIZE(rem_info->plane); i++) {
> > +		sg = remap_pages(page_addr_list, rem_info->plane[i].offset,
> > +				 rem_info->plane[i].width, rem_info->plane[i].height,
> > +				 rem_info->plane[i].stride, st, sg);
> > +	}
> > +
> > +	kvfree(page_addr_list);
> > +
> 
> ... put i915_sg_trim here to lose the trailing end of unused sg entries.

Yeah, I guess that makes sense.

> 
> One more thing, do you really need random access for this 
> transformation? Or you could walk the sg list as it is? Just if you hit 
> a too long chunk you need to copy a trimmed version over and know where 
> to continue for the next row. If doable it would be better than having 
> to kvmalloc_array.

I think Chris suggested just using i915_gem_object_get_dma_address()
here. But I'm not sure why we're not using it for rotate_pages()
as well.

> 
> Regards,
> 
> Tvrtko
> 
> > +	return st;
> > +
> > +err_sg_alloc:
> > +	kfree(st);
> > +err_st_alloc:
> > +	kvfree(page_addr_list);
> > +
> > +	DRM_DEBUG_DRIVER("Failed to create remapped mapping for object size %zu! (%ux%u tiles, %u pages)\n",
> > +			 obj->base.size, rem_info->plane[0].width, rem_info->plane[0].height, size);
> > +
> > +	return ERR_PTR(ret);
> > +}
> > +
> >   static noinline struct sg_table *
> >   intel_partial_pages(const struct i915_ggtt_view *view,
> >   		    struct drm_i915_gem_object *obj)
> > @@ -3874,6 +3960,11 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma)
> >   			intel_rotate_pages(&vma->ggtt_view.rotated, vma->obj);
> >   		break;
> >   
> > +	case I915_GGTT_VIEW_REMAPPED:
> > +		vma->pages =
> > +			intel_remap_pages(&vma->ggtt_view.remapped, vma->obj);
> > +		break;
> > +
> >   	case I915_GGTT_VIEW_PARTIAL:
> >   		vma->pages = intel_partial_pages(&vma->ggtt_view, vma->obj);
> >   		break;
> > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > index 7e2af5f4f39b..69a22f57e6ca 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > @@ -160,11 +160,18 @@ typedef u64 gen8_ppgtt_pml4e_t;
> >   
> >   struct sg_table;
> >   
> > +struct intel_remapped_plane_info {
> > +	/* in gtt pages */
> > +	unsigned int width, height, stride, offset;
> > +} __packed;
> > +
> > +struct intel_remapped_info {
> > +	struct intel_remapped_plane_info plane[2];
> > +	unsigned int unused_mbz;
> > +} __packed;
> > +
> >   struct intel_rotation_info {
> > -	struct intel_rotation_plane_info {
> > -		/* tiles */
> > -		unsigned int width, height, stride, offset;
> > -	} plane[2];
> > +	struct intel_remapped_plane_info plane[2];
> >   } __packed;
> >   
> >   struct intel_partial_info {
> > @@ -176,12 +183,20 @@ enum i915_ggtt_view_type {
> >   	I915_GGTT_VIEW_NORMAL = 0,
> >   	I915_GGTT_VIEW_ROTATED = sizeof(struct intel_rotation_info),
> >   	I915_GGTT_VIEW_PARTIAL = sizeof(struct intel_partial_info),
> > +	I915_GGTT_VIEW_REMAPPED = sizeof(struct intel_remapped_info),
> >   };
> >   
> >   static inline void assert_i915_gem_gtt_types(void)
> >   {
> >   	BUILD_BUG_ON(sizeof(struct intel_rotation_info) != 8*sizeof(unsigned int));
> >   	BUILD_BUG_ON(sizeof(struct intel_partial_info) != sizeof(u64) + sizeof(unsigned int));
> > +	BUILD_BUG_ON(sizeof(struct intel_remapped_info) != 9*sizeof(unsigned int));
> > +
> > +	/* Check that rotation/remapped shares offsets for simplicity */
> > +	BUILD_BUG_ON(offsetof(struct intel_remapped_info, plane[0]) !=
> > +		     offsetof(struct intel_rotation_info, plane[0]));
> > +	BUILD_BUG_ON(offsetofend(struct intel_remapped_info, plane[1]) !=
> > +		     offsetofend(struct intel_rotation_info, plane[1]));
> >   
> >   	/* As we encode the size of each branch inside the union into its type,
> >   	 * we have to be careful that each branch has a unique size.
> > @@ -190,6 +205,7 @@ static inline void assert_i915_gem_gtt_types(void)
> >   	case I915_GGTT_VIEW_NORMAL:
> >   	case I915_GGTT_VIEW_PARTIAL:
> >   	case I915_GGTT_VIEW_ROTATED:
> > +	case I915_GGTT_VIEW_REMAPPED:
> >   		/* gcc complains if these are identical cases */
> >   		break;
> >   	}
> > @@ -201,6 +217,7 @@ struct i915_ggtt_view {
> >   		/* Members need to contain no holes/padding */
> >   		struct intel_partial_info partial;
> >   		struct intel_rotation_info rotated;
> > +		struct intel_remapped_info remapped;
> >   	};
> >   };
> >   
> > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> > index 31efc971a3a8..1793c702ab94 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/i915_vma.c
> > @@ -164,6 +164,9 @@ vma_create(struct drm_i915_gem_object *obj,
> >   		} else if (view->type == I915_GGTT_VIEW_ROTATED) {
> >   			vma->size = intel_rotation_info_size(&view->rotated);
> >   			vma->size <<= PAGE_SHIFT;
> > +		} else if (view->type == I915_GGTT_VIEW_REMAPPED) {
> > +			vma->size = intel_remapped_info_size(&view->remapped);
> > +			vma->size <<= PAGE_SHIFT;
> >   		}
> >   	}
> >   
> > @@ -464,7 +467,8 @@ void __i915_vma_set_map_and_fenceable(struct i915_vma *vma)
> >   	 * Explicitly disable for rotated VMA since the display does not
> >   	 * need the fence and the VMA is not accessible to other users.
> >   	 */
> > -	if (vma->ggtt_view.type == I915_GGTT_VIEW_ROTATED)
> > +	if (vma->ggtt_view.type == I915_GGTT_VIEW_ROTATED ||
> > +	    vma->ggtt_view.type == I915_GGTT_VIEW_REMAPPED)
> >   		return;
> >   
> >   	fenceable = (vma->node.size >= vma->fence_size &&
> > diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> > index 4f7c1c7599f4..64cf029c028a 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.h
> > +++ b/drivers/gpu/drm/i915/i915_vma.h
> > @@ -265,8 +265,11 @@ i915_vma_compare(struct i915_vma *vma,
> >   	 */
> >   	BUILD_BUG_ON(I915_GGTT_VIEW_NORMAL >= I915_GGTT_VIEW_PARTIAL);
> >   	BUILD_BUG_ON(I915_GGTT_VIEW_PARTIAL >= I915_GGTT_VIEW_ROTATED);
> > +	BUILD_BUG_ON(I915_GGTT_VIEW_ROTATED >= I915_GGTT_VIEW_REMAPPED);
> >   	BUILD_BUG_ON(offsetof(typeof(*view), rotated) !=
> >   		     offsetof(typeof(*view), partial));
> > +	BUILD_BUG_ON(offsetof(typeof(*view), rotated) !=
> > +		     offsetof(typeof(*view), remapped));
> >   	return memcmp(&vma->ggtt_view.partial, &view->partial, view->type);
> >   }
> >   
> > diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> > index 1fe5cf3ed062..4f9f519ccd46 100644
> > --- a/drivers/gpu/drm/i915/intel_display.c
> > +++ b/drivers/gpu/drm/i915/intel_display.c
> > @@ -2006,6 +2006,17 @@ unsigned int intel_rotation_info_size(const struct intel_rotation_info *rot_info
> >   	return size;
> >   }
> >   
> > +unsigned int intel_remapped_info_size(const struct intel_remapped_info *rem_info)
> > +{
> > +	unsigned int size = 0;
> > +	int i;
> > +
> > +	for (i = 0 ; i < ARRAY_SIZE(rem_info->plane); i++)
> > +		size += rem_info->plane[i].width * rem_info->plane[i].height;
> > +
> > +	return size;
> > +}
> > +
> >   static void
> >   intel_fill_fb_ggtt_view(struct i915_ggtt_view *view,
> >   			const struct drm_framebuffer *fb,
> > diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
> > index 4fcba99d12c0..a88a16b74f24 100644
> > --- a/drivers/gpu/drm/i915/intel_drv.h
> > +++ b/drivers/gpu/drm/i915/intel_drv.h
> > @@ -1503,6 +1503,7 @@ unsigned int intel_fb_xy_to_linear(int x, int y,
> >   void intel_add_fb_offsets(int *x, int *y,
> >   			  const struct intel_plane_state *state, int plane);
> >   unsigned int intel_rotation_info_size(const struct intel_rotation_info *rot_info);
> > +unsigned int intel_remapped_info_size(const struct intel_remapped_info *rem_info);
> >   bool intel_has_pending_fb_unpin(struct drm_i915_private *dev_priv);
> >   void intel_mark_busy(struct drm_i915_private *dev_priv);
> >   void intel_mark_idle(struct drm_i915_private *dev_priv);
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > index ffa74290e054..4fc49c27f13c 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
> > @@ -395,8 +395,8 @@ assert_rotated(struct drm_i915_gem_object *obj,
> >   	return sg;
> >   }
> >   
> > -static unsigned int rotated_size(const struct intel_rotation_plane_info *a,
> > -				 const struct intel_rotation_plane_info *b)
> > +static unsigned int rotated_size(const struct intel_remapped_plane_info *a,
> > +				 const struct intel_remapped_plane_info *b)
> >   {
> >   	return a->width * a->height + b->width * b->height;
> >   }
> > @@ -406,7 +406,7 @@ static int igt_vma_rotate(void *arg)
> >   	struct drm_i915_private *i915 = arg;
> >   	struct i915_address_space *vm = &i915->ggtt.vm;
> >   	struct drm_i915_gem_object *obj;
> > -	const struct intel_rotation_plane_info planes[] = {
> > +	const struct intel_remapped_plane_info planes[] = {
> >   		{ .width = 1, .height = 1, .stride = 1 },
> >   		{ .width = 2, .height = 2, .stride = 2 },
> >   		{ .width = 4, .height = 4, .stride = 4 },
> >
Chris Wilson Oct. 1, 2018, 3:12 p.m. UTC | #3
Quoting Ville Syrjälä (2018-10-01 16:03:30)
> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> > 
> > On 25/09/2018 20:37, Ville Syrjala wrote:
> > One more thing, do you really need random access for this 
> > transformation? Or you could walk the sg list as it is? Just if you hit 
> > a too long chunk you need to copy a trimmed version over and know where 
> > to continue for the next row. If doable it would be better than having 
> > to kvmalloc_array.
> 
> I think Chris suggested just using i915_gem_object_get_dma_address()
> here. But I'm not sure why we're not using it for rotate_pages()
> as well.

Tvrtko is opposed to populating the obj->mm.pages cache with no defined
release point. I say the mempressure and shrinker should do the right
thing, but it's a big if.
-Chris
Ville Syrjälä Oct. 1, 2018, 3:27 p.m. UTC | #4
On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
> Quoting Ville Syrjälä (2018-10-01 16:03:30)
> > On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 25/09/2018 20:37, Ville Syrjala wrote:
> > > One more thing, do you really need random access for this 
> > > transformation? Or you could walk the sg list as it is? Just if you hit 
> > > a too long chunk you need to copy a trimmed version over and know where 
> > > to continue for the next row. If doable it would be better than having 
> > > to kvmalloc_array.
> > 
> > I think Chris suggested just using i915_gem_object_get_dma_address()
> > here. But I'm not sure why we're not using it for rotate_pages()
> > as well.
> 
> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
> release point. I say the mempressure and shrinker should to the right
> thing, but it's a big if.

OK.

Well, looks to me like i915_gem_object_get_dma_address() is the
only convenient looking thing for iterating the pages without
drowning the code in irrelevant details about sgs and whatnot.
I suppose it should be possible to write some helpers that avoid
all that and don't need the temp array, but I'm not really
motivated enough to do that myself.
Tvrtko Ursulin Oct. 1, 2018, 3:35 p.m. UTC | #5
On 01/10/2018 16:12, Chris Wilson wrote:
> Quoting Ville Syrjälä (2018-10-01 16:03:30)
>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 25/09/2018 20:37, Ville Syrjala wrote:
>>> One more thing, do you really need random access for this
>>> transformation? Or you could walk the sg list as it is? Just if you hit
>>> a too long chunk you need to copy a trimmed version over and know where
>>> to continue for the next row. If doable it would be better than having
>>> to kvmalloc_array.
>>
>> I think Chris suggested just using i915_gem_object_get_dma_address()
>> here. But I'm not sure why we're not using it for rotate_pages()
>> as well.
> 
> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
> release point. I say the mempressure and shrinker should to the right
> thing, but it's a big if.

Wasn't there some compromise we agreed last time? Like that you would 
just add an API to drop the cache? So users like this one and rotation, 
where it is known to be very unlikely the cache would be used again, 
would be able to drop it immediately after remapping?

But maybe you had an argument that the cache could be used after all? So 
pread/pwrite on framebuffers likely? Relocations wouldn't be. Is there 
something else?

Maybe we should resolve this since it's been going on for a very long time. :)

Regards,

Tvrtko
Chris Wilson Oct. 1, 2018, 3:37 p.m. UTC | #6
Quoting Ville Syrjälä (2018-10-01 16:27:43)
> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
> > Quoting Ville Syrjälä (2018-10-01 16:03:30)
> > > On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> > > > 
> > > > On 25/09/2018 20:37, Ville Syrjala wrote:
> > > > One more thing, do you really need random access for this 
> > > > transformation? Or you could walk the sg list as it is? Just if you hit 
> > > > a too long chunk you need to copy a trimmed version over and know where 
> > > > to continue for the next row. If doable it would be better than having 
> > > > to kvmalloc_array.
> > > 
> > > I think Chris suggested just using i915_gem_object_get_dma_address()
> > > here. But I'm not sure why we're not using it for rotate_pages()
> > > as well.
> > 
> > Tvrtko is opposed to populating the obj->mm.pages cache with no defined
> > release point. I say the mempressure and shrinker should to the right
> > thing, but it's a big if.
> 
> OK.
> 
> Well, looks to me like i915_gem_object_get_dma_address() is the
> only convenient looking thing for iterating the pages without
> arowning the code in irrelevant details about sgs and whatnot.
> I suppose it should be possible to write some helpers that avoid
> all that and don't need the temp array, but I'm not really
> motivated enough to do that myself.

Keep it simple and use get_dma_address(). We can find ways to throw away
the cache later if need be.
-Chris
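
For reference, a sketch of how remap_pages() might look when it reads
the addresses through i915_gem_object_get_dma_address() instead of a
pre-built temporary array (just a sketch of the suggestion, not the
patch that was eventually committed):

static struct scatterlist *
remap_pages(struct drm_i915_gem_object *obj, unsigned int offset,
	    unsigned int width, unsigned int height,
	    unsigned int stride,
	    struct sg_table *st, struct scatterlist *sg)
{
	unsigned int column, row;

	for (row = 0; row < height; row++) {
		for (column = 0; column < width; column++) {
			st->nents++;
			/* We still only need the DMA addresses; the page
			 * lookup goes through the object's page iterator
			 * cache instead of a kvmalloc'ed copy.
			 */
			sg_set_page(sg, NULL, I915_GTT_PAGE_SIZE, 0);
			sg_dma_address(sg) =
				i915_gem_object_get_dma_address(obj,
								offset + column);
			sg_dma_len(sg) = I915_GTT_PAGE_SIZE;
			sg = sg_next(sg);
		}
		offset += stride;
	}

	return sg;
}

intel_remap_pages() would then pass the object down instead of
page_addr_list, and the kvmalloc_array()/for_each_sgt_dma() setup (and
kvfree()) disappears.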
Tvrtko Ursulin Oct. 1, 2018, 3:38 p.m. UTC | #7
On 01/10/2018 16:27, Ville Syrjälä wrote:
> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
>>>> One more thing, do you really need random access for this
>>>> transformation? Or you could walk the sg list as it is? Just if you hit
>>>> a too long chunk you need to copy a trimmed version over and know where
>>>> to continue for the next row. If doable it would be better than having
>>>> to kvmalloc_array.
>>>
>>> I think Chris suggested just using i915_gem_object_get_dma_address()
>>> here. But I'm not sure why we're not using it for rotate_pages()
>>> as well.
>>
>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
>> release point. I say the mempressure and shrinker should to the right
>> thing, but it's a big if.
> 
> OK.
> 
> Well, looks to me like i915_gem_object_get_dma_address() is the
> only convenient looking thing for iterating the pages without
> arowning the code in irrelevant details about sgs and whatnot.
> I suppose it should be possible to write some helpers that avoid
> all that and don't need the temp array, but I'm not really
> motivated enough to do that myself.

That's fine by me, I primarily suggested you could coalesce and trim 
when done.

Regards,

Tvrtko
Tvrtko Ursulin Oct. 1, 2018, 3:48 p.m. UTC | #8
On 01/10/2018 16:37, Chris Wilson wrote:
> Quoting Ville Syrjälä (2018-10-01 16:27:43)
>> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
>>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
>>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
>>>>> One more thing, do you really need random access for this
>>>>> transformation? Or you could walk the sg list as it is? Just if you hit
>>>>> a too long chunk you need to copy a trimmed version over and know where
>>>>> to continue for the next row. If doable it would be better than having
>>>>> to kvmalloc_array.
>>>>
>>>> I think Chris suggested just using i915_gem_object_get_dma_address()
>>>> here. But I'm not sure why we're not using it for rotate_pages()
>>>> as well.
>>>
>>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
>>> release point. I say the mempressure and shrinker should to the right
>>> thing, but it's a big if.
>>
>> OK.
>>
>> Well, looks to me like i915_gem_object_get_dma_address() is the
>> only convenient looking thing for iterating the pages without
>> arowning the code in irrelevant details about sgs and whatnot.
>> I suppose it should be possible to write some helpers that avoid
>> all that and don't need the temp array, but I'm not really
>> motivated enough to do that myself.
> 
> Keep it simple and use get_dma_address(). We can find ways to throw away
> the cache later if need be.

I'd do it straight away. I think the cache for a large framebuffer, the 
kind which needs remapping, could be quite big! And the more fragmented 
the memory, the bigger the cache, so if it sticks around pointlessly for 
the lifetime of the framebuffer it is a double whammy.

Regards,

Tvrtko
Ville Syrjälä Oct. 5, 2018, 6:42 p.m. UTC | #9
On Mon, Oct 01, 2018 at 04:48:21PM +0100, Tvrtko Ursulin wrote:
> 
> On 01/10/2018 16:37, Chris Wilson wrote:
> > Quoting Ville Syrjälä (2018-10-01 16:27:43)
> >> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
> >>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
> >>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> >>>>>
> >>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
> >>>>> One more thing, do you really need random access for this
> >>>>> transformation? Or you could walk the sg list as it is? Just if you hit
> >>>>> a too long chunk you need to copy a trimmed version over and know where
> >>>>> to continue for the next row. If doable it would be better than having
> >>>>> to kvmalloc_array.
> >>>>
> >>>> I think Chris suggested just using i915_gem_object_get_dma_address()
> >>>> here. But I'm not sure why we're not using it for rotate_pages()
> >>>> as well.
> >>>
> >>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
> >>> release point. I say the mempressure and shrinker should to the right
> >>> thing, but it's a big if.
> >>
> >> OK.
> >>
> >> Well, looks to me like i915_gem_object_get_dma_address() is the
> >> only convenient looking thing for iterating the pages without
> >> arowning the code in irrelevant details about sgs and whatnot.
> >> I suppose it should be possible to write some helpers that avoid
> >> all that and don't need the temp array, but I'm not really
> >> motivated enough to do that myself.
> > 
> > Keep it simple and use get_dma_address(). We can find ways to throw away
> > the cache later if need be.
> 
> I'd do it straight away. I think cache for a large framebuffer, the kind 
> which needs remapping could be quite big! Even the more fragmented 
> memory the bigger the cache, and so if it sticks around pointlessly for 
> the lifetime of the framebuffer it is a double whammy.

The tree is indexed with the ggtt offset so memory fragmentation
shouldn't matter I think. Or did I totally miss something?

The relative overhead should be highest for a single page object
(576/4096 = ~15%). For big objects it should be something around
.2% AFAICS.
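
For context, that estimate works out as follows, assuming a ~576-byte,
64-slot radix_tree_node (the node size here is an assumption about
64-bit kernels of that era, not something stated in the thread):

\[
\frac{576}{4096} \approx 14\%, \qquad
\frac{576}{64 \times 4096} \approx 0.22\%
\]

i.e. one node caching a single 4 KiB page versus one node with all 64
slots in use.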

The shrinker can throw the cache out for non-pinned fbs. Looks
like it does consider all fbs active so it tries not to kick
them out as the first thing though. For pinned fbs the cache
would remain allocated forever however.

Keeping the cache around should be great for panning within
large remapped fbs. Although I suppose that panning isn't a
very common use case these days.
Tvrtko Ursulin Oct. 9, 2018, 8:24 a.m. UTC | #10
On 05/10/2018 19:42, Ville Syrjälä wrote:
> On Mon, Oct 01, 2018 at 04:48:21PM +0100, Tvrtko Ursulin wrote:
>>
>> On 01/10/2018 16:37, Chris Wilson wrote:
>>> Quoting Ville Syrjälä (2018-10-01 16:27:43)
>>>> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
>>>>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
>>>>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
>>>>>>> One more thing, do you really need random access for this
>>>>>>> transformation? Or you could walk the sg list as it is? Just if you hit
>>>>>>> a too long chunk you need to copy a trimmed version over and know where
>>>>>>> to continue for the next row. If doable it would be better than having
>>>>>>> to kvmalloc_array.
>>>>>>
>>>>>> I think Chris suggested just using i915_gem_object_get_dma_address()
>>>>>> here. But I'm not sure why we're not using it for rotate_pages()
>>>>>> as well.
>>>>>
>>>>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
>>>>> release point. I say the mempressure and shrinker should to the right
>>>>> thing, but it's a big if.
>>>>
>>>> OK.
>>>>
>>>> Well, looks to me like i915_gem_object_get_dma_address() is the
>>>> only convenient looking thing for iterating the pages without
>>>> arowning the code in irrelevant details about sgs and whatnot.
>>>> I suppose it should be possible to write some helpers that avoid
>>>> all that and don't need the temp array, but I'm not really
>>>> motivated enough to do that myself.
>>>
>>> Keep it simple and use get_dma_address(). We can find ways to throw away
>>> the cache later if need be.
>>
>> I'd do it straight away. I think cache for a large framebuffer, the kind
>> which needs remapping could be quite big! Even the more fragmented
>> memory the bigger the cache, and so if it sticks around pointlessly for
>> the lifetime of the framebuffer it is a double whammy.
> 
> The tree is indexed with the ggtt offset so memory fragmentation
> shouldn't matter I think. Or did I totally miss something?

I think it is indexed by page index, but the number of tree entries 
depends on the number of sg chunks, which in turn depends on the system 
memory fragmentation. Consecutive pages are stored as "exceptional" 
entries so that is more efficient IIRC, but TBH I don't remember how 
many of those it can store before it needs a new tree node. Anyways, if 
I am not completely mistaken, the tree metadata size is proportional to 
backing store fragmentation, and secondarily proportional to object 
size (due to exceptional entry spill).

> The relative overhead should be highest for a single page object
> (576/4096 = ~15%). For big objects it should be something around
> .2% AFAICS.

It is a bit annoying since the radix tree does not have a helper to tell 
us the number of allocated nodes. I don't remember how I measured this 
last time. Will try to recall.

> 
> The shrinker can throw the cache out for non-pinned fbs. Looks
> like it does consider all fbs active so it tries not to kick
> them out as the first thing though. For pinned fbs the cache
> would remain allocated forever however.
> 
> Keeping the cache around should be great for panning within
> large remapped fbs. Although I suppose that panning isn't a
> very common use case these days

We could just export __i915_gem_object_reset_page_iter, look at the 
required locking when used from rotate/remap, and I'd be happy to start 
with that. But okay, let me first try to recall/re-measure how big 
these trees are...

Regards,

Tvrtko
Chris Wilson Oct. 9, 2018, 8:41 a.m. UTC | #11
Quoting Tvrtko Ursulin (2018-10-09 09:24:33)
> 
> On 05/10/2018 19:42, Ville Syrjälä wrote:
> > On Mon, Oct 01, 2018 at 04:48:21PM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 01/10/2018 16:37, Chris Wilson wrote:
> >>> Quoting Ville Syrjälä (2018-10-01 16:27:43)
> >>>> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
> >>>>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
> >>>>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> >>>>>>>
> >>>>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
> >>>>>>> One more thing, do you really need random access for this
> >>>>>>> transformation? Or you could walk the sg list as it is? Just if you hit
> >>>>>>> a too long chunk you need to copy a trimmed version over and know where
> >>>>>>> to continue for the next row. If doable it would be better than having
> >>>>>>> to kvmalloc_array.
> >>>>>>
> >>>>>> I think Chris suggested just using i915_gem_object_get_dma_address()
> >>>>>> here. But I'm not sure why we're not using it for rotate_pages()
> >>>>>> as well.
> >>>>>
> >>>>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
> >>>>> release point. I say the mempressure and shrinker should to the right
> >>>>> thing, but it's a big if.
> >>>>
> >>>> OK.
> >>>>
> >>>> Well, looks to me like i915_gem_object_get_dma_address() is the
> >>>> only convenient looking thing for iterating the pages without
> >>>> arowning the code in irrelevant details about sgs and whatnot.
> >>>> I suppose it should be possible to write some helpers that avoid
> >>>> all that and don't need the temp array, but I'm not really
> >>>> motivated enough to do that myself.
> >>>
> >>> Keep it simple and use get_dma_address(). We can find ways to throw away
> >>> the cache later if need be.
> >>
> >> I'd do it straight away. I think cache for a large framebuffer, the kind
> >> which needs remapping could be quite big! Even the more fragmented
> >> memory the bigger the cache, and so if it sticks around pointlessly for
> >> the lifetime of the framebuffer it is a double whammy.
> > 
> > The tree is indexed with the ggtt offset so memory fragmentation
> > shouldn't matter I think. Or did I totally miss something?
> 
> I think it is indexed by page index, but the number of tree entries 
> depends on the number of sg chunks, which in turn depends on the system 
> memory fragmentation. Consecutive pages are stored as "exceptional" 
> entries so that is more efficient IIRC, but TBH I don't remember how 
> many of those it can store before it needs a new tree node. Anyways, if 
> I am not completely mistaken then the tree metadata size is proportional 
> to backing store fragmentation, and secondary proportional to object 
> size. (Due exceptional entry spill.)
> 
> > The relative overhead should be highest for a single page object
> > (576/4096 = ~15%). For big objects it should be something around
> > .2% AFAICS.
> 
> It is a bit annoying since radix tree does not have a helper to tell us 
> the number of allocated nodes. I don't remember how I measure this last 
> time.. Will try to recall.
> 
> > 
> > The shrinker can throw the cache out for non-pinned fbs. Looks
> > like it does consider all fbs active so it tries not to kick
> > them out as the first thing though. For pinned fbs the cache
> > would remain allocated forever however.
> > 
> > Keeping the cache around should be great for panning within
> > large remapped fbs. Although I suppose that panning isn't a
> > very common use case these days
> 
> We could just export __i915_gem_object_reset_page_iter, and look at the 
> required locking when used from rotate/remap and I'd be happy to start 
> with. But okay, let me first try to recall/re-measure how big are these 
> trees...

Also, there should soon be a generic radix tree implementation, so we
should be free to swap out the tagged nodes for a more streamlined tree.

But seriously, the sole argument is that you dislike the caching and you
even argue against the caching in situations like this where the cache
will be reused! (Since we need to remap each crtc into the single fb.)

We can't even randomly call reset_page_iter unless you know you own the
iter (since it's common to everyone who has a pin_pages). Revoking
unpinned pages under mempressure, i.e. the default shrinker behaviour,
seems satisfactory.
-Chris
Ville Syrjälä Oct. 9, 2018, 11:54 a.m. UTC | #12
On Tue, Oct 09, 2018 at 09:24:33AM +0100, Tvrtko Ursulin wrote:
> 
> On 05/10/2018 19:42, Ville Syrjälä wrote:
> > On Mon, Oct 01, 2018 at 04:48:21PM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 01/10/2018 16:37, Chris Wilson wrote:
> >>> Quoting Ville Syrjälä (2018-10-01 16:27:43)
> >>>> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
> >>>>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
> >>>>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
> >>>>>>>
> >>>>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
> >>>>>>> One more thing, do you really need random access for this
> >>>>>>> transformation? Or you could walk the sg list as it is? Just if you hit
> >>>>>>> a too long chunk you need to copy a trimmed version over and know where
> >>>>>>> to continue for the next row. If doable it would be better than having
> >>>>>>> to kvmalloc_array.
> >>>>>>
> >>>>>> I think Chris suggested just using i915_gem_object_get_dma_address()
> >>>>>> here. But I'm not sure why we're not using it for rotate_pages()
> >>>>>> as well.
> >>>>>
> >>>>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
> >>>>> release point. I say the mempressure and shrinker should to the right
> >>>>> thing, but it's a big if.
> >>>>
> >>>> OK.
> >>>>
> >>>> Well, looks to me like i915_gem_object_get_dma_address() is the
> >>>> only convenient looking thing for iterating the pages without
> >>>> arowning the code in irrelevant details about sgs and whatnot.
> >>>> I suppose it should be possible to write some helpers that avoid
> >>>> all that and don't need the temp array, but I'm not really
> >>>> motivated enough to do that myself.
> >>>
> >>> Keep it simple and use get_dma_address(). We can find ways to throw away
> >>> the cache later if need be.
> >>
> >> I'd do it straight away. I think cache for a large framebuffer, the kind
> >> which needs remapping could be quite big! Even the more fragmented
> >> memory the bigger the cache, and so if it sticks around pointlessly for
> >> the lifetime of the framebuffer it is a double whammy.
> > 
> > The tree is indexed with the ggtt offset so memory fragmentation
> > shouldn't matter I think. Or did I totally miss something?
> 
> I think it is indexed by page index, but the number of tree entries 
> depends on the number of sg chunks, which in turn depends on the system 
> memory fragmentation. Consecutive pages are stored as "exceptional" 
> entries so that is more efficient IIRC,

Didn't look any more efficient to me. It's still one page per slot
AFAICS. The exceptional entry just tells you where to find the start
of the current sg chunk.

> but TBH I don't remember how 
> many of those it can store before it needs a new tree node. Anyways, if 
> I am not completely mistaken then the tree metadata size is proportional 
> to backing store fragmentation, and secondary proportional to object 
> size. (Due exceptional entry spill.)
> 
> > The relative overhead should be highest for a single page object
> > (576/4096 = ~15%). For big objects it should be something around
> > .2% AFAICS.
> 
> It is a bit annoying since radix tree does not have a helper to tell us 
> the number of allocated nodes. I don't remember how I measure this last 
> time.. Will try to recall.

I wrote myself a helper to count the nodes and that seems to
agree with my back of the envelope calculations.
Tvrtko Ursulin Oct. 10, 2018, 7:04 a.m. UTC | #13
On 09/10/2018 12:54, Ville Syrjälä wrote:
> On Tue, Oct 09, 2018 at 09:24:33AM +0100, Tvrtko Ursulin wrote:
>>
>> On 05/10/2018 19:42, Ville Syrjälä wrote:
>>> On Mon, Oct 01, 2018 at 04:48:21PM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 01/10/2018 16:37, Chris Wilson wrote:
>>>>> Quoting Ville Syrjälä (2018-10-01 16:27:43)
>>>>>> On Mon, Oct 01, 2018 at 04:12:09PM +0100, Chris Wilson wrote:
>>>>>>> Quoting Ville Syrjälä (2018-10-01 16:03:30)
>>>>>>>> On Wed, Sep 26, 2018 at 08:50:25AM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> On 25/09/2018 20:37, Ville Syrjala wrote:
>>>>>>>>> One more thing, do you really need random access for this
>>>>>>>>> transformation? Or you could walk the sg list as it is? Just if you hit
>>>>>>>>> a too long chunk you need to copy a trimmed version over and know where
>>>>>>>>> to continue for the next row. If doable it would be better than having
>>>>>>>>> to kvmalloc_array.
>>>>>>>>
>>>>>>>> I think Chris suggested just using i915_gem_object_get_dma_address()
>>>>>>>> here. But I'm not sure why we're not using it for rotate_pages()
>>>>>>>> as well.
>>>>>>>
>>>>>>> Tvrtko is opposed to populating the obj->mm.pages cache with no defined
>>>>>>> release point. I say the mempressure and shrinker should to the right
>>>>>>> thing, but it's a big if.
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>> Well, looks to me like i915_gem_object_get_dma_address() is the
>>>>>> only convenient looking thing for iterating the pages without
>>>>>> arowning the code in irrelevant details about sgs and whatnot.
>>>>>> I suppose it should be possible to write some helpers that avoid
>>>>>> all that and don't need the temp array, but I'm not really
>>>>>> motivated enough to do that myself.
>>>>>
>>>>> Keep it simple and use get_dma_address(). We can find ways to throw away
>>>>> the cache later if need be.
>>>>
>>>> I'd do it straight away. I think cache for a large framebuffer, the kind
>>>> which needs remapping could be quite big! Even the more fragmented
>>>> memory the bigger the cache, and so if it sticks around pointlessly for
>>>> the lifetime of the framebuffer it is a double whammy.
>>>
>>> The tree is indexed with the ggtt offset so memory fragmentation
>>> shouldn't matter I think. Or did I totally miss something?
>>
>> I think it is indexed by page index, but the number of tree entries
>> depends on the number of sg chunks, which in turn depends on the system
>> memory fragmentation. Consecutive pages are stored as "exceptional"
>> entries so that is more efficient IIRC,
> 
> Didn't look any more efficient to me. It's still one page per slot
> AFAICS. The exceptional entry just tells you where to find the start
> of the current sg chunk.
> 
>> but TBH I don't remember how
>> many of those it can store before it needs a new tree node. Anyways, if
>> I am not completely mistaken then the tree metadata size is proportional
>> to backing store fragmentation, and secondary proportional to object
>> size. (Due exceptional entry spill.)
>>
>>> The relative overhead should be highest for a single page object
>>> (576/4096 = ~15%). For big objects it should be something around
>>> .2% AFAICS.
>>
>> It is a bit annoying since radix tree does not have a helper to tell us
>> the number of allocated nodes. I don't remember how I measure this last
>> time.. Will try to recall.
> 
> I wrote myself a helper to count the nodes and that seems to
> agree with my back of the envelope calculations.

How many sg chunks, and hence radix nodes, did that show for some 
typical framebuffer? Or maybe not typical, but the kind which will 
require remapping. Did you measure only after a fresh boot, or also on 
a somewhat memory-constrained system (some memory pressure should mean 
more fragmentation)?

If even that doesn't show the cache size as significant, then I guess 
my worry that large framebuffers would be fragmented enough to make it 
much bigger was unfounded, so feel free to go with it.

Regards,

Tvrtko

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index b4744a68cd88..edb9f6f3c0cf 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -196,6 +196,18 @@  describe_obj(struct seq_file *m, struct drm_i915_gem_object *obj)
 					   vma->ggtt_view.rotated.plane[1].offset);
 				break;
 
+			case I915_GGTT_VIEW_REMAPPED:
+				seq_printf(m, ", remapped [(%ux%u, stride=%u, offset=%u), (%ux%u, stride=%u, offset=%u)]",
+					   vma->ggtt_view.remapped.plane[0].width,
+					   vma->ggtt_view.remapped.plane[0].height,
+					   vma->ggtt_view.remapped.plane[0].stride,
+					   vma->ggtt_view.remapped.plane[0].offset,
+					   vma->ggtt_view.remapped.plane[1].width,
+					   vma->ggtt_view.remapped.plane[1].height,
+					   vma->ggtt_view.remapped.plane[1].stride,
+					   vma->ggtt_view.remapped.plane[1].offset);
+				break;
+
 			default:
 				MISSING_CASE(vma->ggtt_view.type);
 				break;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index f6c7ab413081..e4d49c345c81 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -3798,6 +3798,92 @@  intel_rotate_pages(struct intel_rotation_info *rot_info,
 	return ERR_PTR(ret);
 }
 
+static struct scatterlist *
+remap_pages(const dma_addr_t *in, unsigned int offset,
+	    unsigned int width, unsigned int height,
+	    unsigned int stride,
+	    struct sg_table *st, struct scatterlist *sg)
+{
+	unsigned int column, row;
+
+	for (row = 0; row < height; row++) {
+		for (column = 0; column < width; column++) {
+			st->nents++;
+			/* We don't need the pages, but need to initialize
+			 * the entries so the sg list can be happily traversed.
+			 * The only thing we need are DMA addresses.
+			 */
+			sg_set_page(sg, NULL, I915_GTT_PAGE_SIZE, 0);
+			sg_dma_address(sg) = in[offset + column];
+			sg_dma_len(sg) = I915_GTT_PAGE_SIZE;
+			sg = sg_next(sg);
+		}
+		offset += stride;
+	}
+
+	return sg;
+}
+
+static noinline struct sg_table *
+intel_remap_pages(struct intel_remapped_info *rem_info,
+		  struct drm_i915_gem_object *obj)
+{
+	const unsigned long n_pages = obj->base.size / I915_GTT_PAGE_SIZE;
+	unsigned int size = intel_remapped_info_size(rem_info);
+	struct sgt_iter sgt_iter;
+	dma_addr_t dma_addr;
+	unsigned long i;
+	dma_addr_t *page_addr_list;
+	struct sg_table *st;
+	struct scatterlist *sg;
+	int ret = -ENOMEM;
+
+	/* Allocate a temporary list of source pages for random access. */
+	page_addr_list = kvmalloc_array(n_pages,
+					sizeof(dma_addr_t),
+					GFP_KERNEL);
+	if (!page_addr_list)
+		return ERR_PTR(ret);
+
+	/* Allocate target SG list. */
+	st = kmalloc(sizeof(*st), GFP_KERNEL);
+	if (!st)
+		goto err_st_alloc;
+
+	ret = sg_alloc_table(st, size, GFP_KERNEL);
+	if (ret)
+		goto err_sg_alloc;
+
+	/* Populate source page list from the object. */
+	i = 0;
+	for_each_sgt_dma(dma_addr, sgt_iter, obj->mm.pages)
+		page_addr_list[i++] = dma_addr;
+
+	GEM_BUG_ON(i != n_pages);
+	st->nents = 0;
+	sg = st->sgl;
+
+	for (i = 0 ; i < ARRAY_SIZE(rem_info->plane); i++) {
+		sg = remap_pages(page_addr_list, rem_info->plane[i].offset,
+				 rem_info->plane[i].width, rem_info->plane[i].height,
+				 rem_info->plane[i].stride, st, sg);
+	}
+
+	kvfree(page_addr_list);
+
+	return st;
+
+err_sg_alloc:
+	kfree(st);
+err_st_alloc:
+	kvfree(page_addr_list);
+
+	DRM_DEBUG_DRIVER("Failed to create remapped mapping for object size %zu! (%ux%u tiles, %u pages)\n",
+			 obj->base.size, rem_info->plane[0].width, rem_info->plane[0].height, size);
+
+	return ERR_PTR(ret);
+}
+
 static noinline struct sg_table *
 intel_partial_pages(const struct i915_ggtt_view *view,
 		    struct drm_i915_gem_object *obj)
@@ -3874,6 +3960,11 @@  i915_get_ggtt_vma_pages(struct i915_vma *vma)
 			intel_rotate_pages(&vma->ggtt_view.rotated, vma->obj);
 		break;
 
+	case I915_GGTT_VIEW_REMAPPED:
+		vma->pages =
+			intel_remap_pages(&vma->ggtt_view.remapped, vma->obj);
+		break;
+
 	case I915_GGTT_VIEW_PARTIAL:
 		vma->pages = intel_partial_pages(&vma->ggtt_view, vma->obj);
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 7e2af5f4f39b..69a22f57e6ca 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -160,11 +160,18 @@  typedef u64 gen8_ppgtt_pml4e_t;
 
 struct sg_table;
 
+struct intel_remapped_plane_info {
+	/* in gtt pages */
+	unsigned int width, height, stride, offset;
+} __packed;
+
+struct intel_remapped_info {
+	struct intel_remapped_plane_info plane[2];
+	unsigned int unused_mbz;
+} __packed;
+
 struct intel_rotation_info {
-	struct intel_rotation_plane_info {
-		/* tiles */
-		unsigned int width, height, stride, offset;
-	} plane[2];
+	struct intel_remapped_plane_info plane[2];
 } __packed;
 
 struct intel_partial_info {
@@ -176,12 +183,20 @@  enum i915_ggtt_view_type {
 	I915_GGTT_VIEW_NORMAL = 0,
 	I915_GGTT_VIEW_ROTATED = sizeof(struct intel_rotation_info),
 	I915_GGTT_VIEW_PARTIAL = sizeof(struct intel_partial_info),
+	I915_GGTT_VIEW_REMAPPED = sizeof(struct intel_remapped_info),
 };
 
 static inline void assert_i915_gem_gtt_types(void)
 {
 	BUILD_BUG_ON(sizeof(struct intel_rotation_info) != 8*sizeof(unsigned int));
 	BUILD_BUG_ON(sizeof(struct intel_partial_info) != sizeof(u64) + sizeof(unsigned int));
+	BUILD_BUG_ON(sizeof(struct intel_remapped_info) != 9*sizeof(unsigned int));
+
+	/* Check that rotation/remapped shares offsets for simplicity */
+	BUILD_BUG_ON(offsetof(struct intel_remapped_info, plane[0]) !=
+		     offsetof(struct intel_rotation_info, plane[0]));
+	BUILD_BUG_ON(offsetofend(struct intel_remapped_info, plane[1]) !=
+		     offsetofend(struct intel_rotation_info, plane[1]));
 
 	/* As we encode the size of each branch inside the union into its type,
 	 * we have to be careful that each branch has a unique size.
@@ -190,6 +205,7 @@  static inline void assert_i915_gem_gtt_types(void)
 	case I915_GGTT_VIEW_NORMAL:
 	case I915_GGTT_VIEW_PARTIAL:
 	case I915_GGTT_VIEW_ROTATED:
+	case I915_GGTT_VIEW_REMAPPED:
 		/* gcc complains if these are identical cases */
 		break;
 	}
@@ -201,6 +217,7 @@  struct i915_ggtt_view {
 		/* Members need to contain no holes/padding */
 		struct intel_partial_info partial;
 		struct intel_rotation_info rotated;
+		struct intel_remapped_info remapped;
 	};
 };
 
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 31efc971a3a8..1793c702ab94 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -164,6 +164,9 @@  vma_create(struct drm_i915_gem_object *obj,
 		} else if (view->type == I915_GGTT_VIEW_ROTATED) {
 			vma->size = intel_rotation_info_size(&view->rotated);
 			vma->size <<= PAGE_SHIFT;
+		} else if (view->type == I915_GGTT_VIEW_REMAPPED) {
+			vma->size = intel_remapped_info_size(&view->remapped);
+			vma->size <<= PAGE_SHIFT;
 		}
 	}
 
@@ -464,7 +467,8 @@  void __i915_vma_set_map_and_fenceable(struct i915_vma *vma)
 	 * Explicitly disable for rotated VMA since the display does not
 	 * need the fence and the VMA is not accessible to other users.
 	 */
-	if (vma->ggtt_view.type == I915_GGTT_VIEW_ROTATED)
+	if (vma->ggtt_view.type == I915_GGTT_VIEW_ROTATED ||
+	    vma->ggtt_view.type == I915_GGTT_VIEW_REMAPPED)
 		return;
 
 	fenceable = (vma->node.size >= vma->fence_size &&
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
index 4f7c1c7599f4..64cf029c028a 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -265,8 +265,11 @@  i915_vma_compare(struct i915_vma *vma,
 	 */
 	BUILD_BUG_ON(I915_GGTT_VIEW_NORMAL >= I915_GGTT_VIEW_PARTIAL);
 	BUILD_BUG_ON(I915_GGTT_VIEW_PARTIAL >= I915_GGTT_VIEW_ROTATED);
+	BUILD_BUG_ON(I915_GGTT_VIEW_ROTATED >= I915_GGTT_VIEW_REMAPPED);
 	BUILD_BUG_ON(offsetof(typeof(*view), rotated) !=
 		     offsetof(typeof(*view), partial));
+	BUILD_BUG_ON(offsetof(typeof(*view), rotated) !=
+		     offsetof(typeof(*view), remapped));
 	return memcmp(&vma->ggtt_view.partial, &view->partial, view->type);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 1fe5cf3ed062..4f9f519ccd46 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -2006,6 +2006,17 @@  unsigned int intel_rotation_info_size(const struct intel_rotation_info *rot_info
 	return size;
 }
 
+unsigned int intel_remapped_info_size(const struct intel_remapped_info *rem_info)
+{
+	unsigned int size = 0;
+	int i;
+
+	for (i = 0 ; i < ARRAY_SIZE(rem_info->plane); i++)
+		size += rem_info->plane[i].width * rem_info->plane[i].height;
+
+	return size;
+}
+
 static void
 intel_fill_fb_ggtt_view(struct i915_ggtt_view *view,
 			const struct drm_framebuffer *fb,
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index 4fcba99d12c0..a88a16b74f24 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -1503,6 +1503,7 @@  unsigned int intel_fb_xy_to_linear(int x, int y,
 void intel_add_fb_offsets(int *x, int *y,
 			  const struct intel_plane_state *state, int plane);
 unsigned int intel_rotation_info_size(const struct intel_rotation_info *rot_info);
+unsigned int intel_remapped_info_size(const struct intel_remapped_info *rem_info);
 bool intel_has_pending_fb_unpin(struct drm_i915_private *dev_priv);
 void intel_mark_busy(struct drm_i915_private *dev_priv);
 void intel_mark_idle(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/selftests/i915_vma.c b/drivers/gpu/drm/i915/selftests/i915_vma.c
index ffa74290e054..4fc49c27f13c 100644
--- a/drivers/gpu/drm/i915/selftests/i915_vma.c
+++ b/drivers/gpu/drm/i915/selftests/i915_vma.c
@@ -395,8 +395,8 @@  assert_rotated(struct drm_i915_gem_object *obj,
 	return sg;
 }
 
-static unsigned int rotated_size(const struct intel_rotation_plane_info *a,
-				 const struct intel_rotation_plane_info *b)
+static unsigned int rotated_size(const struct intel_remapped_plane_info *a,
+				 const struct intel_remapped_plane_info *b)
 {
 	return a->width * a->height + b->width * b->height;
 }
@@ -406,7 +406,7 @@  static int igt_vma_rotate(void *arg)
 	struct drm_i915_private *i915 = arg;
 	struct i915_address_space *vm = &i915->ggtt.vm;
 	struct drm_i915_gem_object *obj;
-	const struct intel_rotation_plane_info planes[] = {
+	const struct intel_remapped_plane_info planes[] = {
 		{ .width = 1, .height = 1, .stride = 1 },
 		{ .width = 2, .height = 2, .stride = 2 },
 		{ .width = 4, .height = 4, .stride = 4 },