[v3,4/7] drm/i915: Cache LRC state page in the context

Message ID 1452599771-36130-1-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin Jan. 12, 2016, 11:56 a.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

LRC lifetime is well defined so we can cache the page pointing
to the object backing store in the context in order to avoid
walking over the object SG page list from the interrupt context
without the big lock held.

v2: Also cache the mapping. (Chris Wilson)
v3: Unmap on the error path.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.h  |  2 ++
 drivers/gpu/drm/i915/intel_lrc.c | 34 +++++++++++++++++++++++-----------
 2 files changed, 25 insertions(+), 11 deletions(-)

Comments

Chris Wilson Jan. 12, 2016, 12:12 p.m. UTC | #1
On Tue, Jan 12, 2016 at 11:56:11AM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> LRC lifetime is well defined so we can cache the page pointing
> to the object backing store in the context in order to avoid
> walking over the object SG page list from the interrupt context
> without the big lock held.
> 
> v2: Also cache the mapping. (Chris Wilson)
> v3: Unmap on the error path.

Then we only use the lrc_state_page in the unmapping, hardly worth the
cost of saving it.

The reg_state is better associated with the ring (since it basically
contains the analog of the RING_HEAD and friends).
-Chris
Tvrtko Ursulin Jan. 12, 2016, 12:54 p.m. UTC | #2
On 12/01/16 12:12, Chris Wilson wrote:
> On Tue, Jan 12, 2016 at 11:56:11AM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> LRC lifetime is well defined so we can cache the page pointing
>> to the object backing store in the context in order to avoid
>> walking over the object SG page list from the interrupt context
>> without the big lock held.
>>
>> v2: Also cache the mapping. (Chris Wilson)
>> v3: Unmap on the error path.
>
> Then we only use the lrc_state_page in the unmapping, hardly worth the
> cost of saving it.

Ok.

Do you also know if this would now require any flushing or something if 
previously kunmap_atomic might have done something under the covers?

> The reg_state is better associated with the ring (since it basically
> contains the analog of the RING_HEAD and friends).

Hm, not sure. The page belongs to the object from that anonymous struct 
in intel_context so I think it is best to keep them together.

Regards,

Tvrtko
Chris Wilson Jan. 12, 2016, 1:11 p.m. UTC | #3
On Tue, Jan 12, 2016 at 12:54:25PM +0000, Tvrtko Ursulin wrote:
> 
> On 12/01/16 12:12, Chris Wilson wrote:
> >On Tue, Jan 12, 2016 at 11:56:11AM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>LRC lifetime is well defined so we can cache the page pointing
> >>to the object backing store in the context in order to avoid
> >>walking over the object SG page list from the interrupt context
> >>without the big lock held.
> >>
> >>v2: Also cache the mapping. (Chris Wilson)
> >>v3: Unmap on the error path.
> >
> >Then we only use the lrc_state_page in the unmapping, hardly worth the
> >cost of saving it.
> 
> Ok.
> 
> Do you also know if this would now require any flushing or something
> if previously kunmap_atomic might have done something under the
> covers?

kmap_atomic only changes the PTE and the unmap would flush the TLB. In
terms of our pointer access, using kmap/kmap_atomic is equivalent. (Just
kmap_atomic is meant to be cheaper to set up and a limited resource
which can only be used without preemption.)
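
[Editorial sketch, not part of the original thread: the two mapping patterns being contrasted above. Identifiers such as `page` and `tail` are illustrative placeholders.]

```c
/* kmap_atomic: transient mapping; must be used and torn down within a
 * non-preemptible section, so the pointer cannot be cached for later use.
 */
u32 *reg_state = kmap_atomic(page);
reg_state[CTX_RING_TAIL + 1] = tail;	/* use immediately */
kunmap_atomic(reg_state);		/* unmap flushes the TLB */

/* kmap: persistent mapping; may sleep, so it is set up once at pin time,
 * the pointer cached, dereferenced later (including from interrupt
 * context), and released with kunmap() at unpin time.
 */
u32 *cached = kmap(page);
/* ... cached remains valid until ... */
kunmap(page);
```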

> >The reg_state is better associated with the ring (since it basically
> >contains the analog of the RING_HEAD and friends).
> 
> Hm, not sure. The page belongs to the object from that anonymous
> struct in intel_context so I think it is best to keep them together.

The page does sure, but the state does not.

The result is that we get:

ring->registers[CTX_RING_BUFFER_START+1] = ring->vma->node.start;
ASSIGN_CTX_PDP(ppgtt, ring->registers, 3);
ASSIGN_CTX_PDP(ppgtt, ring->registers, 2);
ASSIGN_CTX_PDP(ppgtt, ring->registers, 1);
ASSIGN_CTX_PDP(ppgtt, ring->registers, 0);
ring->registers[CTX_RING_TAIL+1] = req->tail;

which makes a lot more sense, to me, when viewed against the underlying
architecture of the hardware and comparing against the legacy ringbuffer.
-Chris
Dave Gordon Jan. 12, 2016, 4:45 p.m. UTC | #4
On 12/01/16 13:11, Chris Wilson wrote:
> On Tue, Jan 12, 2016 at 12:54:25PM +0000, Tvrtko Ursulin wrote:
>>
>> On 12/01/16 12:12, Chris Wilson wrote:
>>> On Tue, Jan 12, 2016 at 11:56:11AM +0000, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> LRC lifetime is well defined so we can cache the page pointing
>>>> to the object backing store in the context in order to avoid
>>>> walking over the object SG page list from the interrupt context
>>>> without the big lock held.
>>>>
>>>> v2: Also cache the mapping. (Chris Wilson)
>>>> v3: Unmap on the error path.
>>>
>>> Then we only use the lrc_state_page in the unmapping, hardly worth the
>>> cost of saving it.
>>
>> Ok.
>>
>> Do you also know if this would now require any flushing or something
>> if previously kunmap_atomic might have done something under the
>> covers?
>
> kmap_atomic only changes the PTE and the unmap would flush the TLB. In
> terms of our pointer access, using kmap/kmap_atomic is equivalent. (Just
> kmap_atomic is meant to be cheaper to set up and a limited resource
> which can only be used without preemption.)
>
>>> The reg_state is better associated with the ring (since it basically
>>> contains the analog of the RING_HEAD and friends).
>>
>> Hm, not sure. The page belongs to the object from that anonymous
>> struct in intel_context so I think it is best to keep them together.
>
> The page does sure, but the state does not.
>
> The result is that we get:
>
> ring->registers[CTX_RING_BUFFER_START+1] = ring->vma->node.start;
> ASSIGN_CTX_PDP(ppgtt, ring->registers, 3);
> ASSIGN_CTX_PDP(ppgtt, ring->registers, 2);
> ASSIGN_CTX_PDP(ppgtt, ring->registers, 1);
> ASSIGN_CTX_PDP(ppgtt, ring->registers, 0);
> ring->registers[CTX_RING_TAIL+1] = req->tail;
>
> which makes a lot more sense, to me, when viewed against the underlying
> architecture of the hardware and comparing against the legacy ringbuffer.
> -Chris

When you say "ring", do you mean "engine" or "ringbuffer"?

The register state can't be associated with the engine /per se/, because 
it has to be per-context as well as per-engine. It doesn't really belong 
with the ringbuffer; in particular I've seen code for allocating the 
ringbuffer in stolen memory and dropping it during hibernation, whereas 
this register state shouldn't be lost across hibernation. So the 
lifetime of a register state image and a ringbuffer are different, 
therefore they don't belong together.

The register state is pretty much the /definition/ of a context, in 
hardware terms. OK, it's got an HWSP and other bits, but the register 
save area is what the h/w really cares about. So it makes sense that the 
kmapping for that area is also part of (the CPU's idea of) the context. Yes,

	ctx.engine[engine_id].registers[regno] = ...

is a bit clumsy, but I'd expect to cache the register-image pointer in a 
local here, along the lines of:

	uint32_t *registers = ctx.engine[engine_id].registers;
	registers[CTX_RING_TAIL+1] = req->tail;

etc.

[aside]
Some of this would be much less clumsy if the anonymous engine[]
struct had a name, say, "engine_info", so we could just do
	struct engine_info *eip = &ctx.engine[engine->id];
once at the beginning of any per-engine function and then use
eip->ringbuf, eip->state, eip->registers, etc without continually 
repeating the 'ctx.' and 'engine->id' bits over and over ...
[/aside]
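
[Editorial sketch of the cleanup Dave's aside suggests, not code from the thread: naming the anonymous per-engine struct lets call sites take one pointer up front. The fields follow the patch below; the name "engine_info" is Dave's suggestion, not an existing identifier.]

```c
/* Hypothetical named version of the anonymous engine[] struct
 * inside intel_context (fields as in the patch under review). */
struct engine_info {
	struct drm_i915_gem_object *state;
	struct intel_ringbuffer *ringbuf;
	int pin_count;
	struct i915_vma *lrc_vma;
	u64 lrc_desc;
	struct page *lrc_state_page;
	uint32_t *lrc_reg_state;
};

/* At the top of any per-engine function: */
struct engine_info *eip = &ctx->engine[ring->id];
eip->lrc_reg_state[CTX_RING_TAIL + 1] = req->tail;
```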

Apart from that, I think Tvrtko's patch has lost the dirtying from:

 > -	page = i915_gem_object_get_dirty_page(ctx_obj, LRC_STATE_PN);

in execlists_update_context(), so should add

	set_page_dirty(lrc_state_page)

instead (and that's the use for it, so you /do/ want to cache it).
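
[Editorial sketch of Dave's suggestion, not code from the thread: with the kmap cached at pin time, execlists_update_context() would re-dirty the state page on each update itself, rather than relying on i915_gem_object_get_dirty_page() having done so at pin time.]

```c
static int execlists_update_context(struct drm_i915_gem_request *rq)
{
	struct intel_engine_cs *ring = rq->ring;
	uint32_t *reg_state = rq->ctx->engine[ring->id].lrc_reg_state;

	reg_state[CTX_RING_TAIL+1] = rq->tail;
	/* ... PDP updates as in the patch ... */

	/* Re-dirty the page on every update, since the kmap is long-lived. */
	set_page_dirty(rq->ctx->engine[ring->id].lrc_state_page);

	return 0;
}
```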

.Dave.
Chris Wilson Jan. 13, 2016, 1:37 a.m. UTC | #5
On Tue, Jan 12, 2016 at 04:45:02PM +0000, Dave Gordon wrote:
> On 12/01/16 13:11, Chris Wilson wrote:
> >On Tue, Jan 12, 2016 at 12:54:25PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 12/01/16 12:12, Chris Wilson wrote:
> >>>On Tue, Jan 12, 2016 at 11:56:11AM +0000, Tvrtko Ursulin wrote:
> >>>>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>
> >>>>LRC lifetime is well defined so we can cache the page pointing
> >>>>to the object backing store in the context in order to avoid
> >>>>walking over the object SG page list from the interrupt context
> >>>>without the big lock held.
> >>>>
> >>>>v2: Also cache the mapping. (Chris Wilson)
> >>>>v3: Unmap on the error path.
> >>>
> >>>Then we only use the lrc_state_page in the unmapping, hardly worth the
> >>>cost of saving it.
> >>
> >>Ok.
> >>
> >>Do you also know if this would now require any flushing or something
> >>if previously kunmap_atomic might have done something under the
> >>covers?
> >
> >kmap_atomic only changes the PTE and the unmap would flush the TLB. In
> >terms of our pointer access, using kmap/kmap_atomic is equivalent. (Just
> >kmap_atomic is meant to be cheaper to set up and a limited resource
> >which can only be used without preemption.)
> >
> >>>The reg_state is better associated with the ring (since it basically
> >>>contains the analog of the RING_HEAD and friends).
> >>
> >>Hm, not sure. The page belongs to the object from that anonymous
> >>struct in intel_context so I think it is best to keep them together.
> >
> >The page does sure, but the state does not.
> >
> >The result is that we get:
> >
> >ring->registers[CTX_RING_BUFFER_START+1] = ring->vma->node.start;
> >ASSIGN_CTX_PDP(ppgtt, ring->registers, 3);
> >ASSIGN_CTX_PDP(ppgtt, ring->registers, 2);
> >ASSIGN_CTX_PDP(ppgtt, ring->registers, 1);
> >ASSIGN_CTX_PDP(ppgtt, ring->registers, 0);
> >ring->registers[CTX_RING_TAIL+1] = req->tail;
> >
> >which makes a lot more sense, to me, when viewed against the underlying
> >architecture of the hardware and comparing against the legacy ringbuffer.
> >-Chris
> 
> When you say "ring", do you mean "engine" or "ringbuffer"?

The ring; the buffer object, its ggtt assignment, its
virtual address, the cpu head/tail, its registers - pretty much
everything we update and use between pinning the context for a request
and submitting that request to the engine (via any intermediate). I also
included the ctx_descriptor on the ring for example.
 
> The register state can't be associated with the engine /per se/,
> because it has to be per-context as well as per-engine. It doesn't
> really belong with the ringbuffer; in particular I've seen code for
> allocating the ringbuffer in stolen memory and dropping it during
> hibernation, whereas this register state shouldn't be lost across
> hibernation. So the lifetime of a register state image and a
> ringbuffer are different, therefore they don't belong together.

Indeed, we are not talking about the backing storage for the GPU ring
buffer. Just the register kmap and associated transient state.

struct intel_ring, if you like, is the part of struct intel_context that
we use to build individual requests.

> The register state is pretty much the /definition/ of a context, in
> hardware terms. OK, it's got an HWSP and other bits, but the
> register save area is what the h/w really cares about. So it makes
> sense that the kmapping for that area is also part of (the CPU's
> idea of) the context.

The ring is part of the context, too. The entire view of a driver for a
particular process/state is the context.  The logical state for the GPU
is just a part of that.

> Yes,
> 
> 	ctx.engine[engine_id].registers[regno] = ...
> 
> is a bit clumsy, but I'd expect to cache the register-image pointer
> in a local here, along the lines of:
> 
> 	uint32_t *registers = ctx.engine[engine_id].registers;
> 	registers[CTX_RING_TAIL+1] = req->tail;

I liked ring->registers precisely because I felt it was more descriptive
here:

req->ring->registers[CTX_RING_TAIL+1] = req->tail; where req->ring is the
intel_ring and not the intel_engine_cs.

So instead of adding more to struct intel_context_engine that is only to
be used for the request construction, I have been putting that same
information into struct intel_ring, to keep the two methods for
constructing requests using the same structs in a similar fashion and
looking roughly the same and sharing more code.

> etc.
> 
> [aside]
> Some of this would be much less clumsy if the anonymous engine[]
> struct had a name, say, "engine_info", so we could just do

Many times I have said the same thing and tried to get it changed.
-Chris
Tvrtko Ursulin Jan. 13, 2016, 1:49 p.m. UTC | #6
On 12/01/16 16:45, Dave Gordon wrote:
> On 12/01/16 13:11, Chris Wilson wrote:
>> On Tue, Jan 12, 2016 at 12:54:25PM +0000, Tvrtko Ursulin wrote:
>>>
>>> On 12/01/16 12:12, Chris Wilson wrote:
>>>> On Tue, Jan 12, 2016 at 11:56:11AM +0000, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>
>>>>> LRC lifetime is well defined so we can cache the page pointing
>>>>> to the object backing store in the context in order to avoid
>>>>> walking over the object SG page list from the interrupt context
>>>>> without the big lock held.
>>>>>
>>>>> v2: Also cache the mapping. (Chris Wilson)
>>>>> v3: Unmap on the error path.
>>>>
>>>> Then we only use the lrc_state_page in the unmapping, hardly worth the
>>>> cost of saving it.
>>>
>>> Ok.
>>>
>>> Do you also know if this would now require any flushing or something
>>> if previously kunmap_atomic might have done something under the
>>> covers?
>>
>> kmap_atomic only changes the PTE and the unmap would flush the TLB. In
>> terms of our pointer access, using kmap/kmap_atomic is equivalent. (Just
>> kmap_atomic is meant to be cheaper to set up and a limited resource
>> which can only be used without preemption.)
>>
>>>> The reg_state is better associated with the ring (since it basically
>>>> contains the analog of the RING_HEAD and friends).
>>>
>>> Hm, not sure. The page belongs to the object from that anonymous
>>> struct in intel_context so I think it is best to keep them together.
>>
>> The page does sure, but the state does not.
>>
>> The result is that we get:
>>
>> ring->registers[CTX_RING_BUFFER_START+1] = ring->vma->node.start;
>> ASSIGN_CTX_PDP(ppgtt, ring->registers, 3);
>> ASSIGN_CTX_PDP(ppgtt, ring->registers, 2);
>> ASSIGN_CTX_PDP(ppgtt, ring->registers, 1);
>> ASSIGN_CTX_PDP(ppgtt, ring->registers, 0);
>> ring->registers[CTX_RING_TAIL+1] = req->tail;
>>
>> which makes a lot more sense, to me, when viewed against the underlying
>> architecture of the hardware and comparing against the legacy ringbuffer.
>> -Chris
>
> When you say "ring", do you mean "engine" or "ringbuffer"?
>
> The register state can't be associated with the engine /per se/, because
> it has to be per-context as well as per-engine. It doesn't really belong
> with the ringbuffer; in particular I've seen code for allocating the
> ringbuffer in stolen memory and dropping it during hibernation, whereas
> this register state shouldn't be lost across hibernation. So the
> lifetime of a register state image and a ringbuffer are different,
> therefore they don't belong together.
>
> The register state is pretty much the /definition/ of a context, in
> hardware terms. OK, it's got an HWSP and other bits, but the register
> save area is what the h/w really cares about. So it makes sense that the
> kmapping for that area is also part of (the CPU's idea of) the context.
> Yes,
>
>      ctx.engine[engine_id].registers[regno] = ...
>
> is a bit clumsy, but I'd expect to cache the register-image pointer in a
> local here, along the lines of:
>
>      uint32_t *registers = ctx.engine[engine_id].registers;
>      registers[CTX_RING_TAIL+1] = req->tail;
>
> etc.
>
> [aside]
> Some of this would be much less clumsy if the anonymous engine[]
> struct had a name, say, "engine_info", so we could just do
>      struct engine_info *eip = &ctx.engine[engine->id];
> once at the beginning of any per-engine function and then use
> eip->ringbuf, eip->state, eip->registers, etc without continually
> repeating the 'ctx.' and 'engine->id' bits over and over ...
> [/aside]
>
> Apart from that, I think Tvrtko's patch has lost the dirtying from:
>
>  > -    page = i915_gem_object_get_dirty_page(ctx_obj, LRC_STATE_PN);
>
> in execlists_update_context(), so should add
>
>      set_page_dirty(lrc_state_page)
>
> instead (and that's the use for it, so you /do/ want to cache it).

It is still there (i915_gem_object_get_dirty_page) but at the point of 
pinning, not each ctx update. Do you think that is a problem? I thought 
no one can flush and clear the dirty page while the ctx is pinned.

On the more general level, can we agree to let this fix in with the pointer 
cached where it currently is (in the context), and leave the anonymous 
struct - which we all dislike, and which I'd add should probably be 
defined in intel_lrc.h - for a later cleanup?

If so I only need to respin with the struct page * caching removed.

Regards,

Tvrtko
Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 79bb3671a15e..0c6a274a2150 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -885,6 +885,8 @@  struct intel_context {
 		int pin_count;
 		struct i915_vma *lrc_vma;
 		u64 lrc_desc;
+		struct page *lrc_state_page;
+		uint32_t *lrc_reg_state;
 	} engine[I915_NUM_RINGS];
 
 	struct list_head link;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 94314b344f38..213c4b789683 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -339,14 +339,7 @@  static int execlists_update_context(struct drm_i915_gem_request *rq)
 {
 	struct intel_engine_cs *ring = rq->ring;
 	struct i915_hw_ppgtt *ppgtt = rq->ctx->ppgtt;
-	struct drm_i915_gem_object *ctx_obj = rq->ctx->engine[ring->id].state;
-	struct page *page;
-	uint32_t *reg_state;
-
-	BUG_ON(!ctx_obj);
-
-	page = i915_gem_object_get_dirty_page(ctx_obj, LRC_STATE_PN);
-	reg_state = kmap_atomic(page);
+	uint32_t *reg_state = rq->ctx->engine[ring->id].lrc_reg_state;
 
 	reg_state[CTX_RING_TAIL+1] = rq->tail;
 	reg_state[CTX_RING_BUFFER_START+1] = rq->ringbuf->vma->node.start;
@@ -363,8 +356,6 @@  static int execlists_update_context(struct drm_i915_gem_request *rq)
 		ASSIGN_CTX_PDP(ppgtt, reg_state, 0);
 	}
 
-	kunmap_atomic(reg_state);
-
 	return 0;
 }
 
@@ -1050,6 +1041,8 @@  static int intel_lr_context_do_pin(struct intel_engine_cs *ring,
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *ctx_obj = ctx->engine[ring->id].state;
 	struct intel_ringbuffer *ringbuf = ctx->engine[ring->id].ringbuf;
+	struct page *lrc_state_page;
+	uint32_t *lrc_reg_state;
 	int ret;
 
 	WARN_ON(!mutex_is_locked(&ring->dev->struct_mutex));
@@ -1059,12 +1052,26 @@  static int intel_lr_context_do_pin(struct intel_engine_cs *ring,
 	if (ret)
 		return ret;
 
+	lrc_state_page = i915_gem_object_get_dirty_page(ctx_obj, LRC_STATE_PN);
+	if (WARN_ON(!lrc_state_page)) {
+		ret = -ENODEV;
+		goto unpin_ctx_obj;
+	}
+
+	lrc_reg_state = kmap(lrc_state_page);
+	if (!lrc_reg_state) {
+		ret = -ENOMEM;
+		goto unpin_ctx_obj;
+	}
+
 	ret = intel_pin_and_map_ringbuffer_obj(ring->dev, ringbuf);
 	if (ret)
-		goto unpin_ctx_obj;
+		goto unmap_state_page;
 
 	ctx->engine[ring->id].lrc_vma = i915_gem_obj_to_ggtt(ctx_obj);
 	intel_lr_context_descriptor_update(ctx, ring);
+	ctx->engine[ring->id].lrc_state_page = lrc_state_page;
+	ctx->engine[ring->id].lrc_reg_state = lrc_reg_state;
 	ctx_obj->dirty = true;
 
 	/* Invalidate GuC TLB. */
@@ -1073,6 +1080,8 @@  static int intel_lr_context_do_pin(struct intel_engine_cs *ring,
 
 	return ret;
 
+unmap_state_page:
+	kunmap(lrc_state_page);
 unpin_ctx_obj:
 	i915_gem_object_ggtt_unpin(ctx_obj);
 
@@ -1105,10 +1114,13 @@  void intel_lr_context_unpin(struct drm_i915_gem_request *rq)
 	if (ctx_obj) {
 		WARN_ON(!mutex_is_locked(&ring->dev->struct_mutex));
 		if (--rq->ctx->engine[ring->id].pin_count == 0) {
+			kunmap(rq->ctx->engine[ring->id].lrc_state_page);
 			intel_unpin_ringbuffer_obj(ringbuf);
 			i915_gem_object_ggtt_unpin(ctx_obj);
 			rq->ctx->engine[ring->id].lrc_vma = NULL;
 			rq->ctx->engine[ring->id].lrc_desc = 0;
+			rq->ctx->engine[ring->id].lrc_state_page = NULL;
+			rq->ctx->engine[ring->id].lrc_reg_state = NULL;
 		}
 	}
 }