[v8,01/69] drm/i915: Do not share hwsp across contexts any more, v7.

Message ID 20210311134249.588632-2-maarten.lankhorst@linux.intel.com (mailing list archive)
State New, archived
Series drm/i915: Remove obj->mm.lock!

Commit Message

Maarten Lankhorst March 11, 2021, 1:41 p.m. UTC
Instead of sharing pages with breadcrumbs, give each timeline a
single page. This allows unrelated timelines not to share locks
any more during command submission.

As an additional benefit, seqno wraparound no longer requires
i915_vma_pin, which means we no longer need to worry about a
potential -EDEADLK at a point where we are ready to submit.
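
(Illustration, not part of the patch: a minimal sketch of the new
wraparound path, condensed from __intel_timeline_get_seqno() below; the
function name here is invented. On a seqno wrap, the timeline simply
steps to the next 8-byte slot within its own, already-pinned page, so no
i915_vma_pin() -- and hence no -EDEADLK -- can occur at submission time.)

	/* sketch: step the timeline to its next seqno slot on wraparound */
	static void sketch_wrap_to_next_slot(struct intel_timeline *tl)
	{
		u32 next_ofs = offset_in_page(tl->hwsp_offset + TIMELINE_SEQNO_BYTES);

		/* w/a: bit 5 must be zero in a MI_FLUSH_DW post-sync address
		 * (simplified here, valid since TIMELINE_SEQNO_BYTES == 8) */
		if (next_ofs & BIT(5))
			next_ofs = offset_in_page(next_ofs + BIT(5));

		tl->hwsp_offset = i915_ggtt_offset(tl->hwsp_ggtt) + next_ofs;
		tl->hwsp_seqno = tl->hwsp_map + next_ofs;
		intel_timeline_reset_seqno(tl);
	}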

Changes since v1:
- Fix erroneous i915_vma_acquire that should be a i915_vma_release (ickle).
- Extra check for completion in intel_read_hwsp().
Changes since v2:
- Fix inconsistent indent in hwsp_alloc() (kbuild)
- memset entire cacheline to 0.
Changes since v3:
- Do same in intel_timeline_reset_seqno(), and clflush for good measure.
Changes since v4:
- Use refcounting on timeline, instead of relying on i915_active.
- Fix waiting on kernel requests.
Changes since v5:
- Bump amount of slots to maximum (256), for best wraparounds.
- Add hwsp_offset to i915_request to fix potential wraparound hang.
- Ensure timeline wrap test works with the changes.
- Assign hwsp in intel_timeline_read_hwsp() within the rcu lock to
  fix a hang.
Changes since v6:
- Rename i915_request_active_offset to i915_request_active_seqno(),
  and elaborate the function. (tvrtko)

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com> #v1
Reported-by: kernel test robot <lkp@intel.com>
---
 drivers/gpu/drm/i915/gt/gen2_engine_cs.c      |   2 +-
 drivers/gpu/drm/i915/gt/gen6_engine_cs.c      |   8 +-
 drivers/gpu/drm/i915/gt/gen8_engine_cs.c      |  13 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   1 +
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |   4 -
 drivers/gpu/drm/i915/gt/intel_timeline.c      | 422 ++++--------------
 .../gpu/drm/i915/gt/intel_timeline_types.h    |  17 +-
 drivers/gpu/drm/i915/gt/selftest_engine_cs.c  |   5 +-
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  83 ++--
 drivers/gpu/drm/i915/i915_request.c           |   4 -
 drivers/gpu/drm/i915/i915_request.h           |  31 +-
 11 files changed, 175 insertions(+), 415 deletions(-)
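
(Illustration, not part of the patch: a worked example, with made-up
addresses, of the address math done by the new i915_request_active_seqno()
at the bottom of this patch -- the page bits come from the timeline, the
slot bits from the request itself.)

	/* Suppose the timeline's HWSP page is bound at GGTT 0x145000, the
	 * timeline has wrapped once and moved on to the slot at offset 0x10,
	 * but this request was emitted to the slot at offset 0x8. */
	u32 tl_hwsp_offset = 0x145000 + 0x10;	/* tl->hwsp_offset */
	u32 rq_slot = 0x8;			/* offset_in_page(rq->hwsp_seqno) */

	u32 seqno_addr = page_mask_bits(tl_hwsp_offset) + rq_slot;
	/* = 0x145000 + 0x8 = 0x145008: this request's seqno slot, not the
	 * timeline's current slot at 0x145010 */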

Comments

Jason Ekstrand March 11, 2021, 9:22 p.m. UTC | #1
First off, I'm just here asking questions right now trying to start
getting my head around some of this stuff.  Feel free to ignore me or
tell me to go away if I'm being annoying. :-)

On Thu, Mar 11, 2021 at 7:49 AM Maarten Lankhorst
<maarten.lankhorst@linux.intel.com> wrote:
>
> Instead of sharing pages with breadcrumbs, give each timeline a
> single page. This allows unrelated timelines not to share locks
> any more during command submission.
>
> As an additional benefit, seqno wraparound no longer requires
> i915_vma_pin, which means we no longer need to worry about a
> potential -EDEADLK at a point where we are ready to submit.
>
> Changes since v1:
> - Fix erroneous i915_vma_acquire that should be a i915_vma_release (ickle).
> - Extra check for completion in intel_read_hwsp().
> Changes since v2:
> - Fix inconsistent indent in hwsp_alloc() (kbuild)
> - memset entire cacheline to 0.
> Changes since v3:
> - Do same in intel_timeline_reset_seqno(), and clflush for good measure.
> Changes since v4:
> - Use refcounting on timeline, instead of relying on i915_active.
> - Fix waiting on kernel requests.
> Changes since v5:
> - Bump amount of slots to maximum (256), for best wraparounds.
> - Add hwsp_offset to i915_request to fix potential wraparound hang.
> - Ensure timeline wrap test works with the changes.
> - Assign hwsp in intel_timeline_read_hwsp() within the rcu lock to
>   fix a hang.
> Changes since v6:
> - Rename i915_request_active_offset to i915_request_active_seqno(),
>   and elaborate the function. (tvrtko)
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com> #v1
> Reported-by: kernel test robot <lkp@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/gen2_engine_cs.c      |   2 +-
>  drivers/gpu/drm/i915/gt/gen6_engine_cs.c      |   8 +-
>  drivers/gpu/drm/i915/gt/gen8_engine_cs.c      |  13 +-
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   1 +
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |   4 -
>  drivers/gpu/drm/i915/gt/intel_timeline.c      | 422 ++++--------------
>  .../gpu/drm/i915/gt/intel_timeline_types.h    |  17 +-
>  drivers/gpu/drm/i915/gt/selftest_engine_cs.c  |   5 +-
>  drivers/gpu/drm/i915/gt/selftest_timeline.c   |  83 ++--
>  drivers/gpu/drm/i915/i915_request.c           |   4 -
>  drivers/gpu/drm/i915/i915_request.h           |  31 +-
>  11 files changed, 175 insertions(+), 415 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/gen2_engine_cs.c b/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
> index b491a64919c8..9646200d2792 100644
> --- a/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
> @@ -143,7 +143,7 @@ static u32 *__gen2_emit_breadcrumb(struct i915_request *rq, u32 *cs,
>                                    int flush, int post)
>  {
>         GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
> -       GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
> +       GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
>
>         *cs++ = MI_FLUSH;
>
> diff --git a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
> index ce38d1bcaba3..b388ceeeb1c9 100644
> --- a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
> @@ -161,7 +161,7 @@ u32 *gen6_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>                  PIPE_CONTROL_DC_FLUSH_ENABLE |
>                  PIPE_CONTROL_QW_WRITE |
>                  PIPE_CONTROL_CS_STALL);
> -       *cs++ = i915_request_active_timeline(rq)->hwsp_offset |
> +       *cs++ = i915_request_active_seqno(rq) |
>                 PIPE_CONTROL_GLOBAL_GTT;
>         *cs++ = rq->fence.seqno;
>
> @@ -359,7 +359,7 @@ u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>                  PIPE_CONTROL_QW_WRITE |
>                  PIPE_CONTROL_GLOBAL_GTT_IVB |
>                  PIPE_CONTROL_CS_STALL);
> -       *cs++ = i915_request_active_timeline(rq)->hwsp_offset;
> +       *cs++ = i915_request_active_seqno(rq);
>         *cs++ = rq->fence.seqno;
>
>         *cs++ = MI_USER_INTERRUPT;
> @@ -374,7 +374,7 @@ u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>  u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
>  {
>         GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
> -       GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
> +       GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
>
>         *cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
>         *cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
> @@ -394,7 +394,7 @@ u32 *gen7_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
>         int i;
>
>         GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
> -       GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
> +       GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
>
>         *cs++ = MI_FLUSH_DW | MI_INVALIDATE_TLB |
>                 MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
> diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
> index cac80af7ad1c..6b9c34d3ac8d 100644
> --- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
> @@ -338,15 +338,14 @@ static u32 preempt_address(struct intel_engine_cs *engine)
>
>  static u32 hwsp_offset(const struct i915_request *rq)
>  {
> -       const struct intel_timeline_cacheline *cl;
> +       const struct intel_timeline *tl;
>
> -       /* Before the request is executed, the timeline/cachline is fixed */
> +       /* Before the request is executed, the timeline is fixed */
> +       tl = rcu_dereference_protected(rq->timeline,
> +                                      !i915_request_signaled(rq));

Why is Gen8+ different from Gen2/6 here?  In particular, why not use
i915_request_active_timeline(rq) or, better yet,
i915_request_active_seqno()?  The primary difference I see is that the
guard on the RCU is different but it's not immediately obvious to me
why this should be different between hardware generations.  Also,
i915_request_active_seqno() returns a u32, but that could be fixed.

>
> -       cl = rcu_dereference_protected(rq->hwsp_cacheline, 1);
> -       if (cl)
> -               return cl->ggtt_offset;
> -
> -       return rcu_dereference_protected(rq->timeline, 1)->hwsp_offset;
> +       /* See the comment in i915_request_active_seqno(). */
> +       return page_mask_bits(tl->hwsp_offset) + offset_in_page(rq->hwsp_seqno);
>  }
>
>  int gen8_emit_init_breadcrumb(struct i915_request *rq)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index b4df275cba79..e6cefd00b4a1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -752,6 +752,7 @@ static int measure_breadcrumb_dw(struct intel_context *ce)
>         frame->rq.engine = engine;
>         frame->rq.context = ce;
>         rcu_assign_pointer(frame->rq.timeline, ce->timeline);
> +       frame->rq.hwsp_seqno = ce->timeline->hwsp_seqno;
>
>         frame->ring.vaddr = frame->cs;
>         frame->ring.size = sizeof(frame->cs);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> index 626af37c7790..3f6db8357309 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> @@ -45,10 +45,6 @@ struct intel_gt {
>         struct intel_gt_timelines {
>                 spinlock_t lock; /* protects active_list */
>                 struct list_head active_list;
> -
> -               /* Pack multiple timelines' seqnos into the same page */
> -               spinlock_t hwsp_lock;
> -               struct list_head hwsp_free_list;
>         } timelines;
>
>         struct intel_gt_requests {
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 491b8df174c2..efe2030cfe5e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -11,21 +11,9 @@
>  #include "intel_ring.h"
>  #include "intel_timeline.h"
>
> -#define ptr_set_bit(ptr, bit) ((typeof(ptr))((unsigned long)(ptr) | BIT(bit)))
> -#define ptr_test_bit(ptr, bit) ((unsigned long)(ptr) & BIT(bit))
> +#define TIMELINE_SEQNO_BYTES 8
>
> -#define CACHELINE_BITS 6
> -#define CACHELINE_FREE CACHELINE_BITS
> -
> -struct intel_timeline_hwsp {
> -       struct intel_gt *gt;
> -       struct intel_gt_timelines *gt_timelines;
> -       struct list_head free_link;
> -       struct i915_vma *vma;
> -       u64 free_bitmap;
> -};
> -
> -static struct i915_vma *__hwsp_alloc(struct intel_gt *gt)
> +static struct i915_vma *hwsp_alloc(struct intel_gt *gt)
>  {
>         struct drm_i915_private *i915 = gt->i915;
>         struct drm_i915_gem_object *obj;
> @@ -44,220 +32,59 @@ static struct i915_vma *__hwsp_alloc(struct intel_gt *gt)
>         return vma;
>  }
>
> -static struct i915_vma *
> -hwsp_alloc(struct intel_timeline *timeline, unsigned int *cacheline)
> -{
> -       struct intel_gt_timelines *gt = &timeline->gt->timelines;
> -       struct intel_timeline_hwsp *hwsp;
> -
> -       BUILD_BUG_ON(BITS_PER_TYPE(u64) * CACHELINE_BYTES > PAGE_SIZE);
> -
> -       spin_lock_irq(&gt->hwsp_lock);
> -
> -       /* hwsp_free_list only contains HWSP that have available cachelines */
> -       hwsp = list_first_entry_or_null(&gt->hwsp_free_list,
> -                                       typeof(*hwsp), free_link);
> -       if (!hwsp) {
> -               struct i915_vma *vma;
> -
> -               spin_unlock_irq(&gt->hwsp_lock);
> -
> -               hwsp = kmalloc(sizeof(*hwsp), GFP_KERNEL);
> -               if (!hwsp)
> -                       return ERR_PTR(-ENOMEM);
> -
> -               vma = __hwsp_alloc(timeline->gt);
> -               if (IS_ERR(vma)) {
> -                       kfree(hwsp);
> -                       return vma;
> -               }
> -
> -               GT_TRACE(timeline->gt, "new HWSP allocated\n");
> -
> -               vma->private = hwsp;
> -               hwsp->gt = timeline->gt;
> -               hwsp->vma = vma;
> -               hwsp->free_bitmap = ~0ull;
> -               hwsp->gt_timelines = gt;
> -
> -               spin_lock_irq(&gt->hwsp_lock);
> -               list_add(&hwsp->free_link, &gt->hwsp_free_list);
> -       }
> -
> -       GEM_BUG_ON(!hwsp->free_bitmap);
> -       *cacheline = __ffs64(hwsp->free_bitmap);
> -       hwsp->free_bitmap &= ~BIT_ULL(*cacheline);
> -       if (!hwsp->free_bitmap)
> -               list_del(&hwsp->free_link);
> -
> -       spin_unlock_irq(&gt->hwsp_lock);
> -
> -       GEM_BUG_ON(hwsp->vma->private != hwsp);
> -       return hwsp->vma;
> -}
> -
> -static void __idle_hwsp_free(struct intel_timeline_hwsp *hwsp, int cacheline)
> -{
> -       struct intel_gt_timelines *gt = hwsp->gt_timelines;
> -       unsigned long flags;
> -
> -       spin_lock_irqsave(&gt->hwsp_lock, flags);
> -
> -       /* As a cacheline becomes available, publish the HWSP on the freelist */
> -       if (!hwsp->free_bitmap)
> -               list_add_tail(&hwsp->free_link, &gt->hwsp_free_list);
> -
> -       GEM_BUG_ON(cacheline >= BITS_PER_TYPE(hwsp->free_bitmap));
> -       hwsp->free_bitmap |= BIT_ULL(cacheline);
> -
> -       /* And if no one is left using it, give the page back to the system */
> -       if (hwsp->free_bitmap == ~0ull) {
> -               i915_vma_put(hwsp->vma);
> -               list_del(&hwsp->free_link);
> -               kfree(hwsp);
> -       }
> -
> -       spin_unlock_irqrestore(&gt->hwsp_lock, flags);
> -}
> -
> -static void __rcu_cacheline_free(struct rcu_head *rcu)
> -{
> -       struct intel_timeline_cacheline *cl =
> -               container_of(rcu, typeof(*cl), rcu);
> -
> -       /* Must wait until after all *rq->hwsp are complete before removing */
> -       i915_gem_object_unpin_map(cl->hwsp->vma->obj);
> -       __idle_hwsp_free(cl->hwsp, ptr_unmask_bits(cl->vaddr, CACHELINE_BITS));
> -
> -       i915_active_fini(&cl->active);
> -       kfree(cl);
> -}
> -
> -static void __idle_cacheline_free(struct intel_timeline_cacheline *cl)
> -{
> -       GEM_BUG_ON(!i915_active_is_idle(&cl->active));
> -       call_rcu(&cl->rcu, __rcu_cacheline_free);
> -}
> -
>  __i915_active_call
> -static void __cacheline_retire(struct i915_active *active)
> +static void __timeline_retire(struct i915_active *active)
>  {
> -       struct intel_timeline_cacheline *cl =
> -               container_of(active, typeof(*cl), active);
> +       struct intel_timeline *tl =
> +               container_of(active, typeof(*tl), active);
>
> -       i915_vma_unpin(cl->hwsp->vma);
> -       if (ptr_test_bit(cl->vaddr, CACHELINE_FREE))
> -               __idle_cacheline_free(cl);
> +       i915_vma_unpin(tl->hwsp_ggtt);
> +       intel_timeline_put(tl);
>  }
>
> -static int __cacheline_active(struct i915_active *active)
> +static int __timeline_active(struct i915_active *active)
>  {
> -       struct intel_timeline_cacheline *cl =
> -               container_of(active, typeof(*cl), active);
> +       struct intel_timeline *tl =
> +               container_of(active, typeof(*tl), active);
>
> -       __i915_vma_pin(cl->hwsp->vma);
> +       __i915_vma_pin(tl->hwsp_ggtt);
> +       intel_timeline_get(tl);
>         return 0;
>  }
>
> -static struct intel_timeline_cacheline *
> -cacheline_alloc(struct intel_timeline_hwsp *hwsp, unsigned int cacheline)
> -{
> -       struct intel_timeline_cacheline *cl;
> -       void *vaddr;
> -
> -       GEM_BUG_ON(cacheline >= BIT(CACHELINE_BITS));
> -
> -       cl = kmalloc(sizeof(*cl), GFP_KERNEL);
> -       if (!cl)
> -               return ERR_PTR(-ENOMEM);
> -
> -       vaddr = i915_gem_object_pin_map(hwsp->vma->obj, I915_MAP_WB);
> -       if (IS_ERR(vaddr)) {
> -               kfree(cl);
> -               return ERR_CAST(vaddr);
> -       }
> -
> -       cl->hwsp = hwsp;
> -       cl->vaddr = page_pack_bits(vaddr, cacheline);
> -
> -       i915_active_init(&cl->active, __cacheline_active, __cacheline_retire);
> -
> -       return cl;
> -}
> -
> -static void cacheline_acquire(struct intel_timeline_cacheline *cl,
> -                             u32 ggtt_offset)
> -{
> -       if (!cl)
> -               return;
> -
> -       cl->ggtt_offset = ggtt_offset;
> -       i915_active_acquire(&cl->active);
> -}
> -
> -static void cacheline_release(struct intel_timeline_cacheline *cl)
> -{
> -       if (cl)
> -               i915_active_release(&cl->active);
> -}
> -
> -static void cacheline_free(struct intel_timeline_cacheline *cl)
> -{
> -       if (!i915_active_acquire_if_busy(&cl->active)) {
> -               __idle_cacheline_free(cl);
> -               return;
> -       }
> -
> -       GEM_BUG_ON(ptr_test_bit(cl->vaddr, CACHELINE_FREE));
> -       cl->vaddr = ptr_set_bit(cl->vaddr, CACHELINE_FREE);
> -
> -       i915_active_release(&cl->active);
> -}
> -
>  static int intel_timeline_init(struct intel_timeline *timeline,
>                                struct intel_gt *gt,
>                                struct i915_vma *hwsp,
>                                unsigned int offset)
>  {
>         void *vaddr;
> +       u32 *seqno;
>
>         kref_init(&timeline->kref);
>         atomic_set(&timeline->pin_count, 0);
>
>         timeline->gt = gt;
>
> -       timeline->has_initial_breadcrumb = !hwsp;
> -       timeline->hwsp_cacheline = NULL;
> -
> -       if (!hwsp) {
> -               struct intel_timeline_cacheline *cl;
> -               unsigned int cacheline;
> -
> -               hwsp = hwsp_alloc(timeline, &cacheline);
> +       if (hwsp) {
> +               timeline->hwsp_offset = offset;
> +               timeline->hwsp_ggtt = i915_vma_get(hwsp);
> +       } else {
> +               timeline->has_initial_breadcrumb = true;
> +               hwsp = hwsp_alloc(gt);
>                 if (IS_ERR(hwsp))
>                         return PTR_ERR(hwsp);
> -
> -               cl = cacheline_alloc(hwsp->private, cacheline);
> -               if (IS_ERR(cl)) {
> -                       __idle_hwsp_free(hwsp->private, cacheline);
> -                       return PTR_ERR(cl);
> -               }
> -
> -               timeline->hwsp_cacheline = cl;
> -               timeline->hwsp_offset = cacheline * CACHELINE_BYTES;
> -
> -               vaddr = page_mask_bits(cl->vaddr);
> -       } else {
> -               timeline->hwsp_offset = offset;
> -               vaddr = i915_gem_object_pin_map(hwsp->obj, I915_MAP_WB);
> -               if (IS_ERR(vaddr))
> -                       return PTR_ERR(vaddr);
> +               timeline->hwsp_ggtt = hwsp;
>         }
>
> -       timeline->hwsp_seqno =
> -               memset(vaddr + timeline->hwsp_offset, 0, CACHELINE_BYTES);
> +       vaddr = i915_gem_object_pin_map(hwsp->obj, I915_MAP_WB);
> +       if (IS_ERR(vaddr))
> +               return PTR_ERR(vaddr);
> +
> +       timeline->hwsp_map = vaddr;
> +       seqno = vaddr + timeline->hwsp_offset;
> +       WRITE_ONCE(*seqno, 0);
> +       timeline->hwsp_seqno = seqno;
>
> -       timeline->hwsp_ggtt = i915_vma_get(hwsp);
>         GEM_BUG_ON(timeline->hwsp_offset >= hwsp->size);
>
>         timeline->fence_context = dma_fence_context_alloc(1);
> @@ -268,6 +95,7 @@ static int intel_timeline_init(struct intel_timeline *timeline,
>         INIT_LIST_HEAD(&timeline->requests);
>
>         i915_syncmap_init(&timeline->sync);
> +       i915_active_init(&timeline->active, __timeline_active, __timeline_retire);
>
>         return 0;
>  }
> @@ -278,23 +106,18 @@ void intel_gt_init_timelines(struct intel_gt *gt)
>
>         spin_lock_init(&timelines->lock);
>         INIT_LIST_HEAD(&timelines->active_list);
> -
> -       spin_lock_init(&timelines->hwsp_lock);
> -       INIT_LIST_HEAD(&timelines->hwsp_free_list);
>  }
>
> -static void intel_timeline_fini(struct intel_timeline *timeline)
> +static void intel_timeline_fini(struct rcu_head *rcu)
>  {
> -       GEM_BUG_ON(atomic_read(&timeline->pin_count));
> -       GEM_BUG_ON(!list_empty(&timeline->requests));
> -       GEM_BUG_ON(timeline->retire);
> +       struct intel_timeline *timeline =
> +               container_of(rcu, struct intel_timeline, rcu);
>
> -       if (timeline->hwsp_cacheline)
> -               cacheline_free(timeline->hwsp_cacheline);
> -       else
> -               i915_gem_object_unpin_map(timeline->hwsp_ggtt->obj);
> +       i915_gem_object_unpin_map(timeline->hwsp_ggtt->obj);
>
>         i915_vma_put(timeline->hwsp_ggtt);
> +       i915_active_fini(&timeline->active);
> +       kfree(timeline);
>  }
>
>  struct intel_timeline *
> @@ -360,9 +183,9 @@ int intel_timeline_pin(struct intel_timeline *tl, struct i915_gem_ww_ctx *ww)
>         GT_TRACE(tl->gt, "timeline:%llx using HWSP offset:%x\n",
>                  tl->fence_context, tl->hwsp_offset);
>
> -       cacheline_acquire(tl->hwsp_cacheline, tl->hwsp_offset);
> +       i915_active_acquire(&tl->active);
>         if (atomic_fetch_inc(&tl->pin_count)) {
> -               cacheline_release(tl->hwsp_cacheline);
> +               i915_active_release(&tl->active);
>                 __i915_vma_unpin(tl->hwsp_ggtt);
>         }
>
> @@ -371,9 +194,13 @@ int intel_timeline_pin(struct intel_timeline *tl, struct i915_gem_ww_ctx *ww)
>
>  void intel_timeline_reset_seqno(const struct intel_timeline *tl)
>  {
> +       u32 *hwsp_seqno = (u32 *)tl->hwsp_seqno;
>         /* Must be pinned to be writable, and no requests in flight. */
>         GEM_BUG_ON(!atomic_read(&tl->pin_count));
> -       WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
> +
> +       memset(hwsp_seqno + 1, 0, TIMELINE_SEQNO_BYTES - sizeof(*hwsp_seqno));
> +       WRITE_ONCE(*hwsp_seqno, tl->seqno);
> +       clflush(hwsp_seqno);
>  }
>
>  void intel_timeline_enter(struct intel_timeline *tl)
> @@ -449,106 +276,23 @@ static u32 timeline_advance(struct intel_timeline *tl)
>         return tl->seqno += 1 + tl->has_initial_breadcrumb;
>  }
>
> -static void timeline_rollback(struct intel_timeline *tl)
> -{
> -       tl->seqno -= 1 + tl->has_initial_breadcrumb;
> -}
> -
>  static noinline int
>  __intel_timeline_get_seqno(struct intel_timeline *tl,
> -                          struct i915_request *rq,
>                            u32 *seqno)
>  {
> -       struct intel_timeline_cacheline *cl;
> -       unsigned int cacheline;
> -       struct i915_vma *vma;
> -       void *vaddr;
> -       int err;
> -
> -       might_lock(&tl->gt->ggtt->vm.mutex);
> -       GT_TRACE(tl->gt, "timeline:%llx wrapped\n", tl->fence_context);
> -
> -       /*
> -        * If there is an outstanding GPU reference to this cacheline,
> -        * such as it being sampled by a HW semaphore on another timeline,
> -        * we cannot wraparound our seqno value (the HW semaphore does
> -        * a strict greater-than-or-equals compare, not i915_seqno_passed).
> -        * So if the cacheline is still busy, we must detach ourselves
> -        * from it and leave it inflight alongside its users.
> -        *
> -        * However, if nobody is watching and we can guarantee that nobody
> -        * will, we could simply reuse the same cacheline.
> -        *
> -        * if (i915_active_request_is_signaled(&tl->last_request) &&
> -        *     i915_active_is_signaled(&tl->hwsp_cacheline->active))
> -        *      return 0;
> -        *
> -        * That seems unlikely for a busy timeline that needed to wrap in
> -        * the first place, so just replace the cacheline.
> -        */
> -
> -       vma = hwsp_alloc(tl, &cacheline);
> -       if (IS_ERR(vma)) {
> -               err = PTR_ERR(vma);
> -               goto err_rollback;
> -       }
> -
> -       err = i915_ggtt_pin(vma, NULL, 0, PIN_HIGH);
> -       if (err) {
> -               __idle_hwsp_free(vma->private, cacheline);
> -               goto err_rollback;
> -       }
> +       u32 next_ofs = offset_in_page(tl->hwsp_offset + TIMELINE_SEQNO_BYTES);
>
> -       cl = cacheline_alloc(vma->private, cacheline);
> -       if (IS_ERR(cl)) {
> -               err = PTR_ERR(cl);
> -               __idle_hwsp_free(vma->private, cacheline);
> -               goto err_unpin;
> -       }
> -       GEM_BUG_ON(cl->hwsp->vma != vma);
> -
> -       /*
> -        * Attach the old cacheline to the current request, so that we only
> -        * free it after the current request is retired, which ensures that
> -        * all writes into the cacheline from previous requests are complete.
> -        */
> -       err = i915_active_ref(&tl->hwsp_cacheline->active,
> -                             tl->fence_context,
> -                             &rq->fence);
> -       if (err)
> -               goto err_cacheline;
> +       /* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
> +       if (TIMELINE_SEQNO_BYTES <= BIT(5) && (next_ofs & BIT(5)))
> +               next_ofs = offset_in_page(next_ofs + BIT(5));
>
> -       cacheline_release(tl->hwsp_cacheline); /* ownership now xfered to rq */
> -       cacheline_free(tl->hwsp_cacheline);
> -
> -       i915_vma_unpin(tl->hwsp_ggtt); /* binding kept alive by old cacheline */
> -       i915_vma_put(tl->hwsp_ggtt);
> -
> -       tl->hwsp_ggtt = i915_vma_get(vma);
> -
> -       vaddr = page_mask_bits(cl->vaddr);
> -       tl->hwsp_offset = cacheline * CACHELINE_BYTES;
> -       tl->hwsp_seqno =
> -               memset(vaddr + tl->hwsp_offset, 0, CACHELINE_BYTES);
> -
> -       tl->hwsp_offset += i915_ggtt_offset(vma);
> -       GT_TRACE(tl->gt, "timeline:%llx using HWSP offset:%x\n",
> -                tl->fence_context, tl->hwsp_offset);
> -
> -       cacheline_acquire(cl, tl->hwsp_offset);
> -       tl->hwsp_cacheline = cl;
> +       tl->hwsp_offset = i915_ggtt_offset(tl->hwsp_ggtt) + next_ofs;
> +       tl->hwsp_seqno = tl->hwsp_map + next_ofs;
> +       intel_timeline_reset_seqno(tl);
>
>         *seqno = timeline_advance(tl);
>         GEM_BUG_ON(i915_seqno_passed(*tl->hwsp_seqno, *seqno));
>         return 0;
> -
> -err_cacheline:
> -       cacheline_free(cl);
> -err_unpin:
> -       i915_vma_unpin(vma);
> -err_rollback:
> -       timeline_rollback(tl);
> -       return err;
>  }
>
>  int intel_timeline_get_seqno(struct intel_timeline *tl,
> @@ -558,51 +302,52 @@ int intel_timeline_get_seqno(struct intel_timeline *tl,
>         *seqno = timeline_advance(tl);
>
>         /* Replace the HWSP on wraparound for HW semaphores */
> -       if (unlikely(!*seqno && tl->hwsp_cacheline))
> -               return __intel_timeline_get_seqno(tl, rq, seqno);
> +       if (unlikely(!*seqno && tl->has_initial_breadcrumb))
> +               return __intel_timeline_get_seqno(tl, seqno);
>
>         return 0;
>  }
>
> -static int cacheline_ref(struct intel_timeline_cacheline *cl,
> -                        struct i915_request *rq)
> -{
> -       return i915_active_add_request(&cl->active, rq);
> -}
> -
>  int intel_timeline_read_hwsp(struct i915_request *from,
>                              struct i915_request *to,
>                              u32 *hwsp)
>  {
> -       struct intel_timeline_cacheline *cl;
> +       struct intel_timeline *tl;
>         int err;
>
> -       GEM_BUG_ON(!rcu_access_pointer(from->hwsp_cacheline));
> -
>         rcu_read_lock();
> -       cl = rcu_dereference(from->hwsp_cacheline);
> -       if (i915_request_signaled(from)) /* confirm cacheline is valid */
> -               goto unlock;
> -       if (unlikely(!i915_active_acquire_if_busy(&cl->active)))
> -               goto unlock; /* seqno wrapped and completed! */
> -       if (unlikely(__i915_request_is_complete(from)))
> -               goto release;
> +       tl = rcu_dereference(from->timeline);
> +       if (i915_request_signaled(from) ||
> +           !i915_active_acquire_if_busy(&tl->active))
> +               tl = NULL;
> +
> +       if (tl) {
> +               /* hwsp_offset may wraparound, so use from->hwsp_seqno */
> +               *hwsp = i915_ggtt_offset(tl->hwsp_ggtt) +
> +                       offset_in_page(from->hwsp_seqno);
> +       }
> +
> +       /* ensure we wait on the right request, if not, we completed */
> +       if (tl && __i915_request_is_complete(from)) {
> +               i915_active_release(&tl->active);
> +               tl = NULL;
> +       }
>         rcu_read_unlock();
>
> -       err = cacheline_ref(cl, to);
> -       if (err)
> +       if (!tl)
> +               return 1;
> +
> +       /* Can't do semaphore waits on kernel context */
> +       if (!tl->has_initial_breadcrumb) {
> +               err = -EINVAL;
>                 goto out;
> +       }
> +
> +       err = i915_active_add_request(&tl->active, to);
>
> -       *hwsp = cl->ggtt_offset;
>  out:
> -       i915_active_release(&cl->active);
> +       i915_active_release(&tl->active);
>         return err;
> -
> -release:
> -       i915_active_release(&cl->active);
> -unlock:
> -       rcu_read_unlock();
> -       return 1;
>  }
>
>  void intel_timeline_unpin(struct intel_timeline *tl)
> @@ -611,8 +356,7 @@ void intel_timeline_unpin(struct intel_timeline *tl)
>         if (!atomic_dec_and_test(&tl->pin_count))
>                 return;
>
> -       cacheline_release(tl->hwsp_cacheline);
> -
> +       i915_active_release(&tl->active);
>         __i915_vma_unpin(tl->hwsp_ggtt);
>  }
>
> @@ -621,8 +365,11 @@ void __intel_timeline_free(struct kref *kref)
>         struct intel_timeline *timeline =
>                 container_of(kref, typeof(*timeline), kref);
>
> -       intel_timeline_fini(timeline);
> -       kfree_rcu(timeline, rcu);
> +       GEM_BUG_ON(atomic_read(&timeline->pin_count));
> +       GEM_BUG_ON(!list_empty(&timeline->requests));
> +       GEM_BUG_ON(timeline->retire);
> +
> +       call_rcu(&timeline->rcu, intel_timeline_fini);
>  }
>
>  void intel_gt_fini_timelines(struct intel_gt *gt)
> @@ -630,7 +377,6 @@ void intel_gt_fini_timelines(struct intel_gt *gt)
>         struct intel_gt_timelines *timelines = &gt->timelines;
>
>         GEM_BUG_ON(!list_empty(&timelines->active_list));
> -       GEM_BUG_ON(!list_empty(&timelines->hwsp_free_list));
>  }
>
>  void intel_gt_show_timelines(struct intel_gt *gt,
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline_types.h b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> index 9f677c9b7d06..74e67dbf89c5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> @@ -17,7 +17,6 @@
>  struct i915_vma;
>  struct i915_syncmap;
>  struct intel_gt;
> -struct intel_timeline_hwsp;
>
>  struct intel_timeline {
>         u64 fence_context;
> @@ -44,12 +43,11 @@ struct intel_timeline {
>         atomic_t pin_count;
>         atomic_t active_count;
>
> +       void *hwsp_map;
>         const u32 *hwsp_seqno;
>         struct i915_vma *hwsp_ggtt;
>         u32 hwsp_offset;
>
> -       struct intel_timeline_cacheline *hwsp_cacheline;
> -
>         bool has_initial_breadcrumb;
>
>         /**
> @@ -66,6 +64,8 @@ struct intel_timeline {
>          */
>         struct i915_active_fence last_request;
>
> +       struct i915_active active;
> +
>         /** A chain of completed timelines ready for early retirement. */
>         struct intel_timeline *retire;
>
> @@ -89,15 +89,4 @@ struct intel_timeline {
>         struct rcu_head rcu;
>  };
>
> -struct intel_timeline_cacheline {
> -       struct i915_active active;
> -
> -       struct intel_timeline_hwsp *hwsp;
> -       void *vaddr;
> -
> -       u32 ggtt_offset;
> -
> -       struct rcu_head rcu;
> -};
> -
>  #endif /* __I915_TIMELINE_TYPES_H__ */
> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_cs.c b/drivers/gpu/drm/i915/gt/selftest_engine_cs.c
> index 84d883de30ee..7e466ae114f8 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_cs.c
> @@ -41,6 +41,9 @@ static int perf_end(struct intel_gt *gt)
>
>  static int write_timestamp(struct i915_request *rq, int slot)
>  {
> +       struct intel_timeline *tl =
> +               rcu_dereference_protected(rq->timeline,
> +                                         !i915_request_signaled(rq));
>         u32 cmd;
>         u32 *cs;
>
> @@ -53,7 +56,7 @@ static int write_timestamp(struct i915_request *rq, int slot)
>                 cmd++;
>         *cs++ = cmd;
>         *cs++ = i915_mmio_reg_offset(RING_TIMESTAMP(rq->engine->mmio_base));
> -       *cs++ = i915_request_timeline(rq)->hwsp_offset + slot * sizeof(u32);
> +       *cs++ = tl->hwsp_offset + slot * sizeof(u32);
>         *cs++ = 0;
>
>         intel_ring_advance(rq, cs);
> diff --git a/drivers/gpu/drm/i915/gt/selftest_timeline.c b/drivers/gpu/drm/i915/gt/selftest_timeline.c
> index d283dce5b4ac..a4c78062e92b 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_timeline.c
> @@ -665,7 +665,7 @@ static int live_hwsp_wrap(void *arg)
>         if (IS_ERR(tl))
>                 return PTR_ERR(tl);
>
> -       if (!tl->has_initial_breadcrumb || !tl->hwsp_cacheline)
> +       if (!tl->has_initial_breadcrumb)
>                 goto out_free;
>
>         err = intel_timeline_pin(tl, NULL);
> @@ -832,12 +832,26 @@ static int setup_watcher(struct hwsp_watcher *w, struct intel_gt *gt)
>         return 0;
>  }
>
> +static void switch_tl_lock(struct i915_request *from, struct i915_request *to)
> +{
> +       /* some light mutex juggling required; think co-routines */
> +
> +       if (from) {
> +               lockdep_unpin_lock(&from->context->timeline->mutex, from->cookie);
> +               mutex_unlock(&from->context->timeline->mutex);
> +       }
> +
> +       if (to) {
> +               mutex_lock(&to->context->timeline->mutex);
> +               to->cookie = lockdep_pin_lock(&to->context->timeline->mutex);
> +       }
> +}
> +
>  static int create_watcher(struct hwsp_watcher *w,
>                           struct intel_engine_cs *engine,
>                           int ringsz)
>  {
>         struct intel_context *ce;
> -       struct intel_timeline *tl;
>
>         ce = intel_context_create(engine);
>         if (IS_ERR(ce))
> @@ -850,11 +864,8 @@ static int create_watcher(struct hwsp_watcher *w,
>                 return PTR_ERR(w->rq);
>
>         w->addr = i915_ggtt_offset(w->vma);
> -       tl = w->rq->context->timeline;
>
> -       /* some light mutex juggling required; think co-routines */
> -       lockdep_unpin_lock(&tl->mutex, w->rq->cookie);
> -       mutex_unlock(&tl->mutex);
> +       switch_tl_lock(w->rq, NULL);
>
>         return 0;
>  }
> @@ -863,15 +874,13 @@ static int check_watcher(struct hwsp_watcher *w, const char *name,
>                          bool (*op)(u32 hwsp, u32 seqno))
>  {
>         struct i915_request *rq = fetch_and_zero(&w->rq);
> -       struct intel_timeline *tl = rq->context->timeline;
>         u32 offset, end;
>         int err;
>
>         GEM_BUG_ON(w->addr - i915_ggtt_offset(w->vma) > w->vma->size);
>
>         i915_request_get(rq);
> -       mutex_lock(&tl->mutex);
> -       rq->cookie = lockdep_pin_lock(&tl->mutex);
> +       switch_tl_lock(NULL, rq);
>         i915_request_add(rq);
>
>         if (i915_request_wait(rq, 0, HZ) < 0) {
> @@ -900,10 +909,7 @@ static int check_watcher(struct hwsp_watcher *w, const char *name,
>  static void cleanup_watcher(struct hwsp_watcher *w)
>  {
>         if (w->rq) {
> -               struct intel_timeline *tl = w->rq->context->timeline;
> -
> -               mutex_lock(&tl->mutex);
> -               w->rq->cookie = lockdep_pin_lock(&tl->mutex);
> +               switch_tl_lock(NULL, w->rq);
>
>                 i915_request_add(w->rq);
>         }
> @@ -941,7 +947,7 @@ static struct i915_request *wrap_timeline(struct i915_request *rq)
>         }
>
>         i915_request_put(rq);
> -       rq = intel_context_create_request(ce);
> +       rq = i915_request_create(ce);
>         if (IS_ERR(rq))
>                 return rq;
>
> @@ -976,7 +982,7 @@ static int live_hwsp_read(void *arg)
>         if (IS_ERR(tl))
>                 return PTR_ERR(tl);
>
> -       if (!tl->hwsp_cacheline)
> +       if (!tl->has_initial_breadcrumb)
>                 goto out_free;
>
>         for (i = 0; i < ARRAY_SIZE(watcher); i++) {
> @@ -998,7 +1004,7 @@ static int live_hwsp_read(void *arg)
>                 do {
>                         struct i915_sw_fence *submit;
>                         struct i915_request *rq;
> -                       u32 hwsp;
> +                       u32 hwsp, dummy;
>
>                         submit = heap_fence_create(GFP_KERNEL);
>                         if (!submit) {
> @@ -1016,14 +1022,26 @@ static int live_hwsp_read(void *arg)
>                                 goto out;
>                         }
>
> -                       /* Skip to the end, saving 30 minutes of nops */
> -                       tl->seqno = -10u + 2 * (count & 3);
> -                       WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
>                         ce->timeline = intel_timeline_get(tl);
>
> -                       rq = intel_context_create_request(ce);
> +                       /* Ensure timeline is mapped, done during first pin */
> +                       err = intel_context_pin(ce);
> +                       if (err) {
> +                               intel_context_put(ce);
> +                               goto out;
> +                       }
> +
> +                       /*
> +                        * Start at a new wrap, and set seqno right before another wrap,
> +                        * saving 30 minutes of nops
> +                        */
> +                       tl->seqno = -12u + 2 * (count & 3);
> +                       __intel_timeline_get_seqno(tl, &dummy);
> +
> +                       rq = i915_request_create(ce);
>                         if (IS_ERR(rq)) {
>                                 err = PTR_ERR(rq);
> +                               intel_context_unpin(ce);
>                                 intel_context_put(ce);
>                                 goto out;
>                         }
> @@ -1033,32 +1051,35 @@ static int live_hwsp_read(void *arg)
>                                                             GFP_KERNEL);
>                         if (err < 0) {
>                                 i915_request_add(rq);
> +                               intel_context_unpin(ce);
>                                 intel_context_put(ce);
>                                 goto out;
>                         }
>
> -                       mutex_lock(&watcher[0].rq->context->timeline->mutex);
> +                       switch_tl_lock(rq, watcher[0].rq);
>                         err = intel_timeline_read_hwsp(rq, watcher[0].rq, &hwsp);
>                         if (err == 0)
>                                 err = emit_read_hwsp(watcher[0].rq, /* before */
>                                                      rq->fence.seqno, hwsp,
>                                                      &watcher[0].addr);
> -                       mutex_unlock(&watcher[0].rq->context->timeline->mutex);
> +                       switch_tl_lock(watcher[0].rq, rq);
>                         if (err) {
>                                 i915_request_add(rq);
> +                               intel_context_unpin(ce);
>                                 intel_context_put(ce);
>                                 goto out;
>                         }
>
> -                       mutex_lock(&watcher[1].rq->context->timeline->mutex);
> +                       switch_tl_lock(rq, watcher[1].rq);
>                         err = intel_timeline_read_hwsp(rq, watcher[1].rq, &hwsp);
>                         if (err == 0)
>                                 err = emit_read_hwsp(watcher[1].rq, /* after */
>                                                      rq->fence.seqno, hwsp,
>                                                      &watcher[1].addr);
> -                       mutex_unlock(&watcher[1].rq->context->timeline->mutex);
> +                       switch_tl_lock(watcher[1].rq, rq);
>                         if (err) {
>                                 i915_request_add(rq);
> +                               intel_context_unpin(ce);
>                                 intel_context_put(ce);
>                                 goto out;
>                         }
> @@ -1067,6 +1088,7 @@ static int live_hwsp_read(void *arg)
>                         i915_request_add(rq);
>
>                         rq = wrap_timeline(rq);
> +                       intel_context_unpin(ce);
>                         intel_context_put(ce);
>                         if (IS_ERR(rq)) {
>                                 err = PTR_ERR(rq);
> @@ -1106,8 +1128,8 @@ static int live_hwsp_read(void *arg)
>                             3 * watcher[1].rq->ring->size)
>                                 break;
>
> -               } while (!__igt_timeout(end_time, NULL));
> -               WRITE_ONCE(*(u32 *)tl->hwsp_seqno, 0xdeadbeef);
> +               } while (!__igt_timeout(end_time, NULL) &&
> +                        count < (PAGE_SIZE / TIMELINE_SEQNO_BYTES - 1) / 2);
>
>                 pr_info("%s: simulated %lu wraps\n", engine->name, count);
>                 err = check_watcher(&watcher[1], "after", cmp_gte);
> @@ -1152,9 +1174,7 @@ static int live_hwsp_rollover_kernel(void *arg)
>                 }
>
>                 GEM_BUG_ON(i915_active_fence_isset(&tl->last_request));
> -               tl->seqno = 0;
> -               timeline_rollback(tl);
> -               timeline_rollback(tl);
> +               tl->seqno = -2u;
>                 WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
>
>                 for (i = 0; i < ARRAY_SIZE(rq); i++) {
> @@ -1234,11 +1254,10 @@ static int live_hwsp_rollover_user(void *arg)
>                         goto out;
>
>                 tl = ce->timeline;
> -               if (!tl->has_initial_breadcrumb || !tl->hwsp_cacheline)
> +               if (!tl->has_initial_breadcrumb)
>                         goto out;
>
> -               timeline_rollback(tl);
> -               timeline_rollback(tl);
> +               tl->seqno = -4u;
>                 WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
>
>                 for (i = 0; i < ARRAY_SIZE(rq); i++) {
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index e7b4c4bc41a6..59d942910558 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -794,7 +794,6 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
>         rq->fence.seqno = seqno;
>
>         RCU_INIT_POINTER(rq->timeline, tl);
> -       RCU_INIT_POINTER(rq->hwsp_cacheline, tl->hwsp_cacheline);
>         rq->hwsp_seqno = tl->hwsp_seqno;
>         GEM_BUG_ON(__i915_request_is_complete(rq));
>
> @@ -1039,9 +1038,6 @@ emit_semaphore_wait(struct i915_request *to,
>         if (i915_request_has_initial_breadcrumb(to))
>                 goto await_fence;
>
> -       if (!rcu_access_pointer(from->hwsp_cacheline))
> -               goto await_fence;
> -
>         /*
>          * If this or its dependents are waiting on an external fence
>          * that may fail catastrophically, then we want to avoid using
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index dd10a6db3d21..38062495b66f 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -239,16 +239,6 @@ struct i915_request {
>          */
>         const u32 *hwsp_seqno;
>
> -       /*
> -        * If we need to access the timeline's seqno for this request in
> -        * another request, we need to keep a read reference to this associated
> -        * cacheline, so that we do not free and recycle it before the foreign
> -        * observers have completed. Hence, we keep a pointer to the cacheline
> -        * inside the timeline's HWSP vma, but it is only valid while this
> -        * request has not completed and guarded by the timeline mutex.
> -        */
> -       struct intel_timeline_cacheline __rcu *hwsp_cacheline;
> -
>         /** Position in the ring of the start of the request */
>         u32 head;
>
> @@ -650,4 +640,25 @@ static inline bool i915_request_use_semaphores(const struct i915_request *rq)
>         return intel_engine_has_semaphores(rq->engine);
>  }
>
> +static inline u32
> +i915_request_active_seqno(const struct i915_request *rq)
> +{
> +       u32 hwsp_phys_base =
> +               page_mask_bits(i915_request_active_timeline(rq)->hwsp_offset);
> +       u32 hwsp_relative_offset = offset_in_page(rq->hwsp_seqno);
> +
> +       /*
> +        * Because of wraparound, we cannot simply take tl->hwsp_offset,
> +        * but instead use the fact that the relative offset of vaddr within
> +        * the page is the same as for hwsp_offset. Take the top bits from
> +        * tl->hwsp_offset and combine them with the relative offset in rq->hwsp_seqno.
> +        *
> +        * As rq->hwsp_seqno is rewritten when signaled, this only works
> +        * when the request isn't signaled yet, but at that point you
> +        * no longer need the offset.
> +        */
> +
> +       return hwsp_phys_base + hwsp_relative_offset;
> +}
> +
>  #endif /* I915_REQUEST_H */
> --
> 2.30.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Maarten Lankhorst March 15, 2021, 12:08 p.m. UTC | #2
On 2021-03-11 at 22:22, Jason Ekstrand wrote:
> First off, I'm just here asking questions right now trying to start
> getting my head around some of this stuff.  Feel free to ignore me or
> tell me to go away if I'm being annoying. :-)
>
> On Thu, Mar 11, 2021 at 7:49 AM Maarten Lankhorst
> <maarten.lankhorst@linux.intel.com> wrote:
>> Instead of sharing pages with breadcrumbs, give each timeline a
>> single page. This allows unrelated timelines not to share locks
>> any more during command submission.
>>
>> As an additional benefit, seqno wraparound no longer requires
>> i915_vma_pin, which means we no longer need to worry about a
>> potential -EDEADLK at a point where we are ready to submit.
>>
>> Changes since v1:
>> - Fix erroneous i915_vma_acquire that should be a i915_vma_release (ickle).
>> - Extra check for completion in intel_read_hwsp().
>> Changes since v2:
>> - Fix inconsistent indent in hwsp_alloc() (kbuild)
>> - memset entire cacheline to 0.
>> Changes since v3:
>> - Do same in intel_timeline_reset_seqno(), and clflush for good measure.
>> Changes since v4:
>> - Use refcounting on timeline, instead of relying on i915_active.
>> - Fix waiting on kernel requests.
>> Changes since v5:
>> - Bump amount of slots to maximum (256), for best wraparounds.
>> - Add hwsp_offset to i915_request to fix potential wraparound hang.
>> - Ensure timeline wrap test works with the changes.
>> - Assign hwsp in intel_timeline_read_hwsp() within the rcu lock to
>>   fix a hang.
>> Changes since v6:
>> - Rename i915_request_active_offset to i915_request_active_seqno(),
>>   and elaborate the function. (tvrtko)
>>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com> #v1
>> Reported-by: kernel test robot <lkp@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/gen2_engine_cs.c      |   2 +-
>>  drivers/gpu/drm/i915/gt/gen6_engine_cs.c      |   8 +-
>>  drivers/gpu/drm/i915/gt/gen8_engine_cs.c      |  13 +-
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   1 +
>>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |   4 -
>>  drivers/gpu/drm/i915/gt/intel_timeline.c      | 422 ++++--------------
>>  .../gpu/drm/i915/gt/intel_timeline_types.h    |  17 +-
>>  drivers/gpu/drm/i915/gt/selftest_engine_cs.c  |   5 +-
>>  drivers/gpu/drm/i915/gt/selftest_timeline.c   |  83 ++--
>>  drivers/gpu/drm/i915/i915_request.c           |   4 -
>>  drivers/gpu/drm/i915/i915_request.h           |  31 +-
>>  11 files changed, 175 insertions(+), 415 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/gen2_engine_cs.c b/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
>> index b491a64919c8..9646200d2792 100644
>> --- a/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
>> @@ -143,7 +143,7 @@ static u32 *__gen2_emit_breadcrumb(struct i915_request *rq, u32 *cs,
>>                                    int flush, int post)
>>  {
>>         GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
>> -       GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
>> +       GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
>>
>>         *cs++ = MI_FLUSH;
>>
>> diff --git a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
>> index ce38d1bcaba3..b388ceeeb1c9 100644
>> --- a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
>> @@ -161,7 +161,7 @@ u32 *gen6_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>>                  PIPE_CONTROL_DC_FLUSH_ENABLE |
>>                  PIPE_CONTROL_QW_WRITE |
>>                  PIPE_CONTROL_CS_STALL);
>> -       *cs++ = i915_request_active_timeline(rq)->hwsp_offset |
>> +       *cs++ = i915_request_active_seqno(rq) |
>>                 PIPE_CONTROL_GLOBAL_GTT;
>>         *cs++ = rq->fence.seqno;
>>
>> @@ -359,7 +359,7 @@ u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>>                  PIPE_CONTROL_QW_WRITE |
>>                  PIPE_CONTROL_GLOBAL_GTT_IVB |
>>                  PIPE_CONTROL_CS_STALL);
>> -       *cs++ = i915_request_active_timeline(rq)->hwsp_offset;
>> +       *cs++ = i915_request_active_seqno(rq);
>>         *cs++ = rq->fence.seqno;
>>
>>         *cs++ = MI_USER_INTERRUPT;
>> @@ -374,7 +374,7 @@ u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>>  u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
>>  {
>>         GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
>> -       GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
>> +       GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
>>
>>         *cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
>>         *cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
>> @@ -394,7 +394,7 @@ u32 *gen7_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
>>         int i;
>>
>>         GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
>> -       GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
>> +       GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
>>
>>         *cs++ = MI_FLUSH_DW | MI_INVALIDATE_TLB |
>>                 MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
>> diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
>> index cac80af7ad1c..6b9c34d3ac8d 100644
>> --- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
>> @@ -338,15 +338,14 @@ static u32 preempt_address(struct intel_engine_cs *engine)
>>
>>  static u32 hwsp_offset(const struct i915_request *rq)
>>  {
>> -       const struct intel_timeline_cacheline *cl;
>> +       const struct intel_timeline *tl;
>>
>> -       /* Before the request is executed, the timeline/cachline is fixed */
>> +       /* Before the request is executed, the timeline is fixed */
>> +       tl = rcu_dereference_protected(rq->timeline,
>> +                                      !i915_request_signaled(rq));
> Why is Gen8+ different from Gen2/6 here?  In particular, why not use
> i915_request_active_timeline(rq) or, better yet,
> i915_request_active_seqno()?  The primary difference I see is that the
> guard on the RCU is different but it's not immediately obvious to me
> why this should be different between hardware generations.  Also,
> i915_request_active_seqno() returns a u32, but that could be fixed.

The legacy rings use different locking, and it was splatting on PROVE_RCU.

I didn't want to weaken it for the other paths, so I just put in a different condition for gen8 that's slightly weaker, but still better than the unconditional '1' used before.
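
(For reference, the two RCU guards being compared, paraphrased; the exact
lockdep expression used by i915_request_active_timeline() is quoted here
from memory of the i915 code of the time and may differ slightly:)

	/* i915_request_active_timeline(): lockdep verifies that the caller
	 * holds the engine's active lock */
	tl = rcu_dereference_protected(rq->timeline,
				       lockdep_is_held(&rq->engine->active.lock));

	/* gen8 hwsp_offset() above: may run without that lock, so assert
	 * only the weaker invariant that rq->timeline cannot change until
	 * the request has been signaled */
	tl = rcu_dereference_protected(rq->timeline,
				       !i915_request_signaled(rq));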

Patch

diff --git a/drivers/gpu/drm/i915/gt/gen2_engine_cs.c b/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
index b491a64919c8..9646200d2792 100644
--- a/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen2_engine_cs.c
@@ -143,7 +143,7 @@  static u32 *__gen2_emit_breadcrumb(struct i915_request *rq, u32 *cs,
 				   int flush, int post)
 {
 	GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
-	GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
+	GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
 
 	*cs++ = MI_FLUSH;
 
diff --git a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
index ce38d1bcaba3..b388ceeeb1c9 100644
--- a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
@@ -161,7 +161,7 @@  u32 *gen6_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
 		 PIPE_CONTROL_DC_FLUSH_ENABLE |
 		 PIPE_CONTROL_QW_WRITE |
 		 PIPE_CONTROL_CS_STALL);
-	*cs++ = i915_request_active_timeline(rq)->hwsp_offset |
+	*cs++ = i915_request_active_seqno(rq) |
 		PIPE_CONTROL_GLOBAL_GTT;
 	*cs++ = rq->fence.seqno;
 
@@ -359,7 +359,7 @@  u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
 		 PIPE_CONTROL_QW_WRITE |
 		 PIPE_CONTROL_GLOBAL_GTT_IVB |
 		 PIPE_CONTROL_CS_STALL);
-	*cs++ = i915_request_active_timeline(rq)->hwsp_offset;
+	*cs++ = i915_request_active_seqno(rq);
 	*cs++ = rq->fence.seqno;
 
 	*cs++ = MI_USER_INTERRUPT;
@@ -374,7 +374,7 @@  u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
 u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 {
 	GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
-	GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
+	GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
 
 	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
 	*cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
@@ -394,7 +394,7 @@  u32 *gen7_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 	int i;
 
 	GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != rq->engine->status_page.vma);
-	GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != I915_GEM_HWS_SEQNO_ADDR);
+	GEM_BUG_ON(offset_in_page(rq->hwsp_seqno) != I915_GEM_HWS_SEQNO_ADDR);
 
 	*cs++ = MI_FLUSH_DW | MI_INVALIDATE_TLB |
 		MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
index cac80af7ad1c..6b9c34d3ac8d 100644
--- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
@@ -338,15 +338,14 @@  static u32 preempt_address(struct intel_engine_cs *engine)
 
 static u32 hwsp_offset(const struct i915_request *rq)
 {
-	const struct intel_timeline_cacheline *cl;
+	const struct intel_timeline *tl;
 
-	/* Before the request is executed, the timeline/cachline is fixed */
+	/* Before the request is executed, the timeline is fixed */
+	tl = rcu_dereference_protected(rq->timeline,
+				       !i915_request_signaled(rq));
 
-	cl = rcu_dereference_protected(rq->hwsp_cacheline, 1);
-	if (cl)
-		return cl->ggtt_offset;
-
-	return rcu_dereference_protected(rq->timeline, 1)->hwsp_offset;
+	/* See the comment in i915_request_active_seqno(). */
+	return page_mask_bits(tl->hwsp_offset) + offset_in_page(rq->hwsp_seqno);
 }
 
 int gen8_emit_init_breadcrumb(struct i915_request *rq)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index b4df275cba79..e6cefd00b4a1 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -752,6 +752,7 @@  static int measure_breadcrumb_dw(struct intel_context *ce)
 	frame->rq.engine = engine;
 	frame->rq.context = ce;
 	rcu_assign_pointer(frame->rq.timeline, ce->timeline);
+	frame->rq.hwsp_seqno = ce->timeline->hwsp_seqno;
 
 	frame->ring.vaddr = frame->cs;
 	frame->ring.size = sizeof(frame->cs);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 626af37c7790..3f6db8357309 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -45,10 +45,6 @@  struct intel_gt {
 	struct intel_gt_timelines {
 		spinlock_t lock; /* protects active_list */
 		struct list_head active_list;
-
-		/* Pack multiple timelines' seqnos into the same page */
-		spinlock_t hwsp_lock;
-		struct list_head hwsp_free_list;
 	} timelines;
 
 	struct intel_gt_requests {
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index 491b8df174c2..efe2030cfe5e 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -11,21 +11,9 @@ 
 #include "intel_ring.h"
 #include "intel_timeline.h"
 
-#define ptr_set_bit(ptr, bit) ((typeof(ptr))((unsigned long)(ptr) | BIT(bit)))
-#define ptr_test_bit(ptr, bit) ((unsigned long)(ptr) & BIT(bit))
+#define TIMELINE_SEQNO_BYTES 8
 
-#define CACHELINE_BITS 6
-#define CACHELINE_FREE CACHELINE_BITS
-
-struct intel_timeline_hwsp {
-	struct intel_gt *gt;
-	struct intel_gt_timelines *gt_timelines;
-	struct list_head free_link;
-	struct i915_vma *vma;
-	u64 free_bitmap;
-};
-
-static struct i915_vma *__hwsp_alloc(struct intel_gt *gt)
+static struct i915_vma *hwsp_alloc(struct intel_gt *gt)
 {
 	struct drm_i915_private *i915 = gt->i915;
 	struct drm_i915_gem_object *obj;
@@ -44,220 +32,59 @@  static struct i915_vma *__hwsp_alloc(struct intel_gt *gt)
 	return vma;
 }
 
-static struct i915_vma *
-hwsp_alloc(struct intel_timeline *timeline, unsigned int *cacheline)
-{
-	struct intel_gt_timelines *gt = &timeline->gt->timelines;
-	struct intel_timeline_hwsp *hwsp;
-
-	BUILD_BUG_ON(BITS_PER_TYPE(u64) * CACHELINE_BYTES > PAGE_SIZE);
-
-	spin_lock_irq(&gt->hwsp_lock);
-
-	/* hwsp_free_list only contains HWSP that have available cachelines */
-	hwsp = list_first_entry_or_null(&gt->hwsp_free_list,
-					typeof(*hwsp), free_link);
-	if (!hwsp) {
-		struct i915_vma *vma;
-
-		spin_unlock_irq(&gt->hwsp_lock);
-
-		hwsp = kmalloc(sizeof(*hwsp), GFP_KERNEL);
-		if (!hwsp)
-			return ERR_PTR(-ENOMEM);
-
-		vma = __hwsp_alloc(timeline->gt);
-		if (IS_ERR(vma)) {
-			kfree(hwsp);
-			return vma;
-		}
-
-		GT_TRACE(timeline->gt, "new HWSP allocated\n");
-
-		vma->private = hwsp;
-		hwsp->gt = timeline->gt;
-		hwsp->vma = vma;
-		hwsp->free_bitmap = ~0ull;
-		hwsp->gt_timelines = gt;
-
-		spin_lock_irq(&gt->hwsp_lock);
-		list_add(&hwsp->free_link, &gt->hwsp_free_list);
-	}
-
-	GEM_BUG_ON(!hwsp->free_bitmap);
-	*cacheline = __ffs64(hwsp->free_bitmap);
-	hwsp->free_bitmap &= ~BIT_ULL(*cacheline);
-	if (!hwsp->free_bitmap)
-		list_del(&hwsp->free_link);
-
-	spin_unlock_irq(&gt->hwsp_lock);
-
-	GEM_BUG_ON(hwsp->vma->private != hwsp);
-	return hwsp->vma;
-}
-
-static void __idle_hwsp_free(struct intel_timeline_hwsp *hwsp, int cacheline)
-{
-	struct intel_gt_timelines *gt = hwsp->gt_timelines;
-	unsigned long flags;
-
-	spin_lock_irqsave(&gt->hwsp_lock, flags);
-
-	/* As a cacheline becomes available, publish the HWSP on the freelist */
-	if (!hwsp->free_bitmap)
-		list_add_tail(&hwsp->free_link, &gt->hwsp_free_list);
-
-	GEM_BUG_ON(cacheline >= BITS_PER_TYPE(hwsp->free_bitmap));
-	hwsp->free_bitmap |= BIT_ULL(cacheline);
-
-	/* And if no one is left using it, give the page back to the system */
-	if (hwsp->free_bitmap == ~0ull) {
-		i915_vma_put(hwsp->vma);
-		list_del(&hwsp->free_link);
-		kfree(hwsp);
-	}
-
-	spin_unlock_irqrestore(&gt->hwsp_lock, flags);
-}
-
-static void __rcu_cacheline_free(struct rcu_head *rcu)
-{
-	struct intel_timeline_cacheline *cl =
-		container_of(rcu, typeof(*cl), rcu);
-
-	/* Must wait until after all *rq->hwsp are complete before removing */
-	i915_gem_object_unpin_map(cl->hwsp->vma->obj);
-	__idle_hwsp_free(cl->hwsp, ptr_unmask_bits(cl->vaddr, CACHELINE_BITS));
-
-	i915_active_fini(&cl->active);
-	kfree(cl);
-}
-
-static void __idle_cacheline_free(struct intel_timeline_cacheline *cl)
-{
-	GEM_BUG_ON(!i915_active_is_idle(&cl->active));
-	call_rcu(&cl->rcu, __rcu_cacheline_free);
-}
-
 __i915_active_call
-static void __cacheline_retire(struct i915_active *active)
+static void __timeline_retire(struct i915_active *active)
 {
-	struct intel_timeline_cacheline *cl =
-		container_of(active, typeof(*cl), active);
+	struct intel_timeline *tl =
+		container_of(active, typeof(*tl), active);
 
-	i915_vma_unpin(cl->hwsp->vma);
-	if (ptr_test_bit(cl->vaddr, CACHELINE_FREE))
-		__idle_cacheline_free(cl);
+	i915_vma_unpin(tl->hwsp_ggtt);
+	intel_timeline_put(tl);
 }
 
-static int __cacheline_active(struct i915_active *active)
+static int __timeline_active(struct i915_active *active)
 {
-	struct intel_timeline_cacheline *cl =
-		container_of(active, typeof(*cl), active);
+	struct intel_timeline *tl =
+		container_of(active, typeof(*tl), active);
 
-	__i915_vma_pin(cl->hwsp->vma);
+	__i915_vma_pin(tl->hwsp_ggtt);
+	intel_timeline_get(tl);
 	return 0;
 }
 
-static struct intel_timeline_cacheline *
-cacheline_alloc(struct intel_timeline_hwsp *hwsp, unsigned int cacheline)
-{
-	struct intel_timeline_cacheline *cl;
-	void *vaddr;
-
-	GEM_BUG_ON(cacheline >= BIT(CACHELINE_BITS));
-
-	cl = kmalloc(sizeof(*cl), GFP_KERNEL);
-	if (!cl)
-		return ERR_PTR(-ENOMEM);
-
-	vaddr = i915_gem_object_pin_map(hwsp->vma->obj, I915_MAP_WB);
-	if (IS_ERR(vaddr)) {
-		kfree(cl);
-		return ERR_CAST(vaddr);
-	}
-
-	cl->hwsp = hwsp;
-	cl->vaddr = page_pack_bits(vaddr, cacheline);
-
-	i915_active_init(&cl->active, __cacheline_active, __cacheline_retire);
-
-	return cl;
-}
-
-static void cacheline_acquire(struct intel_timeline_cacheline *cl,
-			      u32 ggtt_offset)
-{
-	if (!cl)
-		return;
-
-	cl->ggtt_offset = ggtt_offset;
-	i915_active_acquire(&cl->active);
-}
-
-static void cacheline_release(struct intel_timeline_cacheline *cl)
-{
-	if (cl)
-		i915_active_release(&cl->active);
-}
-
-static void cacheline_free(struct intel_timeline_cacheline *cl)
-{
-	if (!i915_active_acquire_if_busy(&cl->active)) {
-		__idle_cacheline_free(cl);
-		return;
-	}
-
-	GEM_BUG_ON(ptr_test_bit(cl->vaddr, CACHELINE_FREE));
-	cl->vaddr = ptr_set_bit(cl->vaddr, CACHELINE_FREE);
-
-	i915_active_release(&cl->active);
-}
-
 static int intel_timeline_init(struct intel_timeline *timeline,
 			       struct intel_gt *gt,
 			       struct i915_vma *hwsp,
 			       unsigned int offset)
 {
 	void *vaddr;
+	u32 *seqno;
 
 	kref_init(&timeline->kref);
 	atomic_set(&timeline->pin_count, 0);
 
 	timeline->gt = gt;
 
-	timeline->has_initial_breadcrumb = !hwsp;
-	timeline->hwsp_cacheline = NULL;
-
-	if (!hwsp) {
-		struct intel_timeline_cacheline *cl;
-		unsigned int cacheline;
-
-		hwsp = hwsp_alloc(timeline, &cacheline);
+	if (hwsp) {
+		timeline->hwsp_offset = offset;
+		timeline->hwsp_ggtt = i915_vma_get(hwsp);
+	} else {
+		timeline->has_initial_breadcrumb = true;
+		hwsp = hwsp_alloc(gt);
 		if (IS_ERR(hwsp))
 			return PTR_ERR(hwsp);
-
-		cl = cacheline_alloc(hwsp->private, cacheline);
-		if (IS_ERR(cl)) {
-			__idle_hwsp_free(hwsp->private, cacheline);
-			return PTR_ERR(cl);
-		}
-
-		timeline->hwsp_cacheline = cl;
-		timeline->hwsp_offset = cacheline * CACHELINE_BYTES;
-
-		vaddr = page_mask_bits(cl->vaddr);
-	} else {
-		timeline->hwsp_offset = offset;
-		vaddr = i915_gem_object_pin_map(hwsp->obj, I915_MAP_WB);
-		if (IS_ERR(vaddr))
-			return PTR_ERR(vaddr);
+		timeline->hwsp_ggtt = hwsp;
 	}
 
-	timeline->hwsp_seqno =
-		memset(vaddr + timeline->hwsp_offset, 0, CACHELINE_BYTES);
+	vaddr = i915_gem_object_pin_map(hwsp->obj, I915_MAP_WB);
+	if (IS_ERR(vaddr))
+		return PTR_ERR(vaddr);
+
+	timeline->hwsp_map = vaddr;
+	seqno = vaddr + timeline->hwsp_offset;
+	WRITE_ONCE(*seqno, 0);
+	timeline->hwsp_seqno = seqno;
 
-	timeline->hwsp_ggtt = i915_vma_get(hwsp);
 	GEM_BUG_ON(timeline->hwsp_offset >= hwsp->size);
 
 	timeline->fence_context = dma_fence_context_alloc(1);
@@ -268,6 +95,7 @@  static int intel_timeline_init(struct intel_timeline *timeline,
 	INIT_LIST_HEAD(&timeline->requests);
 
 	i915_syncmap_init(&timeline->sync);
+	i915_active_init(&timeline->active, __timeline_active, __timeline_retire);
 
 	return 0;
 }
@@ -278,23 +106,18 @@  void intel_gt_init_timelines(struct intel_gt *gt)
 
 	spin_lock_init(&timelines->lock);
 	INIT_LIST_HEAD(&timelines->active_list);
-
-	spin_lock_init(&timelines->hwsp_lock);
-	INIT_LIST_HEAD(&timelines->hwsp_free_list);
 }
 
-static void intel_timeline_fini(struct intel_timeline *timeline)
+static void intel_timeline_fini(struct rcu_head *rcu)
 {
-	GEM_BUG_ON(atomic_read(&timeline->pin_count));
-	GEM_BUG_ON(!list_empty(&timeline->requests));
-	GEM_BUG_ON(timeline->retire);
+	struct intel_timeline *timeline =
+		container_of(rcu, struct intel_timeline, rcu);
 
-	if (timeline->hwsp_cacheline)
-		cacheline_free(timeline->hwsp_cacheline);
-	else
-		i915_gem_object_unpin_map(timeline->hwsp_ggtt->obj);
+	i915_gem_object_unpin_map(timeline->hwsp_ggtt->obj);
 
 	i915_vma_put(timeline->hwsp_ggtt);
+	i915_active_fini(&timeline->active);
+	kfree(timeline);
 }
 
 struct intel_timeline *
@@ -360,9 +183,9 @@  int intel_timeline_pin(struct intel_timeline *tl, struct i915_gem_ww_ctx *ww)
 	GT_TRACE(tl->gt, "timeline:%llx using HWSP offset:%x\n",
 		 tl->fence_context, tl->hwsp_offset);
 
-	cacheline_acquire(tl->hwsp_cacheline, tl->hwsp_offset);
+	i915_active_acquire(&tl->active);
 	if (atomic_fetch_inc(&tl->pin_count)) {
-		cacheline_release(tl->hwsp_cacheline);
+		i915_active_release(&tl->active);
 		__i915_vma_unpin(tl->hwsp_ggtt);
 	}
 
@@ -371,9 +194,13 @@  int intel_timeline_pin(struct intel_timeline *tl, struct i915_gem_ww_ctx *ww)
 
 void intel_timeline_reset_seqno(const struct intel_timeline *tl)
 {
+	u32 *hwsp_seqno = (u32 *)tl->hwsp_seqno;
 	/* Must be pinned to be writable, and no requests in flight. */
 	GEM_BUG_ON(!atomic_read(&tl->pin_count));
-	WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
+
+	memset(hwsp_seqno + 1, 0, TIMELINE_SEQNO_BYTES - sizeof(*hwsp_seqno));
+	WRITE_ONCE(*hwsp_seqno, tl->seqno);
+	clflush(hwsp_seqno);
 }
 
 void intel_timeline_enter(struct intel_timeline *tl)
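For illustration, a minimal user-space sketch of the slot reset above: zero the tail of the 8-byte seqno slot, then publish the seqno dword. The volatile store stands in for WRITE_ONCE() and the trailing clflush is omitted; this is a sketch, not the kernel API.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TIMELINE_SEQNO_BYTES 8

static void reset_seqno(uint32_t *slot, uint32_t seqno)
{
	/* zero the tail first, presumably so stale data in the rest of
	 * the slot cannot be mistaken for live state after a wrap */
	memset(slot + 1, 0, TIMELINE_SEQNO_BYTES - sizeof(*slot));
	/* then publish the new seqno in dword 0 */
	*(volatile uint32_t *)slot = seqno;
}

int main(void)
{
	uint32_t slot[2] = { 0xdeadbeef, 0xdeadbeef };

	reset_seqno(slot, 42);
	printf("%u %u\n", slot[0], slot[1]);	/* prints: 42 0 */
	return 0;
}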
@@ -449,106 +276,23 @@  static u32 timeline_advance(struct intel_timeline *tl)
 	return tl->seqno += 1 + tl->has_initial_breadcrumb;
 }
 
-static void timeline_rollback(struct intel_timeline *tl)
-{
-	tl->seqno -= 1 + tl->has_initial_breadcrumb;
-}
-
 static noinline int
 __intel_timeline_get_seqno(struct intel_timeline *tl,
-			   struct i915_request *rq,
 			   u32 *seqno)
 {
-	struct intel_timeline_cacheline *cl;
-	unsigned int cacheline;
-	struct i915_vma *vma;
-	void *vaddr;
-	int err;
-
-	might_lock(&tl->gt->ggtt->vm.mutex);
-	GT_TRACE(tl->gt, "timeline:%llx wrapped\n", tl->fence_context);
-
-	/*
-	 * If there is an outstanding GPU reference to this cacheline,
-	 * such as it being sampled by a HW semaphore on another timeline,
-	 * we cannot wraparound our seqno value (the HW semaphore does
-	 * a strict greater-than-or-equals compare, not i915_seqno_passed).
-	 * So if the cacheline is still busy, we must detach ourselves
-	 * from it and leave it inflight alongside its users.
-	 *
-	 * However, if nobody is watching and we can guarantee that nobody
-	 * will, we could simply reuse the same cacheline.
-	 *
-	 * if (i915_active_request_is_signaled(&tl->last_request) &&
-	 *     i915_active_is_signaled(&tl->hwsp_cacheline->active))
-	 *	return 0;
-	 *
-	 * That seems unlikely for a busy timeline that needed to wrap in
-	 * the first place, so just replace the cacheline.
-	 */
-
-	vma = hwsp_alloc(tl, &cacheline);
-	if (IS_ERR(vma)) {
-		err = PTR_ERR(vma);
-		goto err_rollback;
-	}
-
-	err = i915_ggtt_pin(vma, NULL, 0, PIN_HIGH);
-	if (err) {
-		__idle_hwsp_free(vma->private, cacheline);
-		goto err_rollback;
-	}
+	u32 next_ofs = offset_in_page(tl->hwsp_offset + TIMELINE_SEQNO_BYTES);
 
-	cl = cacheline_alloc(vma->private, cacheline);
-	if (IS_ERR(cl)) {
-		err = PTR_ERR(cl);
-		__idle_hwsp_free(vma->private, cacheline);
-		goto err_unpin;
-	}
-	GEM_BUG_ON(cl->hwsp->vma != vma);
-
-	/*
-	 * Attach the old cacheline to the current request, so that we only
-	 * free it after the current request is retired, which ensures that
-	 * all writes into the cacheline from previous requests are complete.
-	 */
-	err = i915_active_ref(&tl->hwsp_cacheline->active,
-			      tl->fence_context,
-			      &rq->fence);
-	if (err)
-		goto err_cacheline;
+	/* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
+	if (TIMELINE_SEQNO_BYTES <= BIT(5) && (next_ofs & BIT(5)))
+		next_ofs = offset_in_page(next_ofs + BIT(5));
 
-	cacheline_release(tl->hwsp_cacheline); /* ownership now xfered to rq */
-	cacheline_free(tl->hwsp_cacheline);
-
-	i915_vma_unpin(tl->hwsp_ggtt); /* binding kept alive by old cacheline */
-	i915_vma_put(tl->hwsp_ggtt);
-
-	tl->hwsp_ggtt = i915_vma_get(vma);
-
-	vaddr = page_mask_bits(cl->vaddr);
-	tl->hwsp_offset = cacheline * CACHELINE_BYTES;
-	tl->hwsp_seqno =
-		memset(vaddr + tl->hwsp_offset, 0, CACHELINE_BYTES);
-
-	tl->hwsp_offset += i915_ggtt_offset(vma);
-	GT_TRACE(tl->gt, "timeline:%llx using HWSP offset:%x\n",
-		 tl->fence_context, tl->hwsp_offset);
-
-	cacheline_acquire(cl, tl->hwsp_offset);
-	tl->hwsp_cacheline = cl;
+	tl->hwsp_offset = i915_ggtt_offset(tl->hwsp_ggtt) + next_ofs;
+	tl->hwsp_seqno = tl->hwsp_map + next_ofs;
+	intel_timeline_reset_seqno(tl);
 
 	*seqno = timeline_advance(tl);
 	GEM_BUG_ON(i915_seqno_passed(*tl->hwsp_seqno, *seqno));
 	return 0;
-
-err_cacheline:
-	cacheline_free(cl);
-err_unpin:
-	i915_vma_unpin(vma);
-err_rollback:
-	timeline_rollback(tl);
-	return err;
 }
 
 int intel_timeline_get_seqno(struct intel_timeline *tl,
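The rewritten wraparound path above no longer swaps in a fresh cacheline; it steps tl->hwsp_offset to the next 8-byte slot within the same page, skipping any offset with bit 5 set for the MI_FLUSH_DW workaround. A standalone sketch of that slot arithmetic, with user-space stand-ins for the kernel macros:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE		4096u
#define TIMELINE_SEQNO_BYTES	8u
#define BIT(n)			(1u << (n))
#define offset_in_page(ofs)	((ofs) & (PAGE_SIZE - 1))

static uint32_t next_slot(uint32_t hwsp_offset)
{
	uint32_t next_ofs = offset_in_page(hwsp_offset + TIMELINE_SEQNO_BYTES);

	/* w/a: bit 5 must be zero for the MI_FLUSH_DW address */
	if (TIMELINE_SEQNO_BYTES <= BIT(5) && (next_ofs & BIT(5)))
		next_ofs = offset_in_page(next_ofs + BIT(5));

	return next_ofs;
}

int main(void)
{
	uint32_t ofs = 0;
	unsigned int slots = 1;

	/* walk the page: 4 usable slots per 64 bytes */
	while ((ofs = next_slot(ofs)) != 0)
		slots++;

	printf("%u slots per page\n", slots);	/* prints: 256 */
	return 0;
}

The walk visits 256 slots before returning to offset 0, which is the maximum a 4KiB page can hold under the bit-5 constraint.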
@@ -558,51 +302,52 @@  int intel_timeline_get_seqno(struct intel_timeline *tl,
 	*seqno = timeline_advance(tl);
 
 	/* Replace the HWSP on wraparound for HW semaphores */
-	if (unlikely(!*seqno && tl->hwsp_cacheline))
-		return __intel_timeline_get_seqno(tl, rq, seqno);
+	if (unlikely(!*seqno && tl->has_initial_breadcrumb))
+		return __intel_timeline_get_seqno(tl, seqno);
 
 	return 0;
 }
 
-static int cacheline_ref(struct intel_timeline_cacheline *cl,
-			 struct i915_request *rq)
-{
-	return i915_active_add_request(&cl->active, rq);
-}
-
 int intel_timeline_read_hwsp(struct i915_request *from,
 			     struct i915_request *to,
 			     u32 *hwsp)
 {
-	struct intel_timeline_cacheline *cl;
+	struct intel_timeline *tl;
 	int err;
 
-	GEM_BUG_ON(!rcu_access_pointer(from->hwsp_cacheline));
-
 	rcu_read_lock();
-	cl = rcu_dereference(from->hwsp_cacheline);
-	if (i915_request_signaled(from)) /* confirm cacheline is valid */
-		goto unlock;
-	if (unlikely(!i915_active_acquire_if_busy(&cl->active)))
-		goto unlock; /* seqno wrapped and completed! */
-	if (unlikely(__i915_request_is_complete(from)))
-		goto release;
+	tl = rcu_dereference(from->timeline);
+	if (i915_request_signaled(from) ||
+	    !i915_active_acquire_if_busy(&tl->active))
+		tl = NULL;
+
+	if (tl) {
+		/* hwsp_offset may wrap around, so use from->hwsp_seqno */
+		*hwsp = i915_ggtt_offset(tl->hwsp_ggtt) +
+			offset_in_page(from->hwsp_seqno);
+	}
+
+	/* recheck completion under the ref; if from is done, skip the wait */
+	if (tl && __i915_request_is_complete(from)) {
+		i915_active_release(&tl->active);
+		tl = NULL;
+	}
 	rcu_read_unlock();
 
-	err = cacheline_ref(cl, to);
-	if (err)
+	if (!tl)
+		return 1;
+
+	/* Can't do semaphore waits on kernel context */
+	if (!tl->has_initial_breadcrumb) {
+		err = -EINVAL;
 		goto out;
+	}
+
+	err = i915_active_add_request(&tl->active, to);
 
-	*hwsp = cl->ggtt_offset;
 out:
-	i915_active_release(&cl->active);
+	i915_active_release(&tl->active);
 	return err;
-
-release:
-	i915_active_release(&cl->active);
-unlock:
-	rcu_read_unlock();
-	return 1;
 }
 
 void intel_timeline_unpin(struct intel_timeline *tl)
@@ -611,8 +356,7 @@  void intel_timeline_unpin(struct intel_timeline *tl)
 	if (!atomic_dec_and_test(&tl->pin_count))
 		return;
 
-	cacheline_release(tl->hwsp_cacheline);
-
+	i915_active_release(&tl->active);
 	__i915_vma_unpin(tl->hwsp_ggtt);
 }
 
@@ -621,8 +365,11 @@  void __intel_timeline_free(struct kref *kref)
 	struct intel_timeline *timeline =
 		container_of(kref, typeof(*timeline), kref);
 
-	intel_timeline_fini(timeline);
-	kfree_rcu(timeline, rcu);
+	GEM_BUG_ON(atomic_read(&timeline->pin_count));
+	GEM_BUG_ON(!list_empty(&timeline->requests));
+	GEM_BUG_ON(timeline->retire);
+
+	call_rcu(&timeline->rcu, intel_timeline_fini);
 }
 
 void intel_gt_fini_timelines(struct intel_gt *gt)
@@ -630,7 +377,6 @@  void intel_gt_fini_timelines(struct intel_gt *gt)
 	struct intel_gt_timelines *timelines = &gt->timelines;
 
 	GEM_BUG_ON(!list_empty(&timelines->active_list));
-	GEM_BUG_ON(!list_empty(&timelines->hwsp_free_list));
 }
 
 void intel_gt_show_timelines(struct intel_gt *gt,
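The flow in intel_timeline_read_hwsp() above boils down to "take a reference only if the tracker is still busy, then re-check completion under that reference". A rough user-space model of that pattern, using C11 atomics as a stand-in for i915_active (the RCU side is ignored; this is not the real API):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct tracker {
	atomic_int count;	/* crude stand-in for i915_active */
};

static bool acquire_if_busy(struct tracker *t)
{
	int old = atomic_load(&t->count);

	/* only take a new reference if someone already holds one */
	while (old > 0) {
		if (atomic_compare_exchange_weak(&t->count, &old, old + 1))
			return true;
	}
	return false;
}

static void release(struct tracker *t)
{
	atomic_fetch_sub(&t->count, 1);
}

int main(void)
{
	struct tracker t = { 1 };	/* a request still holds a ref */
	bool from_complete = false;	/* would be __i915_request_is_complete() */

	if (acquire_if_busy(&t)) {
		if (from_complete)
			puts("raced with completion: nothing to wait on");
		else
			puts("timeline busy: safe to sample its HWSP");
		release(&t);
	}
	return 0;
}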
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline_types.h b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
index 9f677c9b7d06..74e67dbf89c5 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
@@ -17,7 +17,6 @@ 
 struct i915_vma;
 struct i915_syncmap;
 struct intel_gt;
-struct intel_timeline_hwsp;
 
 struct intel_timeline {
 	u64 fence_context;
@@ -44,12 +43,11 @@  struct intel_timeline {
 	atomic_t pin_count;
 	atomic_t active_count;
 
+	void *hwsp_map;
 	const u32 *hwsp_seqno;
 	struct i915_vma *hwsp_ggtt;
 	u32 hwsp_offset;
 
-	struct intel_timeline_cacheline *hwsp_cacheline;
-
 	bool has_initial_breadcrumb;
 
 	/**
@@ -66,6 +64,8 @@  struct intel_timeline {
 	 */
 	struct i915_active_fence last_request;
 
+	struct i915_active active;
+
 	/** A chain of completed timelines ready for early retirement. */
 	struct intel_timeline *retire;
 
@@ -89,15 +89,4 @@  struct intel_timeline {
 	struct rcu_head rcu;
 };
 
-struct intel_timeline_cacheline {
-	struct i915_active active;
-
-	struct intel_timeline_hwsp *hwsp;
-	void *vaddr;
-
-	u32 ggtt_offset;
-
-	struct rcu_head rcu;
-};
-
 #endif /* __I915_TIMELINE_TYPES_H__ */
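With struct intel_timeline_cacheline gone, the i915_active lives in the timeline itself: the first acquire pins hwsp_ggtt and takes a timeline reference, and the final retire drops both (__timeline_active()/__timeline_retire() earlier in this patch). A toy model of that first-acquire/last-retire bookkeeping, with plain counters standing in for the vma pin and the kref:

#include <stdio.h>

struct timeline {
	int active_count;	/* models the i915_active use count */
	int pin_count;		/* models the hwsp_ggtt vma pin */
	int refcount;		/* models the timeline kref */
};

static void timeline_acquire(struct timeline *tl)
{
	if (tl->active_count++ == 0) {
		/* __timeline_active() analogue: pin + get on first user */
		tl->pin_count++;
		tl->refcount++;
	}
}

static void timeline_release(struct timeline *tl)
{
	if (--tl->active_count == 0) {
		/* __timeline_retire() analogue: unpin + put on last user */
		tl->pin_count--;
		tl->refcount--;
	}
}

int main(void)
{
	struct timeline tl = { 0, 0, 1 };

	timeline_acquire(&tl);
	timeline_acquire(&tl);	/* nested acquire: no extra pin/ref */
	timeline_release(&tl);
	timeline_release(&tl);	/* last release drops the pin and ref */

	printf("pin=%d ref=%d\n", tl.pin_count, tl.refcount); /* pin=0 ref=1 */
	return 0;
}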
diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_cs.c b/drivers/gpu/drm/i915/gt/selftest_engine_cs.c
index 84d883de30ee..7e466ae114f8 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_cs.c
@@ -41,6 +41,9 @@  static int perf_end(struct intel_gt *gt)
 
 static int write_timestamp(struct i915_request *rq, int slot)
 {
+	struct intel_timeline *tl =
+		rcu_dereference_protected(rq->timeline,
+					  !i915_request_signaled(rq));
 	u32 cmd;
 	u32 *cs;
 
@@ -53,7 +56,7 @@  static int write_timestamp(struct i915_request *rq, int slot)
 		cmd++;
 	*cs++ = cmd;
 	*cs++ = i915_mmio_reg_offset(RING_TIMESTAMP(rq->engine->mmio_base));
-	*cs++ = i915_request_timeline(rq)->hwsp_offset + slot * sizeof(u32);
+	*cs++ = tl->hwsp_offset + slot * sizeof(u32);
 	*cs++ = 0;
 
 	intel_ring_advance(rq, cs);
diff --git a/drivers/gpu/drm/i915/gt/selftest_timeline.c b/drivers/gpu/drm/i915/gt/selftest_timeline.c
index d283dce5b4ac..a4c78062e92b 100644
--- a/drivers/gpu/drm/i915/gt/selftest_timeline.c
+++ b/drivers/gpu/drm/i915/gt/selftest_timeline.c
@@ -665,7 +665,7 @@  static int live_hwsp_wrap(void *arg)
 	if (IS_ERR(tl))
 		return PTR_ERR(tl);
 
-	if (!tl->has_initial_breadcrumb || !tl->hwsp_cacheline)
+	if (!tl->has_initial_breadcrumb)
 		goto out_free;
 
 	err = intel_timeline_pin(tl, NULL);
@@ -832,12 +832,26 @@  static int setup_watcher(struct hwsp_watcher *w, struct intel_gt *gt)
 	return 0;
 }
 
+static void switch_tl_lock(struct i915_request *from, struct i915_request *to)
+{
+	/* some light mutex juggling required; think co-routines */
+
+	if (from) {
+		lockdep_unpin_lock(&from->context->timeline->mutex, from->cookie);
+		mutex_unlock(&from->context->timeline->mutex);
+	}
+
+	if (to) {
+		mutex_lock(&to->context->timeline->mutex);
+		to->cookie = lockdep_pin_lock(&to->context->timeline->mutex);
+	}
+}
+
 static int create_watcher(struct hwsp_watcher *w,
 			  struct intel_engine_cs *engine,
 			  int ringsz)
 {
 	struct intel_context *ce;
-	struct intel_timeline *tl;
 
 	ce = intel_context_create(engine);
 	if (IS_ERR(ce))
@@ -850,11 +864,8 @@  static int create_watcher(struct hwsp_watcher *w,
 		return PTR_ERR(w->rq);
 
 	w->addr = i915_ggtt_offset(w->vma);
-	tl = w->rq->context->timeline;
 
-	/* some light mutex juggling required; think co-routines */
-	lockdep_unpin_lock(&tl->mutex, w->rq->cookie);
-	mutex_unlock(&tl->mutex);
+	switch_tl_lock(w->rq, NULL);
 
 	return 0;
 }
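The new switch_tl_lock() helper hands the timeline mutex from one request's context to the other, which is what create_watcher() now relies on above. A user-space analogue with pthread mutexes standing in for the timeline mutex (the lockdep pin/unpin bookkeeping is omitted):

#include <pthread.h>
#include <stdio.h>

struct req {
	pthread_mutex_t *tl_mutex;	/* stand-in for rq->context->timeline->mutex */
};

static void switch_tl_lock(struct req *from, struct req *to)
{
	if (from)
		pthread_mutex_unlock(from->tl_mutex);
	if (to)
		pthread_mutex_lock(to->tl_mutex);
}

int main(void)
{
	pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
	pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;
	struct req r1 = { &a }, r2 = { &b };

	pthread_mutex_lock(&a);		/* build r1 under its timeline lock */
	switch_tl_lock(&r1, &r2);	/* park r1, switch to r2 */
	switch_tl_lock(&r2, &r1);	/* and back again */
	pthread_mutex_unlock(&a);

	puts("timeline lock handed back and forth");
	return 0;
}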
@@ -863,15 +874,13 @@  static int check_watcher(struct hwsp_watcher *w, const char *name,
 			 bool (*op)(u32 hwsp, u32 seqno))
 {
 	struct i915_request *rq = fetch_and_zero(&w->rq);
-	struct intel_timeline *tl = rq->context->timeline;
 	u32 offset, end;
 	int err;
 
 	GEM_BUG_ON(w->addr - i915_ggtt_offset(w->vma) > w->vma->size);
 
 	i915_request_get(rq);
-	mutex_lock(&tl->mutex);
-	rq->cookie = lockdep_pin_lock(&tl->mutex);
+	switch_tl_lock(NULL, rq);
 	i915_request_add(rq);
 
 	if (i915_request_wait(rq, 0, HZ) < 0) {
@@ -900,10 +909,7 @@  static int check_watcher(struct hwsp_watcher *w, const char *name,
 static void cleanup_watcher(struct hwsp_watcher *w)
 {
 	if (w->rq) {
-		struct intel_timeline *tl = w->rq->context->timeline;
-
-		mutex_lock(&tl->mutex);
-		w->rq->cookie = lockdep_pin_lock(&tl->mutex);
+		switch_tl_lock(NULL, w->rq);
 
 		i915_request_add(w->rq);
 	}
@@ -941,7 +947,7 @@  static struct i915_request *wrap_timeline(struct i915_request *rq)
 	}
 
 	i915_request_put(rq);
-	rq = intel_context_create_request(ce);
+	rq = i915_request_create(ce);
 	if (IS_ERR(rq))
 		return rq;
 
@@ -976,7 +982,7 @@  static int live_hwsp_read(void *arg)
 	if (IS_ERR(tl))
 		return PTR_ERR(tl);
 
-	if (!tl->hwsp_cacheline)
+	if (!tl->has_initial_breadcrumb)
 		goto out_free;
 
 	for (i = 0; i < ARRAY_SIZE(watcher); i++) {
@@ -998,7 +1004,7 @@  static int live_hwsp_read(void *arg)
 		do {
 			struct i915_sw_fence *submit;
 			struct i915_request *rq;
-			u32 hwsp;
+			u32 hwsp, dummy;
 
 			submit = heap_fence_create(GFP_KERNEL);
 			if (!submit) {
@@ -1016,14 +1022,26 @@  static int live_hwsp_read(void *arg)
 				goto out;
 			}
 
-			/* Skip to the end, saving 30 minutes of nops */
-			tl->seqno = -10u + 2 * (count & 3);
-			WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
 			ce->timeline = intel_timeline_get(tl);
 
-			rq = intel_context_create_request(ce);
+			/* Ensure timeline is mapped, done during first pin */
+			err = intel_context_pin(ce);
+			if (err) {
+				intel_context_put(ce);
+				goto out;
+			}
+
+			/*
+			 * Start at a new wrap, and set the seqno right before another wrap,
+			 * saving 30 minutes of nops
+			 */
+			tl->seqno = -12u + 2 * (count & 3);
+			__intel_timeline_get_seqno(tl, &dummy);
+
+			rq = i915_request_create(ce);
 			if (IS_ERR(rq)) {
 				err = PTR_ERR(rq);
+				intel_context_unpin(ce);
 				intel_context_put(ce);
 				goto out;
 			}
@@ -1033,32 +1051,35 @@  static int live_hwsp_read(void *arg)
 							    GFP_KERNEL);
 			if (err < 0) {
 				i915_request_add(rq);
+				intel_context_unpin(ce);
 				intel_context_put(ce);
 				goto out;
 			}
 
-			mutex_lock(&watcher[0].rq->context->timeline->mutex);
+			switch_tl_lock(rq, watcher[0].rq);
 			err = intel_timeline_read_hwsp(rq, watcher[0].rq, &hwsp);
 			if (err == 0)
 				err = emit_read_hwsp(watcher[0].rq, /* before */
 						     rq->fence.seqno, hwsp,
 						     &watcher[0].addr);
-			mutex_unlock(&watcher[0].rq->context->timeline->mutex);
+			switch_tl_lock(watcher[0].rq, rq);
 			if (err) {
 				i915_request_add(rq);
+				intel_context_unpin(ce);
 				intel_context_put(ce);
 				goto out;
 			}
 
-			mutex_lock(&watcher[1].rq->context->timeline->mutex);
+			switch_tl_lock(rq, watcher[1].rq);
 			err = intel_timeline_read_hwsp(rq, watcher[1].rq, &hwsp);
 			if (err == 0)
 				err = emit_read_hwsp(watcher[1].rq, /* after */
 						     rq->fence.seqno, hwsp,
 						     &watcher[1].addr);
-			mutex_unlock(&watcher[1].rq->context->timeline->mutex);
+			switch_tl_lock(watcher[1].rq, rq);
 			if (err) {
 				i915_request_add(rq);
+				intel_context_unpin(ce);
 				intel_context_put(ce);
 				goto out;
 			}
@@ -1067,6 +1088,7 @@  static int live_hwsp_read(void *arg)
 			i915_request_add(rq);
 
 			rq = wrap_timeline(rq);
+			intel_context_unpin(ce);
 			intel_context_put(ce);
 			if (IS_ERR(rq)) {
 				err = PTR_ERR(rq);
@@ -1106,8 +1128,8 @@  static int live_hwsp_read(void *arg)
 			    3 * watcher[1].rq->ring->size)
 				break;
 
-		} while (!__igt_timeout(end_time, NULL));
-		WRITE_ONCE(*(u32 *)tl->hwsp_seqno, 0xdeadbeef);
+		} while (!__igt_timeout(end_time, NULL) &&
+			 count < (PAGE_SIZE / TIMELINE_SEQNO_BYTES - 1) / 2);
 
 		pr_info("%s: simulated %lu wraps\n", engine->name, count);
 		err = check_watcher(&watcher[1], "after", cmp_gte);
@@ -1152,9 +1174,7 @@  static int live_hwsp_rollover_kernel(void *arg)
 		}
 
 		GEM_BUG_ON(i915_active_fence_isset(&tl->last_request));
-		tl->seqno = 0;
-		timeline_rollback(tl);
-		timeline_rollback(tl);
+		tl->seqno = -2u;
 		WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
 
 		for (i = 0; i < ARRAY_SIZE(rq); i++) {
@@ -1234,11 +1254,10 @@  static int live_hwsp_rollover_user(void *arg)
 			goto out;
 
 		tl = ce->timeline;
-		if (!tl->has_initial_breadcrumb || !tl->hwsp_cacheline)
+		if (!tl->has_initial_breadcrumb)
 			goto out;
 
-		timeline_rollback(tl);
-		timeline_rollback(tl);
+		tl->seqno = -4u;
 		WRITE_ONCE(*(u32 *)tl->hwsp_seqno, tl->seqno);
 
 		for (i = 0; i < ARRAY_SIZE(rq); i++) {
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index e7b4c4bc41a6..59d942910558 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -794,7 +794,6 @@  __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	rq->fence.seqno = seqno;
 
 	RCU_INIT_POINTER(rq->timeline, tl);
-	RCU_INIT_POINTER(rq->hwsp_cacheline, tl->hwsp_cacheline);
 	rq->hwsp_seqno = tl->hwsp_seqno;
 	GEM_BUG_ON(__i915_request_is_complete(rq));
 
@@ -1039,9 +1038,6 @@  emit_semaphore_wait(struct i915_request *to,
 	if (i915_request_has_initial_breadcrumb(to))
 		goto await_fence;
 
-	if (!rcu_access_pointer(from->hwsp_cacheline))
-		goto await_fence;
-
 	/*
 	 * If this or its dependents are waiting on an external fence
 	 * that may fail catastrophically, then we want to avoid using
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index dd10a6db3d21..38062495b66f 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -239,16 +239,6 @@  struct i915_request {
 	 */
 	const u32 *hwsp_seqno;
 
-	/*
-	 * If we need to access the timeline's seqno for this request in
-	 * another request, we need to keep a read reference to this associated
-	 * cacheline, so that we do not free and recycle it before the foreign
-	 * observers have completed. Hence, we keep a pointer to the cacheline
-	 * inside the timeline's HWSP vma, but it is only valid while this
-	 * request has not completed and guarded by the timeline mutex.
-	 */
-	struct intel_timeline_cacheline __rcu *hwsp_cacheline;
-
 	/** Position in the ring of the start of the request */
 	u32 head;
 
@@ -650,4 +640,25 @@  static inline bool i915_request_use_semaphores(const struct i915_request *rq)
 	return intel_engine_has_semaphores(rq->engine);
 }
 
+static inline u32
+i915_request_active_seqno(const struct i915_request *rq)
+{
+	u32 hwsp_phys_base =
+		page_mask_bits(i915_request_active_timeline(rq)->hwsp_offset);
+	u32 hwsp_relative_offset = offset_in_page(rq->hwsp_seqno);
+
+	/*
+	 * Because of wraparound, we cannot simply take tl->hwsp_offset,
+	 * but instead use the fact that the in-page offset of the seqno
+	 * vaddr is the same as that of hwsp_offset. Take the page bits
+	 * from tl->hwsp_offset and combine them with the in-page offset
+	 * of rq->hwsp_seqno.
+	 *
+	 * As rq->hwsp_seqno is rewritten when the request is signaled,
+	 * this only works while the request is still in flight; once it
+	 * has signaled, the offset is no longer needed anyway.
+	 */
+
+	return hwsp_phys_base + hwsp_relative_offset;
+}
+
 #endif /* I915_REQUEST_H */
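A small sketch of the reconstruction that i915_request_active_seqno() performs, with user-space stand-ins for page_mask_bits() and offset_in_page(); the addresses are invented for illustration:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE		4096u
#define page_mask_bits(x)	((uint32_t)(x) & ~(PAGE_SIZE - 1))
#define offset_in_page(p)	((uint32_t)(uintptr_t)(p) & (PAGE_SIZE - 1))

int main(void)
{
	/* hypothetical GGTT offset of the timeline's current slot */
	uint32_t tl_hwsp_offset = 0x00ab3000 + 0x48;
	/* the CPU pointer the request sampled at allocation time */
	const void *rq_hwsp_seqno = (const void *)(uintptr_t)0xdead0008;

	/*
	 * page base from the timeline, in-page offset from the request:
	 * a later wrap moves tl->hwsp_offset to another slot but cannot
	 * change where this request's seqno lives within the page
	 */
	uint32_t seqno_addr = page_mask_bits(tl_hwsp_offset) +
			      offset_in_page(rq_hwsp_seqno);

	printf("seqno GGTT address: %#x\n", seqno_addr);	/* 0xab3008 */
	return 0;
}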