
[10/13] drm/msm: Support multiple ringbuffers

Message ID 1494275709-25782-11-git-send-email-jcrouse@codeaurora.org (mailing list archive)
State Not Applicable, archived
Delegated to: Andy Gross

Commit Message

Jordan Crouse May 8, 2017, 8:35 p.m. UTC
Add the infrastructure to support the idea of multiple ringbuffers.
Assign each ringbuffer an id and use that as an index for the various
ring specific operations.

The biggest delta is to support legacy fences. Each fence gets its own
sequence number but the legacy functions expect to use a unique integer.
To handle this we return a unique identifier for each submission but
map it to a specific ring/sequence under the covers. Newer users use
a dma_fence pointer anyway so they don't care about the actual sequence
ID or ring.
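
As a rough illustration of that mapping (sketch only; the names below are
made up, and the real patch hangs dma_fence-backed objects off a hashtable
rather than a fixed array):

#include <stdint.h>
#include <stddef.h>

/* what a legacy fence ID resolves to internally */
struct fence_entry {
        uint32_t fence_id;  /* value handed back to userspace */
        uint32_t ring_id;   /* ring the submit was queued on */
        uint32_t seqno;     /* per-ring sequence number */
};

static struct fence_entry entries[64];
static uint32_t next_fence_id;

/* submit time: allocate a global ID for a (ring, seqno) pair */
static struct fence_entry *fence_entry_new(uint32_t ring_id, uint32_t seqno)
{
        uint32_t id = ++next_fence_id;
        struct fence_entry *e = &entries[id % 64];

        e->fence_id = id;
        e->ring_id = ring_id;
        e->seqno = seqno;
        return e;
}

/* WAIT_FENCE time: map the global ID back to its ring/seqno */
static struct fence_entry *fence_entry_find(uint32_t fence_id)
{
        size_t i;

        for (i = 0; i < 64; i++)
                if (entries[i].fence_id == fence_id)
                        return &entries[i];
        return NULL;  /* already retired (or never issued) */
}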

The actual mechanics for multiple ringbuffers are very target specific
so this code just allows for the possibility but still only defines
one ringbuffer for each target family.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
---
 drivers/gpu/drm/msm/adreno/a3xx_gpu.c   |   9 +-
 drivers/gpu/drm/msm/adreno/a4xx_gpu.c   |   9 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c   |  45 +++++-----
 drivers/gpu/drm/msm/adreno/a5xx_gpu.h   |   2 +-
 drivers/gpu/drm/msm/adreno/a5xx_power.c |   6 +-
 drivers/gpu/drm/msm/adreno/adreno_gpu.c | 149 ++++++++++++++++++++------------
 drivers/gpu/drm/msm/adreno/adreno_gpu.h |  36 +++++---
 drivers/gpu/drm/msm/msm_drv.h           |   2 +
 drivers/gpu/drm/msm/msm_fence.c         |  92 +++++++++++++++-----
 drivers/gpu/drm/msm/msm_fence.h         |  15 ++--
 drivers/gpu/drm/msm/msm_gem.h           |   3 +-
 drivers/gpu/drm/msm/msm_gem_submit.c    |  10 ++-
 drivers/gpu/drm/msm/msm_gpu.c           | 138 ++++++++++++++++++++---------
 drivers/gpu/drm/msm/msm_gpu.h           |  30 ++++---
 drivers/gpu/drm/msm/msm_ringbuffer.c    |  19 ++--
 drivers/gpu/drm/msm/msm_ringbuffer.h    |   9 +-
 16 files changed, 381 insertions(+), 193 deletions(-)

Comments

Jordan Crouse May 25, 2017, 5:25 p.m. UTC | #1
On Mon, May 08, 2017 at 02:35:06PM -0600, Jordan Crouse wrote:
> -#define rbmemptr(adreno_gpu, member)  \
> +#define _sizeof(member) \
> +	sizeof(((struct adreno_rbmemptrs *) 0)->member[0])
> +
> +#define _base(adreno_gpu, member)  \
>  	((adreno_gpu)->memptrs_iova + offsetof(struct adreno_rbmemptrs, member))
>  
> +#define rbmemptr(adreno_gpu, index, member) \
> +	(_base((adreno_gpu), member) + ((index) * _sizeof(member)))
> +
>  struct adreno_rbmemptrs {
> -	volatile uint32_t rptr;
> -	volatile uint32_t fence;
> +	volatile uint32_t rptr[MSM_GPU_MAX_RINGS];
> +	volatile uint32_t fence[MSM_GPU_MAX_RINGS];
>  };

I'm looking for opinions on moving this to a per-ring buffer object. It would be
a lot simpler to understand but it would cost us a page per ring as opposed
to the 1 page we use now.

Looking ahead we are going to want to start using trace messages in conjunction
with tools like systrace:

(https://developer.android.com/studio/profile/systrace-commandline.html)

This will involve tracking the always-on counter value at start/retire for each
outstanding submit on each ring. I *think* we could fit those values into the
existing rbmemptrs buffer if we wanted to, but I can't imagine these would be the
last runtime statistics we would gather.
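
If we did try to squeeze them into the existing page, it would look
something like this (purely illustrative, the counter field names are
invented):

struct adreno_rbmemptrs {
        volatile uint32_t rptr[MSM_GPU_MAX_RINGS];
        volatile uint32_t fence[MSM_GPU_MAX_RINGS];
        /* hypothetical: always-on counter samples taken at submit/retire */
        volatile uint64_t ticks_start[MSM_GPU_MAX_RINGS];
        volatile uint64_t ticks_retire[MSM_GPU_MAX_RINGS];
};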

I guess I'm leaning toward the per-ring solution but I'll listen to anybody
argue that the memory usage isn't worth it.

Jordan
Rob Clark May 25, 2017, 5:37 p.m. UTC | #2
On Thu, May 25, 2017 at 1:25 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> On Mon, May 08, 2017 at 02:35:06PM -0600, Jordan Crouse wrote:
>> -#define rbmemptr(adreno_gpu, member)  \
>> +#define _sizeof(member) \
>> +     sizeof(((struct adreno_rbmemptrs *) 0)->member[0])
>> +
>> +#define _base(adreno_gpu, member)  \
>>       ((adreno_gpu)->memptrs_iova + offsetof(struct adreno_rbmemptrs, member))
>>
>> +#define rbmemptr(adreno_gpu, index, member) \
>> +     (_base((adreno_gpu), member) + ((index) * _sizeof(member)))
>> +
>>  struct adreno_rbmemptrs {
>> -     volatile uint32_t rptr;
>> -     volatile uint32_t fence;
>> +     volatile uint32_t rptr[MSM_GPU_MAX_RINGS];
>> +     volatile uint32_t fence[MSM_GPU_MAX_RINGS];
>>  };
>
> I'm looking for opinions on moving this to a per-ring buffer object. It would be
> a lot simpler to understand but it would cost us a page per ring as opposed
> to the 1 page we use now.

Well, I guess sub-allocation is an option.. we don't *have* to do a
page per memptrs struct just to have a separate struct per ring.

Just turn the single rbmemptrs allocation into a memptrs[MAX_RINGS]..
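
Something like this, roughly (untested sketch; the per-ring memptrs
pointer/iova fields don't exist yet and are only illustrative):

struct msm_rbmemptrs {
        volatile uint32_t rptr;
        volatile uint32_t fence;
};

/* one BO holds MSM_GPU_MAX_RINGS of these; slice it up per ring */
static void ring_assign_memptrs(struct msm_ringbuffer *ring,
                struct msm_rbmemptrs *vaddr, uint64_t iova)
{
        ring->memptrs = &vaddr[ring->id];
        ring->memptrs_iova = iova + ring->id * sizeof(*ring->memptrs);
}

Then rbmemptr() just becomes ring->memptrs_iova + offsetof(...).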

BR,
-R

> Looking ahead we are going to want to start using trace messages in conjunction
> with tools like systrace:
>
> (https://developer.android.com/studio/profile/systrace-commandline.html)
>
> This will involve tracking the always-on counter value at start/retire for each
> outstanding submit on each ring. I *think* we could fit those values into the
> existing rbmemptrs buffer if we wanted to, but I can't imagine these would be the
> last runtime statistics we would gather.
>
> I guess I'm leaning toward the per-ring solution but I'll listen to anybody
> argue that the memory usage isn't worth it.
>
> Jordan
>
Rob Clark May 28, 2017, 1:43 p.m. UTC | #3
On Mon, May 8, 2017 at 4:35 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> Add the infrastructure to support the idea of multiple ringbuffers.
> Assign each ringbuffer an id and use that as an index for the various
> ring specific operations.
>
> The biggest delta is to support legacy fences. Each fence gets its own
> sequence number but the legacy functions expect to use a unique integer.
> To handle this we return a unique identifier for each submission but
> map it to a specific ring/sequence under the covers. Newer users use
> a dma_fence pointer anyway so they don't care about the actual sequence
> ID or ring.

So, WAIT_FENCE is alive and well, and useful since it avoids the
overhead of creating a 'struct file', but it is only used within a
single pipe_context (or at least in situations where we know which ctx
the seqno fence applies to).  It seems like it would be simpler if we
just introduced a ctx-id in all the ioctls (SUBMIT and WAIT_FENCE)
that take a uint fence.  Then I don't think we need the hashtable
fanciness.
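
Ie. something like this on the kernel side (hypothetical sketch; the
ctx_id ioctl field is made up and msm_wait_fence() would have to grow a
ring argument):

        /* WAIT_FENCE with an explicit context/ring id */
        if (args->ctx_id >= gpu->nr_rings)
                return -EINVAL;

        ring = gpu->rb[args->ctx_id];

        /* seqno is per-ring, so no global-ID hashtable lookup is needed */
        return msm_wait_fence(gpu->fctx, ring, args->fence, &timeout, true);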

Also, one thing I was thinking of is that someday we might want to
make SUBMIT non-blocking when there is a dependency on a fence from a
different ring (i.e. queue it up but don't write cmds into the rb yet).
That means we'd need multiple fence timelines per priority-level rb,
which brings me back to wanting a CREATE_CTX type of ioctl (and I
guess DESTROY_CTX).  We could make these simple stubs for now, i.e.
CREATE_CTX just returns the priority level back, without any separate
"context" object on the kernel side yet.  This wouldn't change the
implementation much from what you have, but I think it gives us the
flexibility to later have multiple contexts at a given priority level
which don't block each other for submits that are still pending on
some fence, without another UABI change.
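
For the stub version I'm imagining something about this small (sketch
only; the ioctl handler and struct names are invented here):

struct drm_msm_ctx_create {
        __u32 prio;   /* in: requested priority level */
        __u32 id;     /* out: context id, == prio for the stub */
};

static int msm_ioctl_ctx_create(struct drm_device *dev, void *data,
                struct drm_file *file)
{
        struct msm_drm_private *priv = dev->dev_private;
        struct drm_msm_ctx_create *args = data;

        if (!priv->gpu)
                return -ENXIO;

        if (args->prio >= priv->gpu->nr_rings)
                return -EINVAL;

        /* no kernel-side context object yet, just echo the priority back */
        args->id = args->prio;
        return 0;
}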

BR,
-R

> The actual mechanics for multiple ringbuffers are very target specific
> so this code just allows for the possibility but still only defines
> one ringbuffer for each target family.
>
> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
> ---
>  drivers/gpu/drm/msm/adreno/a3xx_gpu.c   |   9 +-
>  drivers/gpu/drm/msm/adreno/a4xx_gpu.c   |   9 +-
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c   |  45 +++++-----
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.h   |   2 +-
>  drivers/gpu/drm/msm/adreno/a5xx_power.c |   6 +-
>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 149 ++++++++++++++++++++------------
>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  36 +++++---
>  drivers/gpu/drm/msm/msm_drv.h           |   2 +
>  drivers/gpu/drm/msm/msm_fence.c         |  92 +++++++++++++++-----
>  drivers/gpu/drm/msm/msm_fence.h         |  15 ++--
>  drivers/gpu/drm/msm/msm_gem.h           |   3 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c    |  10 ++-
>  drivers/gpu/drm/msm/msm_gpu.c           | 138 ++++++++++++++++++++---------
>  drivers/gpu/drm/msm/msm_gpu.h           |  30 ++++---
>  drivers/gpu/drm/msm/msm_ringbuffer.c    |  19 ++--
>  drivers/gpu/drm/msm/msm_ringbuffer.h    |   9 +-
>  16 files changed, 381 insertions(+), 193 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> index 0e3828ed..10d0234 100644
> --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> @@ -44,7 +44,7 @@
>
>  static bool a3xx_me_init(struct msm_gpu *gpu)
>  {
> -       struct msm_ringbuffer *ring = gpu->rb;
> +       struct msm_ringbuffer *ring = gpu->rb[0];
>
>         OUT_PKT3(ring, CP_ME_INIT, 17);
>         OUT_RING(ring, 0x000003f7);
> @@ -65,7 +65,7 @@ static bool a3xx_me_init(struct msm_gpu *gpu)
>         OUT_RING(ring, 0x00000000);
>         OUT_RING(ring, 0x00000000);
>
> -       gpu->funcs->flush(gpu);
> +       gpu->funcs->flush(gpu, ring);
>         return a3xx_idle(gpu);
>  }
>
> @@ -339,7 +339,7 @@ static void a3xx_destroy(struct msm_gpu *gpu)
>  static bool a3xx_idle(struct msm_gpu *gpu)
>  {
>         /* wait for ringbuffer to drain: */
> -       if (!adreno_idle(gpu))
> +       if (!adreno_idle(gpu, gpu->rb[0]))
>                 return false;
>
>         /* then wait for GPU to finish: */
> @@ -447,6 +447,7 @@ static void a3xx_dump(struct msm_gpu *gpu)
>                 .last_fence = adreno_last_fence,
>                 .submit = adreno_submit,
>                 .flush = adreno_flush,
> +               .active_ring = adreno_active_ring,
>                 .irq = a3xx_irq,
>                 .destroy = a3xx_destroy,
>  #ifdef CONFIG_DEBUG_FS
> @@ -494,7 +495,7 @@ struct msm_gpu *a3xx_gpu_init(struct drm_device *dev)
>         adreno_gpu->registers = a3xx_registers;
>         adreno_gpu->reg_offsets = a3xx_register_offsets;
>
> -       ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs);
> +       ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs, 1);
>         if (ret)
>                 goto fail;
>
> diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> index 19abf22..35fbf18 100644
> --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> @@ -116,7 +116,7 @@ static void a4xx_enable_hwcg(struct msm_gpu *gpu)
>
>  static bool a4xx_me_init(struct msm_gpu *gpu)
>  {
> -       struct msm_ringbuffer *ring = gpu->rb;
> +       struct msm_ringbuffer *ring = gpu->rb[0];
>
>         OUT_PKT3(ring, CP_ME_INIT, 17);
>         OUT_RING(ring, 0x000003f7);
> @@ -137,7 +137,7 @@ static bool a4xx_me_init(struct msm_gpu *gpu)
>         OUT_RING(ring, 0x00000000);
>         OUT_RING(ring, 0x00000000);
>
> -       gpu->funcs->flush(gpu);
> +       gpu->funcs->flush(gpu, ring);
>         return a4xx_idle(gpu);
>  }
>
> @@ -337,7 +337,7 @@ static void a4xx_destroy(struct msm_gpu *gpu)
>  static bool a4xx_idle(struct msm_gpu *gpu)
>  {
>         /* wait for ringbuffer to drain: */
> -       if (!adreno_idle(gpu))
> +       if (!adreno_idle(gpu, gpu->rb[0]))
>                 return false;
>
>         /* then wait for GPU to finish: */
> @@ -535,6 +535,7 @@ static int a4xx_get_timestamp(struct msm_gpu *gpu, uint64_t *value)
>                 .last_fence = adreno_last_fence,
>                 .submit = adreno_submit,
>                 .flush = adreno_flush,
> +               .active_ring = adreno_active_ring,
>                 .irq = a4xx_irq,
>                 .destroy = a4xx_destroy,
>  #ifdef CONFIG_DEBUG_FS
> @@ -576,7 +577,7 @@ struct msm_gpu *a4xx_gpu_init(struct drm_device *dev)
>         adreno_gpu->registers = a4xx_registers;
>         adreno_gpu->reg_offsets = a4xx_register_offsets;
>
> -       ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs);
> +       ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs, 1);
>         if (ret)
>                 goto fail;
>
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index fd54cc7..aaa941e 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -80,7 +80,7 @@ static void a5xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>         struct msm_drm_private *priv = gpu->dev->dev_private;
> -       struct msm_ringbuffer *ring = gpu->rb;
> +       struct msm_ringbuffer *ring = submit->ring;
>         unsigned int i, ibs = 0;
>
>         for (i = 0; i < submit->nr_cmds; i++) {
> @@ -105,11 +105,11 @@ static void a5xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>
>         OUT_PKT7(ring, CP_EVENT_WRITE, 4);
>         OUT_RING(ring, CACHE_FLUSH_TS | (1 << 31));
> -       OUT_RING(ring, lower_32_bits(rbmemptr(adreno_gpu, fence)));
> -       OUT_RING(ring, upper_32_bits(rbmemptr(adreno_gpu, fence)));
> +       OUT_RING(ring, lower_32_bits(rbmemptr(adreno_gpu, ring->id, fence)));
> +       OUT_RING(ring, upper_32_bits(rbmemptr(adreno_gpu, ring->id, fence)));
>         OUT_RING(ring, submit->fence->seqno);
>
> -       gpu->funcs->flush(gpu);
> +       gpu->funcs->flush(gpu, ring);
>  }
>
>  struct a5xx_hwcg {
> @@ -249,7 +249,7 @@ static void a5xx_enable_hwcg(struct msm_gpu *gpu)
>  static int a5xx_me_init(struct msm_gpu *gpu)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> -       struct msm_ringbuffer *ring = gpu->rb;
> +       struct msm_ringbuffer *ring = gpu->rb[0];
>
>         OUT_PKT7(ring, CP_ME_INIT, 8);
>
> @@ -280,9 +280,8 @@ static int a5xx_me_init(struct msm_gpu *gpu)
>         OUT_RING(ring, 0x00000000);
>         OUT_RING(ring, 0x00000000);
>
> -       gpu->funcs->flush(gpu);
> -
> -       return a5xx_idle(gpu) ? 0 : -EINVAL;
> +       gpu->funcs->flush(gpu, ring);
> +       return a5xx_idle(gpu, ring) ? 0 : -EINVAL;
>  }
>
>  static struct drm_gem_object *a5xx_ucode_load_bo(struct msm_gpu *gpu,
> @@ -628,11 +627,11 @@ static int a5xx_hw_init(struct msm_gpu *gpu)
>          * ticking correctly
>          */
>         if (adreno_is_a530(adreno_gpu)) {
> -               OUT_PKT7(gpu->rb, CP_EVENT_WRITE, 1);
> -               OUT_RING(gpu->rb, 0x0F);
> +               OUT_PKT7(gpu->rb[0], CP_EVENT_WRITE, 1);
> +               OUT_RING(gpu->rb[0], 0x0F);
>
> -               gpu->funcs->flush(gpu);
> -               if (!a5xx_idle(gpu))
> +               gpu->funcs->flush(gpu, gpu->rb[0]);
> +               if (!a5xx_idle(gpu, gpu->rb[0]))
>                         return -EINVAL;
>         }
>
> @@ -645,11 +644,11 @@ static int a5xx_hw_init(struct msm_gpu *gpu)
>          */
>         ret = a5xx_zap_shader_init(gpu);
>         if (!ret) {
> -               OUT_PKT7(gpu->rb, CP_SET_SECURE_MODE, 1);
> -               OUT_RING(gpu->rb, 0x00000000);
> +               OUT_PKT7(gpu->rb[0], CP_SET_SECURE_MODE, 1);
> +               OUT_RING(gpu->rb[0], 0x00000000);
>
> -               gpu->funcs->flush(gpu);
> -               if (!a5xx_idle(gpu))
> +               gpu->funcs->flush(gpu, gpu->rb[0]);
> +               if (!a5xx_idle(gpu, gpu->rb[0]))
>                         return -EINVAL;
>         } else {
>                 /* Print a warning so if we die, we know why */
> @@ -726,16 +725,19 @@ static inline bool _a5xx_check_idle(struct msm_gpu *gpu)
>                 A5XX_RBBM_INT_0_MASK_MISC_HANG_DETECT);
>  }
>
> -bool a5xx_idle(struct msm_gpu *gpu)
> +bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
>  {
>         /* wait for CP to drain ringbuffer: */
> -       if (!adreno_idle(gpu))
> +       if (!adreno_idle(gpu, ring))
>                 return false;
>
>         if (spin_until(_a5xx_check_idle(gpu))) {
> -               DRM_ERROR("%s: %ps: timeout waiting for GPU to idle: status %8.8X irq %8.8X\n",
> -                       gpu->name, __builtin_return_address(0),
> +               DRM_DEV_ERROR(gpu->dev->dev,
> +                       "timeout waiting for GPU RB %d to idle: status %8.8X rptr/wptr: %4.4X/%4.4X irq %8.8X\n",
> +                       ring->id,
>                         gpu_read(gpu, REG_A5XX_RBBM_STATUS),
> +                       gpu_read(gpu, REG_A5XX_CP_RB_RPTR),
> +                       gpu_read(gpu, REG_A5XX_CP_RB_WPTR),
>                         gpu_read(gpu, REG_A5XX_RBBM_INT_0_STATUS));
>
>                 return false;
> @@ -1031,6 +1033,7 @@ static void a5xx_show(struct msm_gpu *gpu, struct seq_file *m)
>                 .last_fence = adreno_last_fence,
>                 .submit = a5xx_submit,
>                 .flush = adreno_flush,
> +               .active_ring = adreno_active_ring,
>                 .irq = a5xx_irq,
>                 .destroy = a5xx_destroy,
>  #ifdef CONFIG_DEBUG_FS
> @@ -1067,7 +1070,7 @@ struct msm_gpu *a5xx_gpu_init(struct drm_device *dev)
>
>         a5xx_gpu->lm_leakage = 0x4E001A;
>
> -       ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs);
> +       ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs, 1);
>         if (ret) {
>                 a5xx_destroy(&(a5xx_gpu->base.base));
>                 return ERR_PTR(ret);
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
> index 6638bc8..aba6faf 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
> @@ -58,6 +58,6 @@ static inline int spin_usecs(struct msm_gpu *gpu, uint32_t usecs,
>         return -ETIMEDOUT;
>  }
>
> -bool a5xx_idle(struct msm_gpu *gpu);
> +bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
>
>  #endif /* __A5XX_GPU_H__ */
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_power.c b/drivers/gpu/drm/msm/adreno/a5xx_power.c
> index 2fdee44..a7d91ac 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_power.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_power.c
> @@ -173,7 +173,7 @@ static int a5xx_gpmu_init(struct msm_gpu *gpu)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>         struct a5xx_gpu *a5xx_gpu = to_a5xx_gpu(adreno_gpu);
> -       struct msm_ringbuffer *ring = gpu->rb;
> +       struct msm_ringbuffer *ring = gpu->rb[0];
>
>         if (!a5xx_gpu->gpmu_dwords)
>                 return 0;
> @@ -192,9 +192,9 @@ static int a5xx_gpmu_init(struct msm_gpu *gpu)
>         OUT_PKT7(ring, CP_SET_PROTECTED_MODE, 1);
>         OUT_RING(ring, 1);
>
> -       gpu->funcs->flush(gpu);
> +       gpu->funcs->flush(gpu, ring);
>
> -       if (!a5xx_idle(gpu)) {
> +       if (!a5xx_idle(gpu, ring)) {
>                 DRM_ERROR("%s: Unable to load GPMU firmware. GPMU will not be active\n",
>                         gpu->name);
>                 return -EINVAL;
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index 4a24506..6b7114d 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -21,7 +21,6 @@
>  #include "msm_gem.h"
>  #include "msm_mmu.h"
>
> -#define RB_SIZE    SZ_32K
>  #define RB_BLKSIZE 32
>
>  int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> @@ -60,39 +59,47 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
>  int adreno_hw_init(struct msm_gpu *gpu)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> -       int ret;
> +       int i;
>
>         DBG("%s", gpu->name);
>
> -       ret = msm_gem_get_iova(gpu->rb->bo, gpu->aspace, &gpu->rb_iova);
> -       if (ret) {
> -               gpu->rb_iova = 0;
> -               dev_err(gpu->dev->dev, "could not map ringbuffer: %d\n", ret);
> -               return ret;
> -       }
> +       for (i = 0; i < gpu->nr_rings; i++) {
> +               struct msm_ringbuffer *ring = gpu->rb[i];
> +               int ret;
>
> -       /* reset ringbuffer: */
> -       gpu->rb->cur = gpu->rb->start;
> +               if (!ring)
> +                       continue;
>
> -       /* reset completed fence seqno: */
> -       adreno_gpu->memptrs->fence = gpu->fctx->completed_fence;
> -       adreno_gpu->memptrs->rptr  = 0;
> +               ret = msm_gem_get_iova(ring->bo, gpu->aspace, &ring->iova);
> +               if (ret) {
> +                       ring->iova = 0;
> +                       dev_err(gpu->dev->dev,
> +                               "could not map ringbuffer %d: %d\n", i, ret);
> +                       return ret;
> +               }
> +
> +               ring->cur = ring->start;
> +
> +               /* reset completed fence seqno: */
> +               adreno_gpu->memptrs->fence[ring->id] = ring->completed_fence;
> +               adreno_gpu->memptrs->rptr[ring->id]  = 0;
> +       }
>
>         /* Setup REG_CP_RB_CNTL: */
>         adreno_gpu_write(adreno_gpu, REG_ADRENO_CP_RB_CNTL,
> -                       /* size is log2(quad-words): */
> -                       AXXX_CP_RB_CNTL_BUFSZ(ilog2(gpu->rb->size / 8)) |
> -                       AXXX_CP_RB_CNTL_BLKSZ(ilog2(RB_BLKSIZE / 8)) |
> -                       (adreno_is_a430(adreno_gpu) ? AXXX_CP_RB_CNTL_NO_UPDATE : 0));
> +               /* size is log2(quad-words): */
> +               AXXX_CP_RB_CNTL_BUFSZ(ilog2(MSM_GPU_RINGBUFFER_SZ / 8)) |
> +               AXXX_CP_RB_CNTL_BLKSZ(ilog2(RB_BLKSIZE / 8)) |
> +               (adreno_is_a430(adreno_gpu) ? AXXX_CP_RB_CNTL_NO_UPDATE : 0));
>
> -       /* Setup ringbuffer address: */
> +       /* Setup ringbuffer address - use ringbuffer[0] for GPU init */
>         adreno_gpu_write64(adreno_gpu, REG_ADRENO_CP_RB_BASE,
> -               REG_ADRENO_CP_RB_BASE_HI, gpu->rb_iova);
> +               REG_ADRENO_CP_RB_BASE_HI, gpu->rb[0]->iova);
>
>         if (!adreno_is_a430(adreno_gpu)) {
>                 adreno_gpu_write64(adreno_gpu, REG_ADRENO_CP_RB_RPTR_ADDR,
>                         REG_ADRENO_CP_RB_RPTR_ADDR_HI,
> -                       rbmemptr(adreno_gpu, rptr));
> +                       rbmemptr(adreno_gpu, 0, rptr));
>         }
>
>         return 0;
> @@ -104,19 +111,35 @@ static uint32_t get_wptr(struct msm_ringbuffer *ring)
>  }
>
>  /* Use this helper to read rptr, since a430 doesn't update rptr in memory */
> -static uint32_t get_rptr(struct adreno_gpu *adreno_gpu)
> +static uint32_t get_rptr(struct adreno_gpu *adreno_gpu,
> +               struct msm_ringbuffer *ring)
>  {
> -       if (adreno_is_a430(adreno_gpu))
> -               return adreno_gpu->memptrs->rptr = adreno_gpu_read(
> +       if (adreno_is_a430(adreno_gpu)) {
> +               /*
> +                * If index is anything but 0 this will probably break horribly,
> +                * but I think that we have enough infrastructure in place to
> +                * ensure that it won't be. If not then this is why your
> +                * a430 stopped working.
> +                */
> +               return adreno_gpu->memptrs->rptr[ring->id] = adreno_gpu_read(
>                         adreno_gpu, REG_ADRENO_CP_RB_RPTR);
> -       else
> -               return adreno_gpu->memptrs->rptr;
> +       } else
> +               return adreno_gpu->memptrs->rptr[ring->id];
>  }
>
> -uint32_t adreno_last_fence(struct msm_gpu *gpu)
> +struct msm_ringbuffer *adreno_active_ring(struct msm_gpu *gpu)
> +{
> +       return gpu->rb[0];
> +}
> +
> +uint32_t adreno_last_fence(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> -       return adreno_gpu->memptrs->fence;
> +
> +       if (!ring)
> +               return 0;
> +
> +       return adreno_gpu->memptrs->fence[ring->id];
>  }
>
>  void adreno_recover(struct msm_gpu *gpu)
> @@ -142,7 +165,7 @@ void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>         struct msm_drm_private *priv = gpu->dev->dev_private;
> -       struct msm_ringbuffer *ring = gpu->rb;
> +       struct msm_ringbuffer *ring = submit->ring;
>         unsigned i;
>
>         for (i = 0; i < submit->nr_cmds; i++) {
> @@ -181,7 +204,7 @@ void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>
>         OUT_PKT3(ring, CP_EVENT_WRITE, 3);
>         OUT_RING(ring, CACHE_FLUSH_TS);
> -       OUT_RING(ring, rbmemptr(adreno_gpu, fence));
> +       OUT_RING(ring, rbmemptr(adreno_gpu, ring->id, fence));
>         OUT_RING(ring, submit->fence->seqno);
>
>         /* we could maybe be clever and only CP_COND_EXEC the interrupt: */
> @@ -208,10 +231,10 @@ void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>         }
>  #endif
>
> -       gpu->funcs->flush(gpu);
> +       gpu->funcs->flush(gpu, ring);
>  }
>
> -void adreno_flush(struct msm_gpu *gpu)
> +void adreno_flush(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>         uint32_t wptr;
> @@ -221,7 +244,7 @@ void adreno_flush(struct msm_gpu *gpu)
>          * to account for the possibility that the last command fit exactly into
>          * the ringbuffer and rb->next hasn't wrapped to zero yet
>          */
> -       wptr = get_wptr(gpu->rb) & ((gpu->rb->size / 4) - 1);
> +       wptr = get_wptr(ring) % (MSM_GPU_RINGBUFFER_SZ >> 2);
>
>         /* ensure writes to ringbuffer have hit system memory: */
>         mb();
> @@ -229,17 +252,18 @@ void adreno_flush(struct msm_gpu *gpu)
>         adreno_gpu_write(adreno_gpu, REG_ADRENO_CP_RB_WPTR, wptr);
>  }
>
> -bool adreno_idle(struct msm_gpu *gpu)
> +bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> -       uint32_t wptr = get_wptr(gpu->rb);
> +       uint32_t wptr = get_wptr(ring);
>
>         /* wait for CP to drain ringbuffer: */
> -       if (!spin_until(get_rptr(adreno_gpu) == wptr))
> +       if (!spin_until(get_rptr(adreno_gpu, ring) == wptr))
>                 return true;
>
>         /* TODO maybe we need to reset GPU here to recover from hang? */
> -       DRM_ERROR("%s: timeout waiting to drain ringbuffer!\n", gpu->name);
> +       DRM_ERROR("%s: timeout waiting to drain ringbuffer %d!\n", gpu->name,
> +               ring->id);
>         return false;
>  }
>
> @@ -254,10 +278,17 @@ void adreno_show(struct msm_gpu *gpu, struct seq_file *m)
>                         adreno_gpu->rev.major, adreno_gpu->rev.minor,
>                         adreno_gpu->rev.patchid);
>
> -       seq_printf(m, "fence:    %d/%d\n", adreno_gpu->memptrs->fence,
> -                       gpu->fctx->last_fence);
> -       seq_printf(m, "rptr:     %d\n", get_rptr(adreno_gpu));
> -       seq_printf(m, "rb wptr:  %d\n", get_wptr(gpu->rb));
> +       for (i = 0; i < gpu->nr_rings; i++) {
> +               struct msm_ringbuffer *ring = gpu->rb[i];
> +
> +               seq_printf(m, "rb %d: fence:    %d/%d\n", i,
> +                       adreno_last_fence(gpu, ring),
> +                       ring->completed_fence);
> +
> +               seq_printf(m, "      rptr:     %d\n",
> +                       get_rptr(adreno_gpu, ring));
> +               seq_printf(m, "rb wptr:  %d\n", get_wptr(ring));
> +       }
>
>         /* dump these out in a form that can be parsed by demsm: */
>         seq_printf(m, "IO:region %s 00000000 00020000\n", gpu->name);
> @@ -283,16 +314,23 @@ void adreno_show(struct msm_gpu *gpu, struct seq_file *m)
>  void adreno_dump_info(struct msm_gpu *gpu)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> +       int i;
>
>         printk("revision: %d (%d.%d.%d.%d)\n",
>                         adreno_gpu->info->revn, adreno_gpu->rev.core,
>                         adreno_gpu->rev.major, adreno_gpu->rev.minor,
>                         adreno_gpu->rev.patchid);
>
> -       printk("fence:    %d/%d\n", adreno_gpu->memptrs->fence,
> -                       gpu->fctx->last_fence);
> -       printk("rptr:     %d\n", get_rptr(adreno_gpu));
> -       printk("rb wptr:  %d\n", get_wptr(gpu->rb));
> +       for (i = 0; i < gpu->nr_rings; i++) {
> +               struct msm_ringbuffer *ring = gpu->rb[i];
> +
> +               printk("rb %d: fence:    %d/%d\n", i,
> +                       adreno_last_fence(gpu, ring),
> +                       ring->completed_fence);
> +
> +               printk("rptr:     %d\n", get_rptr(adreno_gpu, ring));
> +               printk("rb wptr:  %d\n", get_wptr(ring));
> +       }
>  }
>
>  /* would be nice to not have to duplicate the _show() stuff with printk(): */
> @@ -315,19 +353,21 @@ void adreno_dump(struct msm_gpu *gpu)
>         }
>  }
>
> -static uint32_t ring_freewords(struct msm_gpu *gpu)
> +static uint32_t ring_freewords(struct msm_ringbuffer *ring)
>  {
> -       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> -       uint32_t size = gpu->rb->size / 4;
> -       uint32_t wptr = get_wptr(gpu->rb);
> -       uint32_t rptr = get_rptr(adreno_gpu);
> +       struct adreno_gpu *adreno_gpu = to_adreno_gpu(ring->gpu);
> +       uint32_t size = MSM_GPU_RINGBUFFER_SZ >> 2;
> +       uint32_t wptr = get_wptr(ring);
> +       uint32_t rptr = get_rptr(adreno_gpu, ring);
>         return (rptr + (size - 1) - wptr) % size;
>  }
>
> -void adreno_wait_ring(struct msm_gpu *gpu, uint32_t ndwords)
> +void adreno_wait_ring(struct msm_ringbuffer *ring, uint32_t ndwords)
>  {
> -       if (spin_until(ring_freewords(gpu) >= ndwords))
> -               DRM_ERROR("%s: timeout waiting for ringbuffer space\n", gpu->name);
> +       if (spin_until(ring_freewords(ring) >= ndwords))
> +               DRM_DEV_ERROR(ring->gpu->dev->dev,
> +                       "timeout waiting for space in ringbuffer %d\n",
> +                       ring->id);
>  }
>
>  static const char *iommu_ports[] = {
> @@ -336,7 +376,8 @@ void adreno_wait_ring(struct msm_gpu *gpu, uint32_t ndwords)
>  };
>
>  int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> -               struct adreno_gpu *adreno_gpu, const struct adreno_gpu_funcs *funcs)
> +               struct adreno_gpu *adreno_gpu,
> +               const struct adreno_gpu_funcs *funcs, int nr_rings)
>  {
>         struct adreno_platform_config *config = pdev->dev.platform_data;
>         struct msm_gpu_config adreno_gpu_config  = { 0 };
> @@ -364,7 +405,7 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>         adreno_gpu_config.va_start = SZ_16M;
>         adreno_gpu_config.va_end = 0xffffffff;
>
> -       adreno_gpu_config.ringsz = RB_SIZE;
> +       adreno_gpu_config.nr_rings = nr_rings;
>
>         ret = msm_gpu_init(drm, pdev, &adreno_gpu->base, &funcs->base,
>                         adreno_gpu->info->name, &adreno_gpu_config);
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> index 4d9165f..9e78e49 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
> @@ -82,12 +82,18 @@ struct adreno_info {
>
>  const struct adreno_info *adreno_info(struct adreno_rev rev);
>
> -#define rbmemptr(adreno_gpu, member)  \
> +#define _sizeof(member) \
> +       sizeof(((struct adreno_rbmemptrs *) 0)->member[0])
> +
> +#define _base(adreno_gpu, member)  \
>         ((adreno_gpu)->memptrs_iova + offsetof(struct adreno_rbmemptrs, member))
>
> +#define rbmemptr(adreno_gpu, index, member) \
> +       (_base((adreno_gpu), member) + ((index) * _sizeof(member)))
> +
>  struct adreno_rbmemptrs {
> -       volatile uint32_t rptr;
> -       volatile uint32_t fence;
> +       volatile uint32_t rptr[MSM_GPU_MAX_RINGS];
> +       volatile uint32_t fence[MSM_GPU_MAX_RINGS];
>  };
>
>  struct adreno_gpu {
> @@ -197,21 +203,25 @@ static inline int adreno_is_a530(struct adreno_gpu *gpu)
>
>  int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value);
>  int adreno_hw_init(struct msm_gpu *gpu);
> -uint32_t adreno_last_fence(struct msm_gpu *gpu);
> +uint32_t adreno_last_fence(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
> +uint32_t adreno_submitted_fence(struct msm_gpu *gpu,
> +               struct msm_ringbuffer *ring);
>  void adreno_recover(struct msm_gpu *gpu);
>  void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>                 struct msm_file_private *ctx);
> -void adreno_flush(struct msm_gpu *gpu);
> -bool adreno_idle(struct msm_gpu *gpu);
> +void adreno_flush(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
> +bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
>  #ifdef CONFIG_DEBUG_FS
>  void adreno_show(struct msm_gpu *gpu, struct seq_file *m);
>  #endif
>  void adreno_dump_info(struct msm_gpu *gpu);
>  void adreno_dump(struct msm_gpu *gpu);
> -void adreno_wait_ring(struct msm_gpu *gpu, uint32_t ndwords);
> +void adreno_wait_ring(struct msm_ringbuffer *ring, uint32_t ndwords);
> +struct msm_ringbuffer *adreno_active_ring(struct msm_gpu *gpu);
>
>  int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> -               struct adreno_gpu *gpu, const struct adreno_gpu_funcs *funcs);
> +               struct adreno_gpu *gpu, const struct adreno_gpu_funcs *funcs,
> +               int nr_rings);
>  void adreno_gpu_cleanup(struct adreno_gpu *gpu);
>
>
> @@ -220,7 +230,7 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>  static inline void
>  OUT_PKT0(struct msm_ringbuffer *ring, uint16_t regindx, uint16_t cnt)
>  {
> -       adreno_wait_ring(ring->gpu, cnt+1);
> +       adreno_wait_ring(ring, cnt+1);
>         OUT_RING(ring, CP_TYPE0_PKT | ((cnt-1) << 16) | (regindx & 0x7FFF));
>  }
>
> @@ -228,14 +238,14 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>  static inline void
>  OUT_PKT2(struct msm_ringbuffer *ring)
>  {
> -       adreno_wait_ring(ring->gpu, 1);
> +       adreno_wait_ring(ring, 1);
>         OUT_RING(ring, CP_TYPE2_PKT);
>  }
>
>  static inline void
>  OUT_PKT3(struct msm_ringbuffer *ring, uint8_t opcode, uint16_t cnt)
>  {
> -       adreno_wait_ring(ring->gpu, cnt+1);
> +       adreno_wait_ring(ring, cnt+1);
>         OUT_RING(ring, CP_TYPE3_PKT | ((cnt-1) << 16) | ((opcode & 0xFF) << 8));
>  }
>
> @@ -257,14 +267,14 @@ static inline u32 PM4_PARITY(u32 val)
>  static inline void
>  OUT_PKT4(struct msm_ringbuffer *ring, uint16_t regindx, uint16_t cnt)
>  {
> -       adreno_wait_ring(ring->gpu, cnt + 1);
> +       adreno_wait_ring(ring, cnt + 1);
>         OUT_RING(ring, PKT4(regindx, cnt));
>  }
>
>  static inline void
>  OUT_PKT7(struct msm_ringbuffer *ring, uint8_t opcode, uint16_t cnt)
>  {
> -       adreno_wait_ring(ring->gpu, cnt + 1);
> +       adreno_wait_ring(ring, cnt + 1);
>         OUT_RING(ring, CP_TYPE7_PKT | (cnt << 0) | (PM4_PARITY(cnt) << 15) |
>                 ((opcode & 0x7F) << 16) | (PM4_PARITY(opcode) << 23));
>  }
> diff --git a/drivers/gpu/drm/msm/msm_drv.h b/drivers/gpu/drm/msm/msm_drv.h
> index 192147c..bbf7d3d 100644
> --- a/drivers/gpu/drm/msm/msm_drv.h
> +++ b/drivers/gpu/drm/msm/msm_drv.h
> @@ -76,6 +76,8 @@ struct msm_vblank_ctrl {
>         spinlock_t lock;
>  };
>
> +#define MSM_GPU_MAX_RINGS 1
> +
>  struct msm_drm_private {
>
>         struct drm_device *dev;
> diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
> index 3f299c5..8cf029f 100644
> --- a/drivers/gpu/drm/msm/msm_fence.c
> +++ b/drivers/gpu/drm/msm/msm_fence.c
> @@ -20,7 +20,6 @@
>  #include "msm_drv.h"
>  #include "msm_fence.h"
>
> -
>  struct msm_fence_context *
>  msm_fence_context_alloc(struct drm_device *dev, const char *name)
>  {
> @@ -32,9 +31,10 @@ struct msm_fence_context *
>
>         fctx->dev = dev;
>         fctx->name = name;
> -       fctx->context = dma_fence_context_alloc(1);
> +       fctx->context = dma_fence_context_alloc(MSM_GPU_MAX_RINGS);
>         init_waitqueue_head(&fctx->event);
>         spin_lock_init(&fctx->spinlock);
> +       hash_init(fctx->hash);
>
>         return fctx;
>  }
> @@ -44,64 +44,94 @@ void msm_fence_context_free(struct msm_fence_context *fctx)
>         kfree(fctx);
>  }
>
> -static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fence)
> +static inline bool fence_completed(struct msm_ringbuffer *ring, uint32_t fence)
> +{
> +       return (int32_t)(ring->completed_fence - fence) >= 0;
> +}
> +
> +struct msm_fence {
> +       struct msm_fence_context *fctx;
> +       struct msm_ringbuffer *ring;
> +       struct dma_fence base;
> +       struct hlist_node node;
> +       u32 fence_id;
> +};
> +
> +static struct msm_fence *fence_from_id(struct msm_fence_context *fctx,
> +               uint32_t id)
>  {
> -       return (int32_t)(fctx->completed_fence - fence) >= 0;
> +       struct msm_fence *f;
> +
> +       hash_for_each_possible_rcu(fctx->hash, f, node, id) {
> +               if (f->fence_id == id) {
> +                       if (dma_fence_get_rcu(&f->base))
> +                               return f;
> +               }
> +       }
> +
> +       return NULL;
>  }
>
>  /* legacy path for WAIT_FENCE ioctl: */
>  int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
>                 ktime_t *timeout, bool interruptible)
>  {
> +       struct msm_fence *f = fence_from_id(fctx, fence);
>         int ret;
>
> -       if (fence > fctx->last_fence) {
> -               DRM_ERROR("%s: waiting on invalid fence: %u (of %u)\n",
> -                               fctx->name, fence, fctx->last_fence);
> -               return -EINVAL;
> +       /* If no active fence was found, there are two possibilities */
> +       if (!f) {
> +               /* The requested ID is newer than last issued - return error */
> +               if (fence > fctx->fence_id) {
> +                       DRM_ERROR("%s: waiting on invalid fence: %u (of %u)\n",
> +                               fctx->name, fence, fctx->fence_id);
> +                       return -EINVAL;
> +               }
> +
> +               /* If the id has been issued assume fence has been retired */
> +               return 0;
>         }
>
>         if (!timeout) {
>                 /* no-wait: */
> -               ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
> +               ret = fence_completed(f->ring, f->base.seqno) ? 0 : -EBUSY;
>         } else {
>                 unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
>
>                 if (interruptible)
>                         ret = wait_event_interruptible_timeout(fctx->event,
> -                               fence_completed(fctx, fence),
> +                               fence_completed(f->ring, f->base.seqno),
>                                 remaining_jiffies);
>                 else
>                         ret = wait_event_timeout(fctx->event,
> -                               fence_completed(fctx, fence),
> +                               fence_completed(f->ring, f->base.seqno),
>                                 remaining_jiffies);
>
>                 if (ret == 0) {
>                         DBG("timeout waiting for fence: %u (completed: %u)",
> -                                       fence, fctx->completed_fence);
> +                               f->base.seqno, f->ring->completed_fence);
>                         ret = -ETIMEDOUT;
>                 } else if (ret != -ERESTARTSYS) {
>                         ret = 0;
>                 }
>         }
>
> +       dma_fence_put(&f->base);
> +
>         return ret;
>  }
>
>  /* called from workqueue */
> -void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
> +void msm_update_fence(struct msm_fence_context *fctx,
> +               struct msm_ringbuffer *ring, uint32_t fence)
>  {
>         spin_lock(&fctx->spinlock);
> -       fctx->completed_fence = max(fence, fctx->completed_fence);
> +       ring->completed_fence = max(fence, ring->completed_fence);
>         spin_unlock(&fctx->spinlock);
>
>         wake_up_all(&fctx->event);
>  }
>
> -struct msm_fence {
> -       struct msm_fence_context *fctx;
> -       struct dma_fence base;
> -};
>
>  static inline struct msm_fence *to_msm_fence(struct dma_fence *fence)
>  {
> @@ -127,12 +157,17 @@ static bool msm_fence_enable_signaling(struct dma_fence *fence)
>  static bool msm_fence_signaled(struct dma_fence *fence)
>  {
>         struct msm_fence *f = to_msm_fence(fence);
> -       return fence_completed(f->fctx, f->base.seqno);
> +       return fence_completed(f->ring, f->base.seqno);
>  }
>
>  static void msm_fence_release(struct dma_fence *fence)
>  {
>         struct msm_fence *f = to_msm_fence(fence);
> +
> +       spin_lock(&f->fctx->spinlock);
> +       hash_del_rcu(&f->node);
> +       spin_unlock(&f->fctx->spinlock);
> +
>         kfree_rcu(f, base.rcu);
>  }
>
> @@ -145,8 +180,15 @@ static void msm_fence_release(struct dma_fence *fence)
>         .release = msm_fence_release,
>  };
>
> +uint32_t msm_fence_id(struct dma_fence *fence)
> +{
> +       struct msm_fence *f = to_msm_fence(fence);
> +
> +       return f->fence_id;
> +}
> +
>  struct dma_fence *
> -msm_fence_alloc(struct msm_fence_context *fctx)
> +msm_fence_alloc(struct msm_fence_context *fctx, struct msm_ringbuffer *ring)
>  {
>         struct msm_fence *f;
>
> @@ -155,9 +197,17 @@ struct dma_fence *
>                 return ERR_PTR(-ENOMEM);
>
>         f->fctx = fctx;
> +       f->ring = ring;
> +
> +       /* Make a user fence ID to pass back for the legacy functions */
> +       f->fence_id = ++fctx->fence_id;
> +
> +       spin_lock(&fctx->spinlock);
> +       hash_add(fctx->hash, &f->node, f->fence_id);
> +       spin_unlock(&fctx->spinlock);
>
>         dma_fence_init(&f->base, &msm_fence_ops, &fctx->spinlock,
> -                      fctx->context, ++fctx->last_fence);
> +                      fctx->context + ring->id, ++ring->last_fence);
>
>         return &f->base;
>  }
> diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
> index 56061aa..b5c6830 100644
> --- a/drivers/gpu/drm/msm/msm_fence.h
> +++ b/drivers/gpu/drm/msm/msm_fence.h
> @@ -18,17 +18,18 @@
>  #ifndef __MSM_FENCE_H__
>  #define __MSM_FENCE_H__
>
> +#include <linux/hashtable.h>
>  #include "msm_drv.h"
> +#include "msm_ringbuffer.h"
>
>  struct msm_fence_context {
>         struct drm_device *dev;
>         const char *name;
>         unsigned context;
> -       /* last_fence == completed_fence --> no pending work */
> -       uint32_t last_fence;          /* last assigned fence */
> -       uint32_t completed_fence;     /* last completed fence */
> +       u32 fence_id;
>         wait_queue_head_t event;
>         spinlock_t spinlock;
> +       DECLARE_HASHTABLE(hash, 4);
>  };
>
>  struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
> @@ -39,8 +40,12 @@ int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
>                 ktime_t *timeout, bool interruptible);
>  int msm_queue_fence_cb(struct msm_fence_context *fctx,
>                 struct msm_fence_cb *cb, uint32_t fence);
> -void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
> +void msm_update_fence(struct msm_fence_context *fctx,
> +               struct msm_ringbuffer *ring, uint32_t fence);
>
> -struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
> +struct dma_fence *msm_fence_alloc(struct msm_fence_context *fctx,
> +               struct msm_ringbuffer *ring);
> +
> +uint32_t msm_fence_id(struct dma_fence *fence);
>
>  #endif
> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> index 2767014..ddae0a9 100644
> --- a/drivers/gpu/drm/msm/msm_gem.h
> +++ b/drivers/gpu/drm/msm/msm_gem.h
> @@ -116,12 +116,13 @@ static inline bool is_vunmapable(struct msm_gem_object *msm_obj)
>  struct msm_gem_submit {
>         struct drm_device *dev;
>         struct msm_gpu *gpu;
> -       struct list_head node;   /* node in gpu submit_list */
> +       struct list_head node;   /* node in ring submit list */
>         struct list_head bo_list;
>         struct ww_acquire_ctx ticket;
>         struct dma_fence *fence;
>         struct pid *pid;    /* submitting process */
>         bool valid;         /* true if no cmdstream patching needed */
> +       struct msm_ringbuffer *ring;
>         unsigned int nr_cmds;
>         unsigned int nr_bos;
>         struct {
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 0129ca2..4f483c0 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -418,7 +418,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>         int out_fence_fd = -1;
>         unsigned i;
>         u32 prio = 0;
> -       int ret;
> +       int ret, ring;
>
>         if (!gpu)
>                 return -ENXIO;
> @@ -552,7 +552,11 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>
>         submit->nr_cmds = i;
>
> -       submit->fence = msm_fence_alloc(gpu->fctx);
> +       ring = clamp_t(uint32_t, prio, 0, gpu->nr_rings - 1);
> +
> +       submit->ring = gpu->rb[ring];
> +
> +       submit->fence = msm_fence_alloc(gpu->fctx, submit->ring);
>         if (IS_ERR(submit->fence)) {
>                 ret = PTR_ERR(submit->fence);
>                 submit->fence = NULL;
> @@ -569,7 +573,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>
>         msm_gpu_submit(gpu, submit, ctx);
>
> -       args->fence = submit->fence->seqno;
> +       args->fence = msm_fence_id(submit->fence);
>
>         if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
>                 fd_install(out_fence_fd, sync_file->file);
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 1f753f0..a1bb3db 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -226,15 +226,35 @@ static void recover_worker(struct work_struct *work)
>         struct msm_gpu *gpu = container_of(work, struct msm_gpu, recover_work);
>         struct drm_device *dev = gpu->dev;
>         struct msm_gem_submit *submit;
> -       uint32_t fence = gpu->funcs->last_fence(gpu);
> +       struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
> +       uint32_t fence;
> +       int i;
> +
> +       /* Update all the rings with the latest and greatest fence */
> +       for (i = 0; i < ARRAY_SIZE(gpu->rb); i++) {
> +               struct msm_ringbuffer *ring = gpu->rb[i];
> +               uint32_t fence = gpu->funcs->last_fence(gpu, ring);
> +
> +               /*
> +                * For the current (faulting?) ring/submit advance the fence by
> +                * one more to clear the faulting submit
> +                */
> +               if (ring == cur_ring)
> +                       fence = fence + 1;
>
> -       msm_update_fence(gpu->fctx, fence + 1);
> +               msm_update_fence(gpu->fctx, cur_ring, fence);
> +       }
>
>         mutex_lock(&dev->struct_mutex);
>
> +
>         dev_err(dev->dev, "%s: hangcheck recover!\n", gpu->name);
> -       list_for_each_entry(submit, &gpu->submit_list, node) {
> -               if (submit->fence->seqno == (fence + 1)) {
> +
> +       fence = gpu->funcs->last_fence(gpu, cur_ring) + 1;
> +
> +       list_for_each_entry(submit, &cur_ring->submits, node) {
> +
> +               if (submit->fence->seqno == fence) {
>                         struct task_struct *task;
>
>                         rcu_read_lock();
> @@ -256,9 +276,16 @@ static void recover_worker(struct work_struct *work)
>                 gpu->funcs->recover(gpu);
>                 pm_runtime_put_sync(&gpu->pdev->dev);
>
> -               /* replay the remaining submits after the one that hung: */
> -               list_for_each_entry(submit, &gpu->submit_list, node) {
> -                       gpu->funcs->submit(gpu, submit, NULL);
> +               /*
> +                * Replay all remaining submits starting with highest priority
> +                * ring
> +                */
> +
> +               for (i = gpu->nr_rings - 1; i >= 0; i--) {
> +                       struct msm_ringbuffer *ring = gpu->rb[i];
> +
> +                       list_for_each_entry(submit, &ring->submits, node)
> +                               gpu->funcs->submit(gpu, submit, NULL);
>                 }
>         }
>
> @@ -279,25 +306,27 @@ static void hangcheck_handler(unsigned long data)
>         struct msm_gpu *gpu = (struct msm_gpu *)data;
>         struct drm_device *dev = gpu->dev;
>         struct msm_drm_private *priv = dev->dev_private;
> -       uint32_t fence = gpu->funcs->last_fence(gpu);
> +       struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
> +       uint32_t fence = gpu->funcs->last_fence(gpu, ring);
>
> -       if (fence != gpu->hangcheck_fence) {
> +       if (fence != gpu->hangcheck_fence[ring->id]) {
>                 /* some progress has been made.. ya! */
> -               gpu->hangcheck_fence = fence;
> -       } else if (fence < gpu->fctx->last_fence) {
> +               gpu->hangcheck_fence[ring->id] = fence;
> +       } else if (fence < ring->last_fence) {
>                 /* no progress and not done.. hung! */
> -               gpu->hangcheck_fence = fence;
> -               dev_err(dev->dev, "%s: hangcheck detected gpu lockup!\n",
> -                               gpu->name);
> +               gpu->hangcheck_fence[ring->id] = fence;
> +               dev_err(dev->dev, "%s: hangcheck detected gpu lockup rb %d!\n",
> +                               gpu->name, ring->id);
>                 dev_err(dev->dev, "%s:     completed fence: %u\n",
>                                 gpu->name, fence);
>                 dev_err(dev->dev, "%s:     submitted fence: %u\n",
> -                               gpu->name, gpu->fctx->last_fence);
> +                               gpu->name, ring->last_fence);
> +
>                 queue_work(priv->wq, &gpu->recover_work);
>         }
>
>         /* if still more pending work, reset the hangcheck timer: */
> -       if (gpu->fctx->last_fence > gpu->hangcheck_fence)
> +       if (ring->last_fence > gpu->hangcheck_fence[ring->id])
>                 hangcheck_timer_reset(gpu);
>
>         /* workaround for missing irq: */
> @@ -426,19 +455,18 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
>  static void retire_submits(struct msm_gpu *gpu)
>  {
>         struct drm_device *dev = gpu->dev;
> +       struct msm_gem_submit *submit, *tmp;
> +       int i;
>
>         WARN_ON(!mutex_is_locked(&dev->struct_mutex));
>
> -       while (!list_empty(&gpu->submit_list)) {
> -               struct msm_gem_submit *submit;
> +       /* Retire the commits starting with highest priority */
> +       for (i = gpu->nr_rings - 1; i >= 0; i--) {
> +               struct msm_ringbuffer *ring = gpu->rb[i];
>
> -               submit = list_first_entry(&gpu->submit_list,
> -                               struct msm_gem_submit, node);
> -
> -               if (dma_fence_is_signaled(submit->fence)) {
> -                       retire_submit(gpu, submit);
> -               } else {
> -                       break;
> +               list_for_each_entry_safe(submit, tmp, &ring->submits, node) {
> +                       if (dma_fence_is_signaled(submit->fence))
> +                               retire_submit(gpu, submit);
>                 }
>         }
>  }
> @@ -447,9 +475,12 @@ static void retire_worker(struct work_struct *work)
>  {
>         struct msm_gpu *gpu = container_of(work, struct msm_gpu, retire_work);
>         struct drm_device *dev = gpu->dev;
> -       uint32_t fence = gpu->funcs->last_fence(gpu);
> +       int i;
> +
> +       for (i = 0; i < gpu->nr_rings; i++)
> +               msm_update_fence(gpu->fctx, gpu->rb[i],
> +                       gpu->funcs->last_fence(gpu, gpu->rb[i]));
>
> -       msm_update_fence(gpu->fctx, fence);
>
>         mutex_lock(&dev->struct_mutex);
>         retire_submits(gpu);
> @@ -470,6 +501,7 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>  {
>         struct drm_device *dev = gpu->dev;
>         struct msm_drm_private *priv = dev->dev_private;
> +       struct msm_ringbuffer *ring = submit->ring;
>         int i;
>
>         WARN_ON(!mutex_is_locked(&dev->struct_mutex));
> @@ -478,7 +510,7 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>
>         msm_gpu_hw_init(gpu);
>
> -       list_add_tail(&submit->node, &gpu->submit_list);
> +       list_add_tail(&submit->node, &ring->submits);
>
>         msm_rd_dump_submit(submit);
>
> @@ -565,7 +597,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>                 const char *name, struct msm_gpu_config *config)
>  {
>         struct iommu_domain *iommu;
> -       int ret;
> +       int i, ret, nr_rings = config->nr_rings;
>
>         if (WARN_ON(gpu->num_perfcntrs > ARRAY_SIZE(gpu->last_cntrs)))
>                 gpu->num_perfcntrs = ARRAY_SIZE(gpu->last_cntrs);
> @@ -584,7 +616,6 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>         INIT_WORK(&gpu->retire_work, retire_worker);
>         INIT_WORK(&gpu->recover_work, recover_worker);
>
> -       INIT_LIST_HEAD(&gpu->submit_list);
>
>         setup_timer(&gpu->hangcheck_timer, hangcheck_handler,
>                         (unsigned long)gpu);
> @@ -658,40 +689,61 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>                 dev_info(drm->dev, "%s: no IOMMU, fallback to VRAM carveout!\n", name);
>         }
>
> -       /* Create ringbuffer: */
> -       mutex_lock(&drm->struct_mutex);
> -       gpu->rb = msm_ringbuffer_new(gpu, config->ringsz);
> -       mutex_unlock(&drm->struct_mutex);
> -       if (IS_ERR(gpu->rb)) {
> -               ret = PTR_ERR(gpu->rb);
> -               gpu->rb = NULL;
> -               dev_err(drm->dev, "could not create ringbuffer: %d\n", ret);
> -               goto fail;
> +       if (nr_rings > ARRAY_SIZE(gpu->rb)) {
> +               DRM_DEV_INFO_ONCE(drm->dev, "Only creating %lu ringbuffers\n",
> +                       ARRAY_SIZE(gpu->rb));
> +               nr_rings = ARRAY_SIZE(gpu->rb);
> +       }
> +
> +       /* Create ringbuffer(s): */
> +       for (i = 0; i < nr_rings; i++) {
> +               mutex_lock(&drm->struct_mutex);
> +               gpu->rb[i] = msm_ringbuffer_new(gpu, i);
> +               mutex_unlock(&drm->struct_mutex);
> +
> +               if (IS_ERR(gpu->rb[i])) {
> +                       ret = PTR_ERR(gpu->rb[i]);
> +                       dev_err(drm->dev,
> +                               "could not create ringbuffer %d: %d\n", i, ret);
> +                       goto fail;
> +               }
>         }
>
> +       gpu->nr_rings = nr_rings;
> +
>         gpu->pdev = pdev;
>         platform_set_drvdata(pdev, gpu);
>
> +
>         bs_init(gpu);
>
>         return 0;
>
>  fail:
> +       for (i = 0; i < nr_rings; i++) {
> +               msm_ringbuffer_destroy(gpu->rb[i]);
> +               gpu->rb[i] = NULL;
> +       }
> +
>         return ret;
>  }
>
>  void msm_gpu_cleanup(struct msm_gpu *gpu)
>  {
> +       int i;
> +
>         DBG("%s", gpu->name);
>
>         WARN_ON(!list_empty(&gpu->active_list));
>
>         bs_fini(gpu);
>
> -       if (gpu->rb) {
> -               if (gpu->rb_iova)
> -                       msm_gem_put_iova(gpu->rb->bo, gpu->aspace);
> -               msm_ringbuffer_destroy(gpu->rb);
> +       for (i = 0; i < gpu->nr_rings; i++) {
> +               if (gpu->rb[i]->iova)
> +                       msm_gem_put_iova(gpu->rb[i]->bo, gpu->aspace);
> +
> +               msm_ringbuffer_destroy(gpu->rb[i]);
> +               gpu->rb[i] = NULL;
>         }
>
>         if (gpu->fctx)
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index ca07a21..c0e7c84 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -33,7 +33,7 @@ struct msm_gpu_config {
>         const char *irqname;
>         uint64_t va_start;
>         uint64_t va_end;
> -       unsigned int ringsz;
> +       unsigned int nr_rings;
>  };
>
>  /* So far, with hardware that I've seen to date, we can have:
> @@ -57,9 +57,11 @@ struct msm_gpu_funcs {
>         int (*pm_resume)(struct msm_gpu *gpu);
>         void (*submit)(struct msm_gpu *gpu, struct msm_gem_submit *submit,
>                         struct msm_file_private *ctx);
> -       void (*flush)(struct msm_gpu *gpu);
> +       void (*flush)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
>         irqreturn_t (*irq)(struct msm_gpu *irq);
> -       uint32_t (*last_fence)(struct msm_gpu *gpu);
> +       uint32_t (*last_fence)(struct msm_gpu *gpu,
> +                       struct msm_ringbuffer *ring);
> +       struct msm_ringbuffer *(*active_ring)(struct msm_gpu *gpu);
>         void (*recover)(struct msm_gpu *gpu);
>         void (*destroy)(struct msm_gpu *gpu);
>  #ifdef CONFIG_DEBUG_FS
> @@ -86,9 +88,8 @@ struct msm_gpu {
>         const struct msm_gpu_perfcntr *perfcntrs;
>         uint32_t num_perfcntrs;
>
> -       /* ringbuffer: */
> -       struct msm_ringbuffer *rb;
> -       uint64_t rb_iova;
> +       struct msm_ringbuffer *rb[MSM_GPU_MAX_RINGS];
> +       int nr_rings;
>
>         /* list of GEM active objects: */
>         struct list_head active_list;
> @@ -126,10 +127,8 @@ struct msm_gpu {
>  #define DRM_MSM_HANGCHECK_PERIOD 500 /* in ms */
>  #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
>         struct timer_list hangcheck_timer;
> -       uint32_t hangcheck_fence;
> +       uint32_t hangcheck_fence[MSM_GPU_MAX_RINGS];
>         struct work_struct recover_work;
> -
> -       struct list_head submit_list;
>  };
>
>  struct msm_gpu_drawqueue {
> @@ -139,9 +138,20 @@ struct msm_gpu_drawqueue {
>         struct list_head node;
>  };
>
> +/* It turns out that all targets use the same ringbuffer size */
> +#define MSM_GPU_RINGBUFFER_SZ SZ_32K
> +
>  static inline bool msm_gpu_active(struct msm_gpu *gpu)
>  {
> -       return gpu->fctx->last_fence > gpu->funcs->last_fence(gpu);
> +       int i;
> +
> +       for (i = 0; i < gpu->nr_rings; i++) {
> +               if (gpu->rb[i]->last_fence >
> +                       gpu->funcs->last_fence(gpu, gpu->rb[i]))
> +                       return true;
> +       }
> +
> +       return false;
>  }
>
>  /* Perf-Counters:
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> index 67b34e0..10f1d948 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> @@ -18,13 +18,13 @@
>  #include "msm_ringbuffer.h"
>  #include "msm_gpu.h"
>
> -struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size)
> +struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id)
>  {
>         struct msm_ringbuffer *ring;
>         int ret;
>
> -       if (WARN_ON(!is_power_of_2(size)))
> -               return ERR_PTR(-EINVAL);
> +       /* We assume everwhere that MSM_GPU_RINGBUFFER_SZ is a power of 2 */
> +       BUILD_BUG_ON(!is_power_of_2(MSM_GPU_RINGBUFFER_SZ));
>
>         ring = kzalloc(sizeof(*ring), GFP_KERNEL);
>         if (!ring) {
> @@ -33,7 +33,8 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size)
>         }
>
>         ring->gpu = gpu;
> -       ring->bo = msm_gem_new(gpu->dev, size, MSM_BO_WC);
> +       ring->id = id;
> +       ring->bo = msm_gem_new(gpu->dev, MSM_GPU_RINGBUFFER_SZ, MSM_BO_WC);
>         if (IS_ERR(ring->bo)) {
>                 ret = PTR_ERR(ring->bo);
>                 ring->bo = NULL;
> @@ -45,21 +46,23 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size)
>                 ret = PTR_ERR(ring->start);
>                 goto fail;
>         }
> -       ring->end   = ring->start + (size / 4);
> +       ring->end   = ring->start + (MSM_GPU_RINGBUFFER_SZ >> 2);
>         ring->cur   = ring->start;
>
> -       ring->size = size;
> +       INIT_LIST_HEAD(&ring->submits);
>
>         return ring;
>
>  fail:
> -       if (ring)
> -               msm_ringbuffer_destroy(ring);
> +       msm_ringbuffer_destroy(ring);
>         return ERR_PTR(ret);
>  }
>
>  void msm_ringbuffer_destroy(struct msm_ringbuffer *ring)
>  {
> +       if (IS_ERR_OR_NULL(ring))
> +               return;
> +
>         if (ring->bo) {
>                 msm_gem_put_vaddr(ring->bo);
>                 drm_gem_object_unreference_unlocked(ring->bo);
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.h b/drivers/gpu/drm/msm/msm_ringbuffer.h
> index 6e0e104..c803364 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.h
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.h
> @@ -22,12 +22,17 @@
>
>  struct msm_ringbuffer {
>         struct msm_gpu *gpu;
> -       int size;
> +       int id;
>         struct drm_gem_object *bo;
>         uint32_t *start, *end, *cur;
> +       uint64_t iova;
> +       /* last_fence == completed_fence --> no pending work */
> +       uint32_t last_fence;
> +       uint32_t completed_fence;
> +       struct list_head submits;
>  };
>
> -struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size);
> +struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id);
>  void msm_ringbuffer_destroy(struct msm_ringbuffer *ring);
>
>  /* ringbuffer helpers (the parts that are same for a3xx/a2xx/z180..) */
> --
> 1.9.1
>
Jordan Crouse May 30, 2017, 4:20 p.m. UTC | #4
On Sun, May 28, 2017 at 09:43:35AM -0400, Rob Clark wrote:
> On Mon, May 8, 2017 at 4:35 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > Add the infrastructure to support the idea of multiple ringbuffers.
> > Assign each ringbuffer an id and use that as an index for the various
> > ring specific operations.
> >
> > The biggest delta is to support legacy fences. Each fence gets its own
> > sequence number but the legacy functions expect to use a unique integer.
> > To handle this we return a unique identifer for each submission but
> > map it to a specific ring/sequence under the covers. Newer users use
> > a dma_fence pointer anyway so they don't care about the actual sequence
> > ID or ring.
> 
> So, WAIT_FENCE is alive and well, and useful since it avoids the
> overhead of creating a 'struct file', but it is only used within a
> single pipe_context (or at least situations where we know which ctx
> the seqno fence applies to).  It seems like it would be simpler if we
> just introduced a ctx-id in all the ioctls (SUBMIT and WAIT_FENCE)
> that take a uint fence.  Then I think we don't need hashtable
> fancyness.
> 
> Also, one thing I was thinking of is that some-day we might want to
> make SUBMIT non-blocking when there is a dependency on a fence from a
> different ring.  (Ie. queue it up but don't write cmds into rb yet.)
> Which means we'd need multiple fence timelines per priority-level rb.
> Which brings me back to wanting a CREATE_CTX type of ioctl.  (And I
> guess DESTROY_CTX.)  We could make these simple stubs for now, ie.
> CREATE_CTX just returns the priority level back, and not really have
> any separate "context" object on the kernel side for now.  This
> wouldn't change the implementation much from what you have, but I
> think that gives us some flexibility to later on actually let us have
> multiple contexts at a given priority level which don't block each
> other for submits that are still pending on some fence, without
> another UABI change.

Sure. My motivation here was mostly to avoid making that decision, because I
know from experience that once we start going down that path we end up using
the context ID for everything and re-spinning a bunch of APIs.

But I agree that the context concept is our inevitable future - I've already
posted one set of patches for "draw queues" (which will soon be bravely renamed
to submit queues). I think that's the way we want to go because, as you said,
there is a 100% chance we'll go for asynchronous submissions in the very near
future.

That said, there is a bit of added complexity for per-queue fences - namely,
we need to move the per-ring fence value in the memptrs to a per-queue value.
This probably isn't a huge deal (an extra page of memory would give us up to
1024 queues to work with globally), but I get itchy every time an arbitrary
limit is introduced, no matter how reasonable it might be.

Jordan
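To make the page arithmetic above concrete: a 4K page holds 4096 / sizeof(uint32_t) = 1024
fence slots. Below is a rough sketch of what per-queue fence slots carved out of one shared
page could look like - the layout and names (MSM_MAX_QUEUES, msm_queue_memptrs,
queue_fence_iova) are hypothetical and are not what this patch implements:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE	4096u	/* assumption: 4K pages */
#define MSM_MAX_QUEUES	(PAGE_SIZE / sizeof(uint32_t))	/* 1024 slots */

/* one shared page of per-queue completed-fence slots (hypothetical) */
struct msm_queue_memptrs {
	volatile uint32_t fence[MSM_MAX_QUEUES];
};

/*
 * GPU address the CP would write a given queue's completed fence to,
 * computed in the same spirit as the rbmemptr() macro in this patch.
 */
static inline uint64_t queue_fence_iova(uint64_t memptrs_iova,
		unsigned int queue_id)
{
	return memptrs_iova + offsetof(struct msm_queue_memptrs, fence) +
		queue_id * sizeof(uint32_t);
}

The 1024-queue figure is purely that page-size arithmetic; a second page would double it,
which is why it is an arbitrary rather than an architectural limit.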
Alex Deucher May 30, 2017, 4:34 p.m. UTC | #5
On Tue, May 30, 2017 at 12:20 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> On Sun, May 28, 2017 at 09:43:35AM -0400, Rob Clark wrote:
>> On Mon, May 8, 2017 at 4:35 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
>> > Add the infrastructure to support the idea of multiple ringbuffers.
>> > Assign each ringbuffer an id and use that as an index for the various
>> > ring specific operations.
>> >
>> > The biggest delta is to support legacy fences. Each fence gets its own
>> > sequence number but the legacy functions expect to use a unique integer.
>> > To handle this we return a unique identifer for each submission but
>> > map it to a specific ring/sequence under the covers. Newer users use
>> > a dma_fence pointer anyway so they don't care about the actual sequence
>> > ID or ring.
>>
>> So, WAIT_FENCE is alive and well, and useful since it avoids the
>> overhead of creating a 'struct file', but it is only used within a
>> single pipe_context (or at least situations where we know which ctx
>> the seqno fence applies to).  It seems like it would be simpler if we
>> just introduced a ctx-id in all the ioctls (SUBMIT and WAIT_FENCE)
>> that take a uint fence.  Then I think we don't need hashtable
>> fancyness.
>>
>> Also, one thing I was thinking of is that some-day we might want to
>> make SUBMIT non-blocking when there is a dependency on a fence from a
>> different ring.  (Ie. queue it up but don't write cmds into rb yet.)
>> Which means we'd need multiple fence timelines per priority-level rb.
>> Which brings me back to wanting a CREATE_CTX type of ioctl.  (And I
>> guess DESTROY_CTX.)  We could make these simple stubs for now, ie.
>> CREATE_CTX just returns the priority level back, and not really have
>> any separate "context" object on the kernel side for now.  This
>> wouldn't change the implementation much from what you have, but I
>> think that gives us some flexibility to later on actually let us have
>> multiple contexts at a given priority level which don't block each
>> other for submits that are still pending on some fence, without
>> another UABI change.
>
> Sure. My motivation here was to mostly avoid making that decision because I know
> from experience once we start going down that path we end up using the context
> ID for everything and we end up re-spinning a bunch of APIs.
>
> But I agree that the context concept is our inevitable future - I've already
> posted one set of patches for "draw queues" (which will soon be bravely renamed
> as submit queues). I think thats the way we want to go because as you said,
> there is a 100% chance we'll go for asynchronous submissions in the very near
> future.
>
> That said, there is a bit of added complexity for per-queue fences - namely,
> we need to move the per-ring fence value in the memptrs to a per-queue value.
> This probably isn't a huge deal (an extra page of memory would give us up to
> 1024 queues to work with globally) but I get itchy every time an arbitrary
> limit is introduced no matter how reasonable it might be.
>

FWIW, we have contexts in amdgpu and it makes a lot of things easier
when dealing with dependencies.  Feel free to browse our
implementation for ideas.

Alex
Daniel Vetter May 31, 2017, 7:21 a.m. UTC | #6
On Tue, May 30, 2017 at 12:34:34PM -0400, Alex Deucher wrote:
> On Tue, May 30, 2017 at 12:20 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > On Sun, May 28, 2017 at 09:43:35AM -0400, Rob Clark wrote:
> >> On Mon, May 8, 2017 at 4:35 PM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> >> > Add the infrastructure to support the idea of multiple ringbuffers.
> >> > Assign each ringbuffer an id and use that as an index for the various
> >> > ring specific operations.
> >> >
> >> > The biggest delta is to support legacy fences. Each fence gets its own
> >> > sequence number but the legacy functions expect to use a unique integer.
> >> > To handle this we return a unique identifer for each submission but
> >> > map it to a specific ring/sequence under the covers. Newer users use
> >> > a dma_fence pointer anyway so they don't care about the actual sequence
> >> > ID or ring.
> >>
> >> So, WAIT_FENCE is alive and well, and useful since it avoids the
> >> overhead of creating a 'struct file', but it is only used within a
> >> single pipe_context (or at least situations where we know which ctx
> >> the seqno fence applies to).  It seems like it would be simpler if we
> >> just introduced a ctx-id in all the ioctls (SUBMIT and WAIT_FENCE)
> >> that take a uint fence.  Then I think we don't need hashtable
> >> fancyness.
> >>
> >> Also, one thing I was thinking of is that some-day we might want to
> >> make SUBMIT non-blocking when there is a dependency on a fence from a
> >> different ring.  (Ie. queue it up but don't write cmds into rb yet.)
> >> Which means we'd need multiple fence timelines per priority-level rb.
> >> Which brings me back to wanting a CREATE_CTX type of ioctl.  (And I
> >> guess DESTROY_CTX.)  We could make these simple stubs for now, ie.
> >> CREATE_CTX just returns the priority level back, and not really have
> >> any separate "context" object on the kernel side for now.  This
> >> wouldn't change the implementation much from what you have, but I
> >> think that gives us some flexibility to later on actually let us have
> >> multiple contexts at a given priority level which don't block each
> >> other for submits that are still pending on some fence, without
> >> another UABI change.
> >
> > Sure. My motivation here was to mostly avoid making that decision because I know
> > from experience once we start going down that path we end up using the context
> > ID for everything and we end up re-spinning a bunch of APIs.
> >
> > But I agree that the context concept is our inevitable future - I've already
> > posted one set of patches for "draw queues" (which will soon be bravely renamed
> > as submit queues). I think thats the way we want to go because as you said,
> > there is a 100% chance we'll go for asynchronous submissions in the very near
> > future.
> >
> > That said, there is a bit of added complexity for per-queue fences - namely,
> > we need to move the per-ring fence value in the memptrs to a per-queue value.
> > This probably isn't a huge deal (an extra page of memory would give us up to
> > 1024 queues to work with globally) but I get itchy every time an arbitrary
> > limit is introduced no matter how reasonable it might be.
> >
> 
> FWIW, we have contexts in amdgpu and it makes a lot of things easier
> when dealing with dependencies.  Feel free to browse our
> implementation for ideas.

Same on i915: we use contexts (not batches) as the scheduling entity.
Think of them like threads on a CPU, at least in our case. And we can
dynamically allocate as many as we need (well, until we run out of memory
of course); we can even swap them in/out :-)
-Daniel
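One detail worth keeping in mind while reading the fence changes in the patch below:
completed vs. pending fences are 32-bit sequence numbers compared via a signed
difference, so the check stays correct across wraparound. A minimal standalone
illustration of the comparison used by fence_completed() (a test program only, not
driver code):

#include <stdint.h>
#include <stdio.h>

/* same comparison the patch uses in fence_completed() */
static int fence_completed(uint32_t completed, uint32_t fence)
{
	return (int32_t)(completed - fence) >= 0;
}

int main(void)
{
	printf("%d\n", fence_completed(10, 10));	/* 1: fence 10 retired */
	printf("%d\n", fence_completed(9, 10));		/* 0: still pending */

	/*
	 * Wraparound: completed has wrapped past zero while the fence was
	 * issued just below UINT32_MAX; the signed difference is still a
	 * small positive number, so the fence correctly reads as completed.
	 */
	printf("%d\n", fence_completed(4, 0xfffffffa));	/* 1 */
	return 0;
}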
diff mbox

Patch

diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
index 0e3828ed..10d0234 100644
--- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
@@ -44,7 +44,7 @@ 
 
 static bool a3xx_me_init(struct msm_gpu *gpu)
 {
-	struct msm_ringbuffer *ring = gpu->rb;
+	struct msm_ringbuffer *ring = gpu->rb[0];
 
 	OUT_PKT3(ring, CP_ME_INIT, 17);
 	OUT_RING(ring, 0x000003f7);
@@ -65,7 +65,7 @@  static bool a3xx_me_init(struct msm_gpu *gpu)
 	OUT_RING(ring, 0x00000000);
 	OUT_RING(ring, 0x00000000);
 
-	gpu->funcs->flush(gpu);
+	gpu->funcs->flush(gpu, ring);
 	return a3xx_idle(gpu);
 }
 
@@ -339,7 +339,7 @@  static void a3xx_destroy(struct msm_gpu *gpu)
 static bool a3xx_idle(struct msm_gpu *gpu)
 {
 	/* wait for ringbuffer to drain: */
-	if (!adreno_idle(gpu))
+	if (!adreno_idle(gpu, gpu->rb[0]))
 		return false;
 
 	/* then wait for GPU to finish: */
@@ -447,6 +447,7 @@  static void a3xx_dump(struct msm_gpu *gpu)
 		.last_fence = adreno_last_fence,
 		.submit = adreno_submit,
 		.flush = adreno_flush,
+		.active_ring = adreno_active_ring,
 		.irq = a3xx_irq,
 		.destroy = a3xx_destroy,
 #ifdef CONFIG_DEBUG_FS
@@ -494,7 +495,7 @@  struct msm_gpu *a3xx_gpu_init(struct drm_device *dev)
 	adreno_gpu->registers = a3xx_registers;
 	adreno_gpu->reg_offsets = a3xx_register_offsets;
 
-	ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs);
+	ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs, 1);
 	if (ret)
 		goto fail;
 
diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
index 19abf22..35fbf18 100644
--- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
@@ -116,7 +116,7 @@  static void a4xx_enable_hwcg(struct msm_gpu *gpu)
 
 static bool a4xx_me_init(struct msm_gpu *gpu)
 {
-	struct msm_ringbuffer *ring = gpu->rb;
+	struct msm_ringbuffer *ring = gpu->rb[0];
 
 	OUT_PKT3(ring, CP_ME_INIT, 17);
 	OUT_RING(ring, 0x000003f7);
@@ -137,7 +137,7 @@  static bool a4xx_me_init(struct msm_gpu *gpu)
 	OUT_RING(ring, 0x00000000);
 	OUT_RING(ring, 0x00000000);
 
-	gpu->funcs->flush(gpu);
+	gpu->funcs->flush(gpu, ring);
 	return a4xx_idle(gpu);
 }
 
@@ -337,7 +337,7 @@  static void a4xx_destroy(struct msm_gpu *gpu)
 static bool a4xx_idle(struct msm_gpu *gpu)
 {
 	/* wait for ringbuffer to drain: */
-	if (!adreno_idle(gpu))
+	if (!adreno_idle(gpu, gpu->rb[0]))
 		return false;
 
 	/* then wait for GPU to finish: */
@@ -535,6 +535,7 @@  static int a4xx_get_timestamp(struct msm_gpu *gpu, uint64_t *value)
 		.last_fence = adreno_last_fence,
 		.submit = adreno_submit,
 		.flush = adreno_flush,
+		.active_ring = adreno_active_ring,
 		.irq = a4xx_irq,
 		.destroy = a4xx_destroy,
 #ifdef CONFIG_DEBUG_FS
@@ -576,7 +577,7 @@  struct msm_gpu *a4xx_gpu_init(struct drm_device *dev)
 	adreno_gpu->registers = a4xx_registers;
 	adreno_gpu->reg_offsets = a4xx_register_offsets;
 
-	ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs);
+	ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs, 1);
 	if (ret)
 		goto fail;
 
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index fd54cc7..aaa941e 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -80,7 +80,7 @@  static void a5xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct msm_drm_private *priv = gpu->dev->dev_private;
-	struct msm_ringbuffer *ring = gpu->rb;
+	struct msm_ringbuffer *ring = submit->ring;
 	unsigned int i, ibs = 0;
 
 	for (i = 0; i < submit->nr_cmds; i++) {
@@ -105,11 +105,11 @@  static void a5xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 
 	OUT_PKT7(ring, CP_EVENT_WRITE, 4);
 	OUT_RING(ring, CACHE_FLUSH_TS | (1 << 31));
-	OUT_RING(ring, lower_32_bits(rbmemptr(adreno_gpu, fence)));
-	OUT_RING(ring, upper_32_bits(rbmemptr(adreno_gpu, fence)));
+	OUT_RING(ring, lower_32_bits(rbmemptr(adreno_gpu, ring->id, fence)));
+	OUT_RING(ring, upper_32_bits(rbmemptr(adreno_gpu, ring->id, fence)));
 	OUT_RING(ring, submit->fence->seqno);
 
-	gpu->funcs->flush(gpu);
+	gpu->funcs->flush(gpu, ring);
 }
 
 struct a5xx_hwcg {
@@ -249,7 +249,7 @@  static void a5xx_enable_hwcg(struct msm_gpu *gpu)
 static int a5xx_me_init(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
-	struct msm_ringbuffer *ring = gpu->rb;
+	struct msm_ringbuffer *ring = gpu->rb[0];
 
 	OUT_PKT7(ring, CP_ME_INIT, 8);
 
@@ -280,9 +280,8 @@  static int a5xx_me_init(struct msm_gpu *gpu)
 	OUT_RING(ring, 0x00000000);
 	OUT_RING(ring, 0x00000000);
 
-	gpu->funcs->flush(gpu);
-
-	return a5xx_idle(gpu) ? 0 : -EINVAL;
+	gpu->funcs->flush(gpu, ring);
+	return a5xx_idle(gpu, ring) ? 0 : -EINVAL;
 }
 
 static struct drm_gem_object *a5xx_ucode_load_bo(struct msm_gpu *gpu,
@@ -628,11 +627,11 @@  static int a5xx_hw_init(struct msm_gpu *gpu)
 	 * ticking correctly
 	 */
 	if (adreno_is_a530(adreno_gpu)) {
-		OUT_PKT7(gpu->rb, CP_EVENT_WRITE, 1);
-		OUT_RING(gpu->rb, 0x0F);
+		OUT_PKT7(gpu->rb[0], CP_EVENT_WRITE, 1);
+		OUT_RING(gpu->rb[0], 0x0F);
 
-		gpu->funcs->flush(gpu);
-		if (!a5xx_idle(gpu))
+		gpu->funcs->flush(gpu, gpu->rb[0]);
+		if (!a5xx_idle(gpu, gpu->rb[0]))
 			return -EINVAL;
 	}
 
@@ -645,11 +644,11 @@  static int a5xx_hw_init(struct msm_gpu *gpu)
 	 */
 	ret = a5xx_zap_shader_init(gpu);
 	if (!ret) {
-		OUT_PKT7(gpu->rb, CP_SET_SECURE_MODE, 1);
-		OUT_RING(gpu->rb, 0x00000000);
+		OUT_PKT7(gpu->rb[0], CP_SET_SECURE_MODE, 1);
+		OUT_RING(gpu->rb[0], 0x00000000);
 
-		gpu->funcs->flush(gpu);
-		if (!a5xx_idle(gpu))
+		gpu->funcs->flush(gpu, gpu->rb[0]);
+		if (!a5xx_idle(gpu, gpu->rb[0]))
 			return -EINVAL;
 	} else {
 		/* Print a warning so if we die, we know why */
@@ -726,16 +725,19 @@  static inline bool _a5xx_check_idle(struct msm_gpu *gpu)
 		A5XX_RBBM_INT_0_MASK_MISC_HANG_DETECT);
 }
 
-bool a5xx_idle(struct msm_gpu *gpu)
+bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 {
 	/* wait for CP to drain ringbuffer: */
-	if (!adreno_idle(gpu))
+	if (!adreno_idle(gpu, ring))
 		return false;
 
 	if (spin_until(_a5xx_check_idle(gpu))) {
-		DRM_ERROR("%s: %ps: timeout waiting for GPU to idle: status %8.8X irq %8.8X\n",
-			gpu->name, __builtin_return_address(0),
+		DRM_DEV_ERROR(gpu->dev->dev,
+			"timeout waiting for GPU RB %d to idle: status %8.8X rptr/wptr: %4.4X/%4.4X irq %8.8X\n",
+			ring->id,
 			gpu_read(gpu, REG_A5XX_RBBM_STATUS),
+			gpu_read(gpu, REG_A5XX_CP_RB_RPTR),
+			gpu_read(gpu, REG_A5XX_CP_RB_WPTR),
 			gpu_read(gpu, REG_A5XX_RBBM_INT_0_STATUS));
 
 		return false;
@@ -1031,6 +1033,7 @@  static void a5xx_show(struct msm_gpu *gpu, struct seq_file *m)
 		.last_fence = adreno_last_fence,
 		.submit = a5xx_submit,
 		.flush = adreno_flush,
+		.active_ring = adreno_active_ring,
 		.irq = a5xx_irq,
 		.destroy = a5xx_destroy,
 #ifdef CONFIG_DEBUG_FS
@@ -1067,7 +1070,7 @@  struct msm_gpu *a5xx_gpu_init(struct drm_device *dev)
 
 	a5xx_gpu->lm_leakage = 0x4E001A;
 
-	ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs);
+	ret = adreno_gpu_init(dev, pdev, adreno_gpu, &funcs, 1);
 	if (ret) {
 		a5xx_destroy(&(a5xx_gpu->base.base));
 		return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
index 6638bc8..aba6faf 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
@@ -58,6 +58,6 @@  static inline int spin_usecs(struct msm_gpu *gpu, uint32_t usecs,
 	return -ETIMEDOUT;
 }
 
-bool a5xx_idle(struct msm_gpu *gpu);
+bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
 
 #endif /* __A5XX_GPU_H__ */
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_power.c b/drivers/gpu/drm/msm/adreno/a5xx_power.c
index 2fdee44..a7d91ac 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_power.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_power.c
@@ -173,7 +173,7 @@  static int a5xx_gpmu_init(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct a5xx_gpu *a5xx_gpu = to_a5xx_gpu(adreno_gpu);
-	struct msm_ringbuffer *ring = gpu->rb;
+	struct msm_ringbuffer *ring = gpu->rb[0];
 
 	if (!a5xx_gpu->gpmu_dwords)
 		return 0;
@@ -192,9 +192,9 @@  static int a5xx_gpmu_init(struct msm_gpu *gpu)
 	OUT_PKT7(ring, CP_SET_PROTECTED_MODE, 1);
 	OUT_RING(ring, 1);
 
-	gpu->funcs->flush(gpu);
+	gpu->funcs->flush(gpu, ring);
 
-	if (!a5xx_idle(gpu)) {
+	if (!a5xx_idle(gpu, ring)) {
 		DRM_ERROR("%s: Unable to load GPMU firmware. GPMU will not be active\n",
 			gpu->name);
 		return -EINVAL;
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 4a24506..6b7114d 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -21,7 +21,6 @@ 
 #include "msm_gem.h"
 #include "msm_mmu.h"
 
-#define RB_SIZE    SZ_32K
 #define RB_BLKSIZE 32
 
 int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
@@ -60,39 +59,47 @@  int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
 int adreno_hw_init(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
-	int ret;
+	int i;
 
 	DBG("%s", gpu->name);
 
-	ret = msm_gem_get_iova(gpu->rb->bo, gpu->aspace, &gpu->rb_iova);
-	if (ret) {
-		gpu->rb_iova = 0;
-		dev_err(gpu->dev->dev, "could not map ringbuffer: %d\n", ret);
-		return ret;
-	}
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct msm_ringbuffer *ring = gpu->rb[i];
+		int ret;
 
-	/* reset ringbuffer: */
-	gpu->rb->cur = gpu->rb->start;
+		if (!ring)
+			continue;
 
-	/* reset completed fence seqno: */
-	adreno_gpu->memptrs->fence = gpu->fctx->completed_fence;
-	adreno_gpu->memptrs->rptr  = 0;
+		ret = msm_gem_get_iova(ring->bo, gpu->aspace, &ring->iova);
+		if (ret) {
+			ring->iova = 0;
+			dev_err(gpu->dev->dev,
+				"could not map ringbuffer %d: %d\n", i, ret);
+			return ret;
+		}
+
+		ring->cur = ring->start;
+
+		/* reset completed fence seqno: */
+		adreno_gpu->memptrs->fence[ring->id] = ring->completed_fence;
+		adreno_gpu->memptrs->rptr[ring->id]  = 0;
+	}
 
 	/* Setup REG_CP_RB_CNTL: */
 	adreno_gpu_write(adreno_gpu, REG_ADRENO_CP_RB_CNTL,
-			/* size is log2(quad-words): */
-			AXXX_CP_RB_CNTL_BUFSZ(ilog2(gpu->rb->size / 8)) |
-			AXXX_CP_RB_CNTL_BLKSZ(ilog2(RB_BLKSIZE / 8)) |
-			(adreno_is_a430(adreno_gpu) ? AXXX_CP_RB_CNTL_NO_UPDATE : 0));
+		/* size is log2(quad-words): */
+		AXXX_CP_RB_CNTL_BUFSZ(ilog2(MSM_GPU_RINGBUFFER_SZ / 8)) |
+		AXXX_CP_RB_CNTL_BLKSZ(ilog2(RB_BLKSIZE / 8)) |
+		(adreno_is_a430(adreno_gpu) ? AXXX_CP_RB_CNTL_NO_UPDATE : 0));
 
-	/* Setup ringbuffer address: */
+	/* Setup ringbuffer address - use ringbuffer[0] for GPU init */
 	adreno_gpu_write64(adreno_gpu, REG_ADRENO_CP_RB_BASE,
-		REG_ADRENO_CP_RB_BASE_HI, gpu->rb_iova);
+		REG_ADRENO_CP_RB_BASE_HI, gpu->rb[0]->iova);
 
 	if (!adreno_is_a430(adreno_gpu)) {
 		adreno_gpu_write64(adreno_gpu, REG_ADRENO_CP_RB_RPTR_ADDR,
 			REG_ADRENO_CP_RB_RPTR_ADDR_HI,
-			rbmemptr(adreno_gpu, rptr));
+			rbmemptr(adreno_gpu, 0, rptr));
 	}
 
 	return 0;
@@ -104,19 +111,35 @@  static uint32_t get_wptr(struct msm_ringbuffer *ring)
 }
 
 /* Use this helper to read rptr, since a430 doesn't update rptr in memory */
-static uint32_t get_rptr(struct adreno_gpu *adreno_gpu)
+static uint32_t get_rptr(struct adreno_gpu *adreno_gpu,
+		struct msm_ringbuffer *ring)
 {
-	if (adreno_is_a430(adreno_gpu))
-		return adreno_gpu->memptrs->rptr = adreno_gpu_read(
+	if (adreno_is_a430(adreno_gpu)) {
+		/*
+		 * If index is anything but 0 this will probably break horribly,
+		 * but I think that we have enough infrastructure in place to
+		 * ensure that it won't be. If not then this is why your
+		 * a430 stopped working.
+		 */
+		return adreno_gpu->memptrs->rptr[ring->id] = adreno_gpu_read(
 			adreno_gpu, REG_ADRENO_CP_RB_RPTR);
-	else
-		return adreno_gpu->memptrs->rptr;
+	} else
+		return adreno_gpu->memptrs->rptr[ring->id];
 }
 
-uint32_t adreno_last_fence(struct msm_gpu *gpu)
+struct msm_ringbuffer *adreno_active_ring(struct msm_gpu *gpu)
+{
+	return gpu->rb[0];
+}
+
+uint32_t adreno_last_fence(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
-	return adreno_gpu->memptrs->fence;
+
+	if (!ring)
+		return 0;
+
+	return adreno_gpu->memptrs->fence[ring->id];
 }
 
 void adreno_recover(struct msm_gpu *gpu)
@@ -142,7 +165,7 @@  void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct msm_drm_private *priv = gpu->dev->dev_private;
-	struct msm_ringbuffer *ring = gpu->rb;
+	struct msm_ringbuffer *ring = submit->ring;
 	unsigned i;
 
 	for (i = 0; i < submit->nr_cmds; i++) {
@@ -181,7 +204,7 @@  void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 
 	OUT_PKT3(ring, CP_EVENT_WRITE, 3);
 	OUT_RING(ring, CACHE_FLUSH_TS);
-	OUT_RING(ring, rbmemptr(adreno_gpu, fence));
+	OUT_RING(ring, rbmemptr(adreno_gpu, ring->id, fence));
 	OUT_RING(ring, submit->fence->seqno);
 
 	/* we could maybe be clever and only CP_COND_EXEC the interrupt: */
@@ -208,10 +231,10 @@  void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 	}
 #endif
 
-	gpu->funcs->flush(gpu);
+	gpu->funcs->flush(gpu, ring);
 }
 
-void adreno_flush(struct msm_gpu *gpu)
+void adreno_flush(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	uint32_t wptr;
@@ -221,7 +244,7 @@  void adreno_flush(struct msm_gpu *gpu)
 	 * to account for the possibility that the last command fit exactly into
 	 * the ringbuffer and rb->next hasn't wrapped to zero yet
 	 */
-	wptr = get_wptr(gpu->rb) & ((gpu->rb->size / 4) - 1);
+	wptr = get_wptr(ring) % (MSM_GPU_RINGBUFFER_SZ >> 2);
 
 	/* ensure writes to ringbuffer have hit system memory: */
 	mb();
@@ -229,17 +252,18 @@  void adreno_flush(struct msm_gpu *gpu)
 	adreno_gpu_write(adreno_gpu, REG_ADRENO_CP_RB_WPTR, wptr);
 }
 
-bool adreno_idle(struct msm_gpu *gpu)
+bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
-	uint32_t wptr = get_wptr(gpu->rb);
+	uint32_t wptr = get_wptr(ring);
 
 	/* wait for CP to drain ringbuffer: */
-	if (!spin_until(get_rptr(adreno_gpu) == wptr))
+	if (!spin_until(get_rptr(adreno_gpu, ring) == wptr))
 		return true;
 
 	/* TODO maybe we need to reset GPU here to recover from hang? */
-	DRM_ERROR("%s: timeout waiting to drain ringbuffer!\n", gpu->name);
+	DRM_ERROR("%s: timeout waiting to drain ringbuffer %d!\n", gpu->name,
+		ring->id);
 	return false;
 }
 
@@ -254,10 +278,17 @@  void adreno_show(struct msm_gpu *gpu, struct seq_file *m)
 			adreno_gpu->rev.major, adreno_gpu->rev.minor,
 			adreno_gpu->rev.patchid);
 
-	seq_printf(m, "fence:    %d/%d\n", adreno_gpu->memptrs->fence,
-			gpu->fctx->last_fence);
-	seq_printf(m, "rptr:     %d\n", get_rptr(adreno_gpu));
-	seq_printf(m, "rb wptr:  %d\n", get_wptr(gpu->rb));
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct msm_ringbuffer *ring = gpu->rb[i];
+
+		seq_printf(m, "rb %d: fence:    %d/%d\n", i,
+			adreno_last_fence(gpu, ring),
+			ring->completed_fence);
+
+		seq_printf(m, "      rptr:     %d\n",
+			get_rptr(adreno_gpu, ring));
+		seq_printf(m, "rb wptr:  %d\n", get_wptr(ring));
+	}
 
 	/* dump these out in a form that can be parsed by demsm: */
 	seq_printf(m, "IO:region %s 00000000 00020000\n", gpu->name);
@@ -283,16 +314,23 @@  void adreno_show(struct msm_gpu *gpu, struct seq_file *m)
 void adreno_dump_info(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
+	int i;
 
 	printk("revision: %d (%d.%d.%d.%d)\n",
 			adreno_gpu->info->revn, adreno_gpu->rev.core,
 			adreno_gpu->rev.major, adreno_gpu->rev.minor,
 			adreno_gpu->rev.patchid);
 
-	printk("fence:    %d/%d\n", adreno_gpu->memptrs->fence,
-			gpu->fctx->last_fence);
-	printk("rptr:     %d\n", get_rptr(adreno_gpu));
-	printk("rb wptr:  %d\n", get_wptr(gpu->rb));
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct msm_ringbuffer *ring = gpu->rb[i];
+
+		printk("rb %d: fence:    %d/%d\n", i,
+			adreno_last_fence(gpu, ring),
+			ring->completed_fence);
+
+		printk("rptr:     %d\n", get_rptr(adreno_gpu, ring));
+		printk("rb wptr:  %d\n", get_wptr(ring));
+	}
 }
 
 /* would be nice to not have to duplicate the _show() stuff with printk(): */
@@ -315,19 +353,21 @@  void adreno_dump(struct msm_gpu *gpu)
 	}
 }
 
-static uint32_t ring_freewords(struct msm_gpu *gpu)
+static uint32_t ring_freewords(struct msm_ringbuffer *ring)
 {
-	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
-	uint32_t size = gpu->rb->size / 4;
-	uint32_t wptr = get_wptr(gpu->rb);
-	uint32_t rptr = get_rptr(adreno_gpu);
+	struct adreno_gpu *adreno_gpu = to_adreno_gpu(ring->gpu);
+	uint32_t size = MSM_GPU_RINGBUFFER_SZ >> 2;
+	uint32_t wptr = get_wptr(ring);
+	uint32_t rptr = get_rptr(adreno_gpu, ring);
 	return (rptr + (size - 1) - wptr) % size;
 }
 
-void adreno_wait_ring(struct msm_gpu *gpu, uint32_t ndwords)
+void adreno_wait_ring(struct msm_ringbuffer *ring, uint32_t ndwords)
 {
-	if (spin_until(ring_freewords(gpu) >= ndwords))
-		DRM_ERROR("%s: timeout waiting for ringbuffer space\n", gpu->name);
+	if (spin_until(ring_freewords(ring) >= ndwords))
+		DRM_DEV_ERROR(ring->gpu->dev->dev,
+			"timeout waiting for space in ringbuffer %d\n",
+			ring->id);
 }
 
 static const char *iommu_ports[] = {
@@ -336,7 +376,8 @@  void adreno_wait_ring(struct msm_gpu *gpu, uint32_t ndwords)
 };
 
 int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
-		struct adreno_gpu *adreno_gpu, const struct adreno_gpu_funcs *funcs)
+		struct adreno_gpu *adreno_gpu,
+		const struct adreno_gpu_funcs *funcs, int nr_rings)
 {
 	struct adreno_platform_config *config = pdev->dev.platform_data;
 	struct msm_gpu_config adreno_gpu_config  = { 0 };
@@ -364,7 +405,7 @@  int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	adreno_gpu_config.va_start = SZ_16M;
 	adreno_gpu_config.va_end = 0xffffffff;
 
-	adreno_gpu_config.ringsz = RB_SIZE;
+	adreno_gpu_config.nr_rings = nr_rings;
 
 	ret = msm_gpu_init(drm, pdev, &adreno_gpu->base, &funcs->base,
 			adreno_gpu->info->name, &adreno_gpu_config);
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
index 4d9165f..9e78e49 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
@@ -82,12 +82,18 @@  struct adreno_info {
 
 const struct adreno_info *adreno_info(struct adreno_rev rev);
 
-#define rbmemptr(adreno_gpu, member)  \
+#define _sizeof(member) \
+	sizeof(((struct adreno_rbmemptrs *) 0)->member[0])
+
+#define _base(adreno_gpu, member)  \
 	((adreno_gpu)->memptrs_iova + offsetof(struct adreno_rbmemptrs, member))
 
+#define rbmemptr(adreno_gpu, index, member) \
+	(_base((adreno_gpu), member) + ((index) * _sizeof(member)))
+
 struct adreno_rbmemptrs {
-	volatile uint32_t rptr;
-	volatile uint32_t fence;
+	volatile uint32_t rptr[MSM_GPU_MAX_RINGS];
+	volatile uint32_t fence[MSM_GPU_MAX_RINGS];
 };
 
 struct adreno_gpu {
@@ -197,21 +203,25 @@  static inline int adreno_is_a530(struct adreno_gpu *gpu)
 
 int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value);
 int adreno_hw_init(struct msm_gpu *gpu);
-uint32_t adreno_last_fence(struct msm_gpu *gpu);
+uint32_t adreno_last_fence(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
+uint32_t adreno_submitted_fence(struct msm_gpu *gpu,
+		struct msm_ringbuffer *ring);
 void adreno_recover(struct msm_gpu *gpu);
 void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 		struct msm_file_private *ctx);
-void adreno_flush(struct msm_gpu *gpu);
-bool adreno_idle(struct msm_gpu *gpu);
+void adreno_flush(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
+bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
 #ifdef CONFIG_DEBUG_FS
 void adreno_show(struct msm_gpu *gpu, struct seq_file *m);
 #endif
 void adreno_dump_info(struct msm_gpu *gpu);
 void adreno_dump(struct msm_gpu *gpu);
-void adreno_wait_ring(struct msm_gpu *gpu, uint32_t ndwords);
+void adreno_wait_ring(struct msm_ringbuffer *ring, uint32_t ndwords);
+struct msm_ringbuffer *adreno_active_ring(struct msm_gpu *gpu);
 
 int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
-		struct adreno_gpu *gpu, const struct adreno_gpu_funcs *funcs);
+		struct adreno_gpu *gpu, const struct adreno_gpu_funcs *funcs,
+		int nr_rings);
 void adreno_gpu_cleanup(struct adreno_gpu *gpu);
 
 
@@ -220,7 +230,7 @@  int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 static inline void
 OUT_PKT0(struct msm_ringbuffer *ring, uint16_t regindx, uint16_t cnt)
 {
-	adreno_wait_ring(ring->gpu, cnt+1);
+	adreno_wait_ring(ring, cnt+1);
 	OUT_RING(ring, CP_TYPE0_PKT | ((cnt-1) << 16) | (regindx & 0x7FFF));
 }
 
@@ -228,14 +238,14 @@  int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 static inline void
 OUT_PKT2(struct msm_ringbuffer *ring)
 {
-	adreno_wait_ring(ring->gpu, 1);
+	adreno_wait_ring(ring, 1);
 	OUT_RING(ring, CP_TYPE2_PKT);
 }
 
 static inline void
 OUT_PKT3(struct msm_ringbuffer *ring, uint8_t opcode, uint16_t cnt)
 {
-	adreno_wait_ring(ring->gpu, cnt+1);
+	adreno_wait_ring(ring, cnt+1);
 	OUT_RING(ring, CP_TYPE3_PKT | ((cnt-1) << 16) | ((opcode & 0xFF) << 8));
 }
 
@@ -257,14 +267,14 @@  static inline u32 PM4_PARITY(u32 val)
 static inline void
 OUT_PKT4(struct msm_ringbuffer *ring, uint16_t regindx, uint16_t cnt)
 {
-	adreno_wait_ring(ring->gpu, cnt + 1);
+	adreno_wait_ring(ring, cnt + 1);
 	OUT_RING(ring, PKT4(regindx, cnt));
 }
 
 static inline void
 OUT_PKT7(struct msm_ringbuffer *ring, uint8_t opcode, uint16_t cnt)
 {
-	adreno_wait_ring(ring->gpu, cnt + 1);
+	adreno_wait_ring(ring, cnt + 1);
 	OUT_RING(ring, CP_TYPE7_PKT | (cnt << 0) | (PM4_PARITY(cnt) << 15) |
 		((opcode & 0x7F) << 16) | (PM4_PARITY(opcode) << 23));
 }
diff --git a/drivers/gpu/drm/msm/msm_drv.h b/drivers/gpu/drm/msm/msm_drv.h
index 192147c..bbf7d3d 100644
--- a/drivers/gpu/drm/msm/msm_drv.h
+++ b/drivers/gpu/drm/msm/msm_drv.h
@@ -76,6 +76,8 @@  struct msm_vblank_ctrl {
 	spinlock_t lock;
 };
 
+#define MSM_GPU_MAX_RINGS 1
+
 struct msm_drm_private {
 
 	struct drm_device *dev;
diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
index 3f299c5..8cf029f 100644
--- a/drivers/gpu/drm/msm/msm_fence.c
+++ b/drivers/gpu/drm/msm/msm_fence.c
@@ -20,7 +20,6 @@ 
 #include "msm_drv.h"
 #include "msm_fence.h"
 
-
 struct msm_fence_context *
 msm_fence_context_alloc(struct drm_device *dev, const char *name)
 {
@@ -32,9 +31,10 @@  struct msm_fence_context *
 
 	fctx->dev = dev;
 	fctx->name = name;
-	fctx->context = dma_fence_context_alloc(1);
+	fctx->context = dma_fence_context_alloc(MSM_GPU_MAX_RINGS);
 	init_waitqueue_head(&fctx->event);
 	spin_lock_init(&fctx->spinlock);
+	hash_init(fctx->hash);
 
 	return fctx;
 }
@@ -44,64 +44,94 @@  void msm_fence_context_free(struct msm_fence_context *fctx)
 	kfree(fctx);
 }
 
-static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fence)
+static inline bool fence_completed(struct msm_ringbuffer *ring, uint32_t fence)
+{
+	return (int32_t)(ring->completed_fence - fence) >= 0;
+}
+
+struct msm_fence {
+	struct msm_fence_context *fctx;
+	struct msm_ringbuffer *ring;
+	struct dma_fence base;
+	struct hlist_node node;
+	u32 fence_id;
+};
+
+static struct msm_fence *fence_from_id(struct msm_fence_context *fctx,
+		uint32_t id)
 {
-	return (int32_t)(fctx->completed_fence - fence) >= 0;
+	struct msm_fence *f;
+
+	hash_for_each_possible_rcu(fctx->hash, f, node, id) {
+		if (f->fence_id == id) {
+			if (dma_fence_get_rcu(&f->base))
+				return f;
+		}
+	}
+
+	return NULL;
 }
 
 /* legacy path for WAIT_FENCE ioctl: */
 int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
 		ktime_t *timeout, bool interruptible)
 {
+	struct msm_fence *f = fence_from_id(fctx, fence);
 	int ret;
 
-	if (fence > fctx->last_fence) {
-		DRM_ERROR("%s: waiting on invalid fence: %u (of %u)\n",
-				fctx->name, fence, fctx->last_fence);
-		return -EINVAL;
+	/* If no active fence was found, there are two possibilities */
+	if (!f) {
+		/* The requested ID is newer than last issued - return error */
+		if (fence > fctx->fence_id) {
+			DRM_ERROR("%s: waiting on invalid fence: %u (of %u)\n",
+				fctx->name, fence, fctx->fence_id);
+			return -EINVAL;
+		}
+
+		/* If the id has been issued assume fence has been retired */
+		return 0;
 	}
 
 	if (!timeout) {
 		/* no-wait: */
-		ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
+		ret = fence_completed(f->ring, f->base.seqno) ? 0 : -EBUSY;
 	} else {
 		unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
 
 		if (interruptible)
 			ret = wait_event_interruptible_timeout(fctx->event,
-				fence_completed(fctx, fence),
+				fence_completed(f->ring, f->base.seqno),
 				remaining_jiffies);
 		else
 			ret = wait_event_timeout(fctx->event,
-				fence_completed(fctx, fence),
+				fence_completed(f->ring, f->base.seqno),
 				remaining_jiffies);
 
 		if (ret == 0) {
 			DBG("timeout waiting for fence: %u (completed: %u)",
-					fence, fctx->completed_fence);
+				f->base.seqno, f->ring->completed_fence);
 			ret = -ETIMEDOUT;
 		} else if (ret != -ERESTARTSYS) {
 			ret = 0;
 		}
 	}
 
+	dma_fence_put(&f->base);
+
 	return ret;
 }
 
 /* called from workqueue */
-void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
+void msm_update_fence(struct msm_fence_context *fctx,
+		struct msm_ringbuffer *ring, uint32_t fence)
 {
 	spin_lock(&fctx->spinlock);
-	fctx->completed_fence = max(fence, fctx->completed_fence);
+	ring->completed_fence = max(fence, ring->completed_fence);
 	spin_unlock(&fctx->spinlock);
 
 	wake_up_all(&fctx->event);
 }
 
-struct msm_fence {
-	struct msm_fence_context *fctx;
-	struct dma_fence base;
-};
 
 static inline struct msm_fence *to_msm_fence(struct dma_fence *fence)
 {
@@ -127,12 +157,17 @@  static bool msm_fence_enable_signaling(struct dma_fence *fence)
 static bool msm_fence_signaled(struct dma_fence *fence)
 {
 	struct msm_fence *f = to_msm_fence(fence);
-	return fence_completed(f->fctx, f->base.seqno);
+	return fence_completed(f->ring, f->base.seqno);
 }
 
 static void msm_fence_release(struct dma_fence *fence)
 {
 	struct msm_fence *f = to_msm_fence(fence);
+
+	spin_lock(&f->fctx->spinlock);
+	hash_del_rcu(&f->node);
+	spin_unlock(&f->fctx->spinlock);
+
 	kfree_rcu(f, base.rcu);
 }
 
@@ -145,8 +180,15 @@  static void msm_fence_release(struct dma_fence *fence)
 	.release = msm_fence_release,
 };
 
+uint32_t msm_fence_id(struct dma_fence *fence)
+{
+	struct msm_fence *f = to_msm_fence(fence);
+
+	return f->fence_id;
+}
+
 struct dma_fence *
-msm_fence_alloc(struct msm_fence_context *fctx)
+msm_fence_alloc(struct msm_fence_context *fctx, struct msm_ringbuffer *ring)
 {
 	struct msm_fence *f;
 
@@ -155,9 +197,17 @@  struct dma_fence *
 		return ERR_PTR(-ENOMEM);
 
 	f->fctx = fctx;
+	f->ring = ring;
+
+	/* Make a user fence ID to pass back for the legacy functions */
+	f->fence_id = ++fctx->fence_id;
+
+	spin_lock(&fctx->spinlock);
+	hash_add(fctx->hash, &f->node, f->fence_id);
+	spin_unlock(&fctx->spinlock);
 
 	dma_fence_init(&f->base, &msm_fence_ops, &fctx->spinlock,
-		       fctx->context, ++fctx->last_fence);
+		       fctx->context + ring->id, ++ring->last_fence);
 
 	return &f->base;
 }
diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
index 56061aa..b5c6830 100644
--- a/drivers/gpu/drm/msm/msm_fence.h
+++ b/drivers/gpu/drm/msm/msm_fence.h
@@ -18,17 +18,18 @@ 
 #ifndef __MSM_FENCE_H__
 #define __MSM_FENCE_H__
 
+#include <linux/hashtable.h>
 #include "msm_drv.h"
+#include "msm_ringbuffer.h"
 
 struct msm_fence_context {
 	struct drm_device *dev;
 	const char *name;
 	unsigned context;
-	/* last_fence == completed_fence --> no pending work */
-	uint32_t last_fence;          /* last assigned fence */
-	uint32_t completed_fence;     /* last completed fence */
+	u32 fence_id;
 	wait_queue_head_t event;
 	spinlock_t spinlock;
+	DECLARE_HASHTABLE(hash, 4);
 };
 
 struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
@@ -39,8 +40,12 @@  int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
 		ktime_t *timeout, bool interruptible);
 int msm_queue_fence_cb(struct msm_fence_context *fctx,
 		struct msm_fence_cb *cb, uint32_t fence);
-void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
+void msm_update_fence(struct msm_fence_context *fctx,
+		struct msm_ringbuffer *ring, uint32_t fence);
 
-struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
+struct dma_fence *msm_fence_alloc(struct msm_fence_context *fctx,
+		struct msm_ringbuffer *ring);
+
+uint32_t msm_fence_id(struct dma_fence *fence);
 
 #endif
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 2767014..ddae0a9 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -116,12 +116,13 @@  static inline bool is_vunmapable(struct msm_gem_object *msm_obj)
 struct msm_gem_submit {
 	struct drm_device *dev;
 	struct msm_gpu *gpu;
-	struct list_head node;   /* node in gpu submit_list */
+	struct list_head node;   /* node in ring submit list */
 	struct list_head bo_list;
 	struct ww_acquire_ctx ticket;
 	struct dma_fence *fence;
 	struct pid *pid;    /* submitting process */
 	bool valid;         /* true if no cmdstream patching needed */
+	struct msm_ringbuffer *ring;
 	unsigned int nr_cmds;
 	unsigned int nr_bos;
 	struct {
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 0129ca2..4f483c0 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -418,7 +418,7 @@  int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
 	int out_fence_fd = -1;
 	unsigned i;
 	u32 prio = 0;
-	int ret;
+	int ret, ring;
 
 	if (!gpu)
 		return -ENXIO;
@@ -552,7 +552,11 @@  int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
 
 	submit->nr_cmds = i;
 
-	submit->fence = msm_fence_alloc(gpu->fctx);
+	ring = clamp_t(uint32_t, prio, 0, gpu->nr_rings - 1);
+
+	submit->ring = gpu->rb[ring];
+
+	submit->fence = msm_fence_alloc(gpu->fctx, submit->ring);
 	if (IS_ERR(submit->fence)) {
 		ret = PTR_ERR(submit->fence);
 		submit->fence = NULL;
@@ -569,7 +573,7 @@  int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
 
 	msm_gpu_submit(gpu, submit, ctx);
 
-	args->fence = submit->fence->seqno;
+	args->fence = msm_fence_id(submit->fence);
 
 	if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
 		fd_install(out_fence_fd, sync_file->file);
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 1f753f0..a1bb3db 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -226,15 +226,35 @@  static void recover_worker(struct work_struct *work)
 	struct msm_gpu *gpu = container_of(work, struct msm_gpu, recover_work);
 	struct drm_device *dev = gpu->dev;
 	struct msm_gem_submit *submit;
-	uint32_t fence = gpu->funcs->last_fence(gpu);
+	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
+	uint32_t fence;
+	int i;
+
+	/* Update all the rings with the latest and greatest fence */
+	for (i = 0; i < gpu->nr_rings; i++) {
+		struct msm_ringbuffer *ring = gpu->rb[i];
+		uint32_t fence = gpu->funcs->last_fence(gpu, ring);
+
+		/*
+		 * For the current (faulting?) ring/submit advance the fence by
+		 * one more to clear the faulting submit
+		 */
+		if (ring == cur_ring)
+			fence = fence + 1;
 
-	msm_update_fence(gpu->fctx, fence + 1);
+		msm_update_fence(gpu->fctx, ring, fence);
+	}
 
 	mutex_lock(&dev->struct_mutex);
 
+
 	dev_err(dev->dev, "%s: hangcheck recover!\n", gpu->name);
-	list_for_each_entry(submit, &gpu->submit_list, node) {
-		if (submit->fence->seqno == (fence + 1)) {
+
+	fence = gpu->funcs->last_fence(gpu, cur_ring) + 1;
+
+	list_for_each_entry(submit, &cur_ring->submits, node) {
+
+		if (submit->fence->seqno == fence) {
 			struct task_struct *task;
 
 			rcu_read_lock();
@@ -256,9 +276,16 @@  static void recover_worker(struct work_struct *work)
 		gpu->funcs->recover(gpu);
 		pm_runtime_put_sync(&gpu->pdev->dev);
 
-		/* replay the remaining submits after the one that hung: */
-		list_for_each_entry(submit, &gpu->submit_list, node) {
-			gpu->funcs->submit(gpu, submit, NULL);
+		/*
+		 * Replay all remaining submits starting with highest priority
+		 * ring
+		 */
+
+		for (i = gpu->nr_rings - 1; i >= 0; i--) {
+			struct msm_ringbuffer *ring = gpu->rb[i];
+
+			list_for_each_entry(submit, &ring->submits, node)
+				gpu->funcs->submit(gpu, submit, NULL);
 		}
 	}
 
@@ -279,25 +306,27 @@  static void hangcheck_handler(unsigned long data)
 	struct msm_gpu *gpu = (struct msm_gpu *)data;
 	struct drm_device *dev = gpu->dev;
 	struct msm_drm_private *priv = dev->dev_private;
-	uint32_t fence = gpu->funcs->last_fence(gpu);
+	struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
+	uint32_t fence = gpu->funcs->last_fence(gpu, ring);
 
-	if (fence != gpu->hangcheck_fence) {
+	if (fence != gpu->hangcheck_fence[ring->id]) {
 		/* some progress has been made.. ya! */
-		gpu->hangcheck_fence = fence;
-	} else if (fence < gpu->fctx->last_fence) {
+		gpu->hangcheck_fence[ring->id] = fence;
+	} else if (fence < ring->last_fence) {
 		/* no progress and not done.. hung! */
-		gpu->hangcheck_fence = fence;
-		dev_err(dev->dev, "%s: hangcheck detected gpu lockup!\n",
-				gpu->name);
+		gpu->hangcheck_fence[ring->id] = fence;
+		dev_err(dev->dev, "%s: hangcheck detected gpu lockup rb %d!\n",
+				gpu->name, ring->id);
 		dev_err(dev->dev, "%s:     completed fence: %u\n",
 				gpu->name, fence);
 		dev_err(dev->dev, "%s:     submitted fence: %u\n",
-				gpu->name, gpu->fctx->last_fence);
+				gpu->name, ring->last_fence);
+
 		queue_work(priv->wq, &gpu->recover_work);
 	}
 
 	/* if still more pending work, reset the hangcheck timer: */
-	if (gpu->fctx->last_fence > gpu->hangcheck_fence)
+	if (ring->last_fence > gpu->hangcheck_fence[ring->id])
 		hangcheck_timer_reset(gpu);
 
 	/* workaround for missing irq: */
@@ -426,19 +455,18 @@  static void retire_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
 static void retire_submits(struct msm_gpu *gpu)
 {
 	struct drm_device *dev = gpu->dev;
+	struct msm_gem_submit *submit, *tmp;
+	int i;
 
 	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
 
-	while (!list_empty(&gpu->submit_list)) {
-		struct msm_gem_submit *submit;
+	/* Retire the commits starting with highest priority */
+	for (i = gpu->nr_rings - 1; i >= 0; i--) {
+		struct msm_ringbuffer *ring = gpu->rb[i];
 
-		submit = list_first_entry(&gpu->submit_list,
-				struct msm_gem_submit, node);
-
-		if (dma_fence_is_signaled(submit->fence)) {
-			retire_submit(gpu, submit);
-		} else {
-			break;
+		list_for_each_entry_safe(submit, tmp, &ring->submits, node) {
+			if (dma_fence_is_signaled(submit->fence))
+				retire_submit(gpu, submit);
 		}
 	}
 }
@@ -447,9 +475,12 @@  static void retire_worker(struct work_struct *work)
 {
 	struct msm_gpu *gpu = container_of(work, struct msm_gpu, retire_work);
 	struct drm_device *dev = gpu->dev;
-	uint32_t fence = gpu->funcs->last_fence(gpu);
+	int i;
+
+	for (i = 0; i < gpu->nr_rings; i++)
+		msm_update_fence(gpu->fctx, gpu->rb[i],
+			gpu->funcs->last_fence(gpu, gpu->rb[i]));
 
-	msm_update_fence(gpu->fctx, fence);
 
 	mutex_lock(&dev->struct_mutex);
 	retire_submits(gpu);
@@ -470,6 +501,7 @@  void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 {
 	struct drm_device *dev = gpu->dev;
 	struct msm_drm_private *priv = dev->dev_private;
+	struct msm_ringbuffer *ring = submit->ring;
 	int i;
 
 	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
@@ -478,7 +510,7 @@  void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 
 	msm_gpu_hw_init(gpu);
 
-	list_add_tail(&submit->node, &gpu->submit_list);
+	list_add_tail(&submit->node, &ring->submits);
 
 	msm_rd_dump_submit(submit);
 
@@ -565,7 +597,7 @@  int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 		const char *name, struct msm_gpu_config *config)
 {
 	struct iommu_domain *iommu;
-	int ret;
+	int i, ret, nr_rings = config->nr_rings;
 
 	if (WARN_ON(gpu->num_perfcntrs > ARRAY_SIZE(gpu->last_cntrs)))
 		gpu->num_perfcntrs = ARRAY_SIZE(gpu->last_cntrs);
@@ -584,7 +616,6 @@  int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	INIT_WORK(&gpu->retire_work, retire_worker);
 	INIT_WORK(&gpu->recover_work, recover_worker);
 
-	INIT_LIST_HEAD(&gpu->submit_list);
 
 	setup_timer(&gpu->hangcheck_timer, hangcheck_handler,
 			(unsigned long)gpu);
@@ -658,40 +689,61 @@  int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 		dev_info(drm->dev, "%s: no IOMMU, fallback to VRAM carveout!\n", name);
 	}
 
-	/* Create ringbuffer: */
-	mutex_lock(&drm->struct_mutex);
-	gpu->rb = msm_ringbuffer_new(gpu, config->ringsz);
-	mutex_unlock(&drm->struct_mutex);
-	if (IS_ERR(gpu->rb)) {
-		ret = PTR_ERR(gpu->rb);
-		gpu->rb = NULL;
-		dev_err(drm->dev, "could not create ringbuffer: %d\n", ret);
-		goto fail;
+	if (nr_rings > ARRAY_SIZE(gpu->rb)) {
+		DRM_DEV_INFO_ONCE(drm->dev, "Only creating %lu ringbuffers\n",
+			ARRAY_SIZE(gpu->rb));
+		nr_rings = ARRAY_SIZE(gpu->rb);
+	}
+
+	/* Create ringbuffer(s): */
+	for (i = 0; i < nr_rings; i++) {
+		mutex_lock(&drm->struct_mutex);
+		gpu->rb[i] = msm_ringbuffer_new(gpu, i);
+		mutex_unlock(&drm->struct_mutex);
+
+		if (IS_ERR(gpu->rb[i])) {
+			ret = PTR_ERR(gpu->rb[i]);
+			dev_err(drm->dev,
+				"could not create ringbuffer %d: %d\n", i, ret);
+			goto fail;
+		}
 	}
 
+	gpu->nr_rings = nr_rings;
+
 	gpu->pdev = pdev;
 	platform_set_drvdata(pdev, gpu);
 
+
 	bs_init(gpu);
 
 	return 0;
 
 fail:
+	for (i = 0; i < nr_rings; i++) {
+		msm_ringbuffer_destroy(gpu->rb[i]);
+		gpu->rb[i] = NULL;
+	}
+
 	return ret;
 }
 
 void msm_gpu_cleanup(struct msm_gpu *gpu)
 {
+	int i;
+
 	DBG("%s", gpu->name);
 
 	WARN_ON(!list_empty(&gpu->active_list));
 
 	bs_fini(gpu);
 
-	if (gpu->rb) {
-		if (gpu->rb_iova)
-			msm_gem_put_iova(gpu->rb->bo, gpu->aspace);
-		msm_ringbuffer_destroy(gpu->rb);
+	for (i = 0; i < gpu->nr_rings; i++) {
+		if (gpu->rb[i]->iova)
+			msm_gem_put_iova(gpu->rb[i]->bo, gpu->aspace);
+
+		msm_ringbuffer_destroy(gpu->rb[i]);
+		gpu->rb[i] = NULL;
 	}
 
 	if (gpu->fctx)
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index ca07a21..c0e7c84 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -33,7 +33,7 @@  struct msm_gpu_config {
 	const char *irqname;
 	uint64_t va_start;
 	uint64_t va_end;
-	unsigned int ringsz;
+	unsigned int nr_rings;
 };
 
 /* So far, with hardware that I've seen to date, we can have:
@@ -57,9 +57,11 @@  struct msm_gpu_funcs {
 	int (*pm_resume)(struct msm_gpu *gpu);
 	void (*submit)(struct msm_gpu *gpu, struct msm_gem_submit *submit,
 			struct msm_file_private *ctx);
-	void (*flush)(struct msm_gpu *gpu);
+	void (*flush)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
 	irqreturn_t (*irq)(struct msm_gpu *irq);
-	uint32_t (*last_fence)(struct msm_gpu *gpu);
+	uint32_t (*last_fence)(struct msm_gpu *gpu,
+			struct msm_ringbuffer *ring);
+	struct msm_ringbuffer *(*active_ring)(struct msm_gpu *gpu);
 	void (*recover)(struct msm_gpu *gpu);
 	void (*destroy)(struct msm_gpu *gpu);
 #ifdef CONFIG_DEBUG_FS
@@ -86,9 +88,8 @@  struct msm_gpu {
 	const struct msm_gpu_perfcntr *perfcntrs;
 	uint32_t num_perfcntrs;
 
-	/* ringbuffer: */
-	struct msm_ringbuffer *rb;
-	uint64_t rb_iova;
+	struct msm_ringbuffer *rb[MSM_GPU_MAX_RINGS];
+	int nr_rings;
 
 	/* list of GEM active objects: */
 	struct list_head active_list;
@@ -126,10 +127,8 @@  struct msm_gpu {
 #define DRM_MSM_HANGCHECK_PERIOD 500 /* in ms */
 #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
-	uint32_t hangcheck_fence;
+	uint32_t hangcheck_fence[MSM_GPU_MAX_RINGS];
 	struct work_struct recover_work;
-
-	struct list_head submit_list;
 };
 
 struct msm_gpu_drawqueue {
@@ -139,9 +138,20 @@  struct msm_gpu_drawqueue {
 	struct list_head node;
 };
 
+/* It turns out that all targets use the same ringbuffer size */
+#define MSM_GPU_RINGBUFFER_SZ SZ_32K
+
 static inline bool msm_gpu_active(struct msm_gpu *gpu)
 {
-	return gpu->fctx->last_fence > gpu->funcs->last_fence(gpu);
+	int i;
+
+	for (i = 0; i < gpu->nr_rings; i++) {
+		if (gpu->rb[i]->last_fence >
+			gpu->funcs->last_fence(gpu, gpu->rb[i]))
+			return true;
+	}
+
+	return false;
 }
 
 /* Perf-Counters:
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 67b34e0..10f1d948 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -18,13 +18,13 @@ 
 #include "msm_ringbuffer.h"
 #include "msm_gpu.h"
 
-struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size)
+struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id)
 {
 	struct msm_ringbuffer *ring;
 	int ret;
 
-	if (WARN_ON(!is_power_of_2(size)))
-		return ERR_PTR(-EINVAL);
+	/* We assume everywhere that MSM_GPU_RINGBUFFER_SZ is a power of 2 */
+	BUILD_BUG_ON(!is_power_of_2(MSM_GPU_RINGBUFFER_SZ));
 
 	ring = kzalloc(sizeof(*ring), GFP_KERNEL);
 	if (!ring) {
@@ -33,7 +33,8 @@  struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size)
 	}
 
 	ring->gpu = gpu;
-	ring->bo = msm_gem_new(gpu->dev, size, MSM_BO_WC);
+	ring->id = id;
+	ring->bo = msm_gem_new(gpu->dev, MSM_GPU_RINGBUFFER_SZ, MSM_BO_WC);
 	if (IS_ERR(ring->bo)) {
 		ret = PTR_ERR(ring->bo);
 		ring->bo = NULL;
@@ -45,21 +46,23 @@  struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size)
 		ret = PTR_ERR(ring->start);
 		goto fail;
 	}
-	ring->end   = ring->start + (size / 4);
+	ring->end   = ring->start + (MSM_GPU_RINGBUFFER_SZ >> 2);
 	ring->cur   = ring->start;
 
-	ring->size = size;
+	INIT_LIST_HEAD(&ring->submits);
 
 	return ring;
 
 fail:
-	if (ring)
-		msm_ringbuffer_destroy(ring);
+	msm_ringbuffer_destroy(ring);
 	return ERR_PTR(ret);
 }
 
 void msm_ringbuffer_destroy(struct msm_ringbuffer *ring)
 {
+	if (IS_ERR_OR_NULL(ring))
+		return;
+
 	if (ring->bo) {
 		msm_gem_put_vaddr(ring->bo);
 		drm_gem_object_unreference_unlocked(ring->bo);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.h b/drivers/gpu/drm/msm/msm_ringbuffer.h
index 6e0e104..c803364 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.h
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.h
@@ -22,12 +22,17 @@ 
 
 struct msm_ringbuffer {
 	struct msm_gpu *gpu;
-	int size;
+	int id;
 	struct drm_gem_object *bo;
 	uint32_t *start, *end, *cur;
+	uint64_t iova;
+	/* last_fence == completed_fence --> no pending work */
+	uint32_t last_fence;
+	uint32_t completed_fence;
+	struct list_head submits;
 };
 
-struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int size);
+struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id);
 void msm_ringbuffer_destroy(struct msm_ringbuffer *ring);
 
 /* ringbuffer helpers (the parts that are same for a3xx/a2xx/z180..) */
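As a quick illustration of how the reworked rbmemptr() macro in adreno_gpu.h selects a
per-ring slot, here is a standalone sketch that mirrors the patch's definitions. It takes
the iova directly instead of the adreno_gpu pointer purely to stay self-contained, and it
uses four rings for the sake of the example even though the patch defines
MSM_GPU_MAX_RINGS as 1 for now; the iova value is made up:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MSM_GPU_MAX_RINGS 4	/* example value; the patch uses 1 */

struct adreno_rbmemptrs {
	volatile uint32_t rptr[MSM_GPU_MAX_RINGS];
	volatile uint32_t fence[MSM_GPU_MAX_RINGS];
};

#define _sizeof(member) \
	sizeof(((struct adreno_rbmemptrs *) 0)->member[0])

#define _base(iova, member) \
	((iova) + offsetof(struct adreno_rbmemptrs, member))

#define rbmemptr(iova, index, member) \
	(_base((iova), member) + ((index) * _sizeof(member)))

int main(void)
{
	uint64_t memptrs_iova = 0x10000;	/* made-up GPU address */

	/* fence[0] sits right after the rptr[] array (offset 16 here) ... */
	printf("ring 0 fence iova: 0x%llx\n",
		(unsigned long long)rbmemptr(memptrs_iova, 0, fence));
	/* ... and ring 2's slot is two uint32_t slots further along */
	printf("ring 2 fence iova: 0x%llx\n",
		(unsigned long long)rbmemptr(memptrs_iova, 2, fence));
	return 0;
}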