
[v4,02/18] drm/sched: Barriers are needed for entity->last_scheduled

Message ID 20210712175352.802687-3-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Series drm/sched dependency tracking and dma-resv fixes

Commit Message

Daniel Vetter July 12, 2021, 5:53 p.m. UTC
It might be good enough on x86 with just READ_ONCE, but the write side
should then at least be WRITE_ONCE because x86 has total store order.

It's definitely not enough on arm.

Fix this properly, which means
- explain the need for the barrier in both places
- point at the other side in each comment

Also pull out the !sched_list case as the first check, so that the
code flow is clearer.

While at it sprinkle some comments around because it was very
non-obvious to me what's actually going on here and why.

Note that we really need full barriers here; at first I thought
store-release and load-acquire on ->last_scheduled would be enough,
but we actually require ordering between that and the queue state.

v2: Put smp_rmb() in the right place and fix up comment (Andrey)

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)
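
Condensed, the ordering the patch establishes looks like this (illustrative
sketch only, simplified from the diff; the real code also handles the
dma_fence references):

/* scheduler thread, drm_sched_entity_pop_job() */
entity->last_scheduled = fence;             /* A: publish the fence of this job */
smp_wmb();                                  /* pairs with the smp_rmb() below */
spsc_queue_pop(&entity->job_queue);         /* B: the queue may now become empty */

/* submit thread, drm_sched_entity_select_rq() */
if (spsc_queue_count(&entity->job_queue))   /* observes the effect of B */
        return;
smp_rmb();                                  /* pairs with the smp_wmb() above */
fence = entity->last_scheduled;             /* therefore sees A (or newer) */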

Comments

Christian König July 13, 2021, 6:35 a.m. UTC | #1
Am 12.07.21 um 19:53 schrieb Daniel Vetter:
> It might be good enough on x86 with just READ_ONCE, but the write side
> should then at least be WRITE_ONCE because x86 has total store order.
>
> It's definitely not enough on arm.
>
> Fix this proplery, which means
> - explain the need for the barrier in both places
> - point at the other side in each comment
>
> Also pull out the !sched_list case as the first check, so that the
> code flow is clearer.
>
> While at it sprinkle some comments around because it was very
> non-obvious to me what's actually going on here and why.
>
> Note that we really need full barriers here, at first I thought
> store-release and load-acquire on ->last_scheduled would be enough,
> but we actually requiring ordering between that and the queue state.
>
> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Cc: Lee Jones <lee.jones@linaro.org>
> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
>   1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index f7347c284886..89e3f6eaf519 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>   		dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
>   
>   	dma_fence_put(entity->last_scheduled);
> +
>   	entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>   
> +	/*
> +	 * If the queue is empty we allow drm_sched_entity_select_rq() to
> +	 * locklessly access ->last_scheduled. This only works if we set the
> +	 * pointer before we dequeue and if we a write barrier here.
> +	 */
> +	smp_wmb();
> +

Again, conceptually those barriers should be part of the spsc_queue
container and not added externally.

Regards,
Christian.

>   	spsc_queue_pop(&entity->job_queue);
>   	return sched_job;
>   }
> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_rq *rq;
>   
> -	if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
> +	/* single possible engine and already selected */
> +	if (!entity->sched_list)
> +		return;
> +
> +	/* queue non-empty, stay on the same engine */
> +	if (spsc_queue_count(&entity->job_queue))
>   		return;
>   
> -	fence = READ_ONCE(entity->last_scheduled);
> +	/*
> +	 * Only when the queue is empty are we guaranteed that the scheduler
> +	 * thread cannot change ->last_scheduled. To enforce ordering we need
> +	 * a read barrier here. See drm_sched_entity_pop_job() for the other
> +	 * side.
> +	 */
> +	smp_rmb();
> +
> +	fence = entity->last_scheduled;
> +
> +	/* stay on the same engine if the previous job hasn't finished */
>   	if (fence && !dma_fence_is_signaled(fence))
>   		return;
>
Daniel Vetter July 13, 2021, 6:50 a.m. UTC | #2
On Tue, Jul 13, 2021 at 8:35 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
> > It might be good enough on x86 with just READ_ONCE, but the write side
> > should then at least be WRITE_ONCE because x86 has total store order.
> >
> > It's definitely not enough on arm.
> >
> > Fix this proplery, which means
> > - explain the need for the barrier in both places
> > - point at the other side in each comment
> >
> > Also pull out the !sched_list case as the first check, so that the
> > code flow is clearer.
> >
> > While at it sprinkle some comments around because it was very
> > non-obvious to me what's actually going on here and why.
> >
> > Note that we really need full barriers here, at first I thought
> > store-release and load-acquire on ->last_scheduled would be enough,
> > but we actually requiring ordering between that and the queue state.
> >
> > v2: Put smp_rmp() in the right place and fix up comment (Andrey)
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: "Christian König" <christian.koenig@amd.com>
> > Cc: Steven Price <steven.price@arm.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > Cc: Lee Jones <lee.jones@linaro.org>
> > Cc: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
> >   1 file changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index f7347c284886..89e3f6eaf519 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> >               dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
> >
> >       dma_fence_put(entity->last_scheduled);
> > +
> >       entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
> >
> > +     /*
> > +      * If the queue is empty we allow drm_sched_entity_select_rq() to
> > +      * locklessly access ->last_scheduled. This only works if we set the
> > +      * pointer before we dequeue and if we a write barrier here.
> > +      */
> > +     smp_wmb();
> > +
>
> Again, conceptual those barriers should be part of the spsc_queue
> container and not externally.

That would be an extremely unusual API. Let's assume that your queue is
very dumb and protected by a simple lock. That's about the maximum
any user could expect.

But then you still need barriers here, because Linux locks (spinlock,
mutex) are defined to be one-way barriers: stuff that's inside is
guaranteed to be done inside, but stuff outside of the locked region
can leak in. They're load-acquire/store-release barriers. So not good
enough.
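
A hypothetical sketch of that one-way property (made-up names, not the
actual spsc_queue internals): a store done before taking the lock is
allowed to sink into the critical section, so a lockless reader on another
CPU can observe the locked update without the earlier store:

/* CPU0 */
x = 1;                          /* meant to happen before the queue update */
spin_lock(&q->lock);            /* acquire: the store above may sink into the CS... */
q->count--;                     /* ...where nothing orders it against this */
spin_unlock(&q->lock);

/* CPU1, not taking the lock (like drm_sched_entity_select_rq()) */
if (READ_ONCE(q->count) == 0)   /* can already see the decrement... */
        r = READ_ONCE(x);       /* ...and still read the old value of x */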

You really need to have barriers here, and they really all need to be
documented properly. And yes that's a shit-ton of work in drm/sched,
because it's full of yolo lockless stuff.

The other case you could make is that this works like a wakeup queue,
or similar. The rules there are:
- wake_up (i.e. pushing something into the queue) is a store-release barrier
- the woken-up side (i.e. popping an entry) is a load-acquire barrier
Which is obviously needed because otherwise you don't have coherency
for the data queued up. And again these are not the barriers you're
looking for here.
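
Sketched out (single-slot publication only, made-up names, not a real
queue implementation), those two rules amount to:

/* producer, the wake_up side: release publishes the payload */
item->data = 42;                            /* fill in the payload */
smp_store_release(&queue->head, item);      /* pairs with the acquire below */

/* consumer, the woken-up side: acquire pairs with the release above */
item = smp_load_acquire(&queue->head);
if (item)
        r = item->data;                     /* guaranteed to see the payload */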

Either way, we'd still need the comments, because it's still lockless
trickery, and every single one of them needs to have a comment on both
sides to explain what's going on.

Essentially replace spsc_queue with an llist underneath, and that's
the amount of barriers a data structure should provide. Anything else
is asking your data structure to paper over bugs in its users.

This is similar to how atomic_t is by default completely unordered,
and users need to add barriers as needed, with comments. I think this
is all to make sure people don't just write lockless algorithms
because it's a cool idea, but are forced to think this all through.
Which seems to not have happened very consistently for drm/sched, so I
guess it needs to be fixed.
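
A hypothetical example of that convention (made-up variables, nothing
drm-specific):

/* writer: the atomic op itself is unordered, the ordering is explicit */
WRITE_ONCE(shared, 1);
smp_mb__before_atomic();        /* order the store above before the inc;
                                 * pairs with the smp_rmb() in the reader */
atomic_inc(&ready);

/* reader */
if (atomic_read(&ready)) {
        smp_rmb();              /* pairs with the barrier before atomic_inc() */
        r = READ_ONCE(shared);  /* guaranteed to see the store above */
}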

I'm definitely not going to hide all that by making the spsc_queue
stuff provide random unjustified barriers just because that would
paper over drm/sched bugs. We need to fix the actual bugs, and
preferably all of them. I've found a few, but I wasn't involved in
drm/sched thus far, so best I can do is discover them as we go.
-Daniel


> Regards,
> Christian.
>
> >       spsc_queue_pop(&entity->job_queue);
> >       return sched_job;
> >   }
> > @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> >       struct drm_gpu_scheduler *sched;
> >       struct drm_sched_rq *rq;
> >
> > -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
> > +     /* single possible engine and already selected */
> > +     if (!entity->sched_list)
> > +             return;
> > +
> > +     /* queue non-empty, stay on the same engine */
> > +     if (spsc_queue_count(&entity->job_queue))
> >               return;
> >
> > -     fence = READ_ONCE(entity->last_scheduled);
> > +     /*
> > +      * Only when the queue is empty are we guaranteed that the scheduler
> > +      * thread cannot change ->last_scheduled. To enforce ordering we need
> > +      * a read barrier here. See drm_sched_entity_pop_job() for the other
> > +      * side.
> > +      */
> > +     smp_rmb();
> > +
> > +     fence = entity->last_scheduled;
> > +
> > +     /* stay on the same engine if the previous job hasn't finished */
> >       if (fence && !dma_fence_is_signaled(fence))
> >               return;
> >
>


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Christian König July 13, 2021, 7:25 a.m. UTC | #3
Am 13.07.21 um 08:50 schrieb Daniel Vetter:
> On Tue, Jul 13, 2021 at 8:35 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
>>> It might be good enough on x86 with just READ_ONCE, but the write side
>>> should then at least be WRITE_ONCE because x86 has total store order.
>>>
>>> It's definitely not enough on arm.
>>>
>>> Fix this proplery, which means
>>> - explain the need for the barrier in both places
>>> - point at the other side in each comment
>>>
>>> Also pull out the !sched_list case as the first check, so that the
>>> code flow is clearer.
>>>
>>> While at it sprinkle some comments around because it was very
>>> non-obvious to me what's actually going on here and why.
>>>
>>> Note that we really need full barriers here, at first I thought
>>> store-release and load-acquire on ->last_scheduled would be enough,
>>> but we actually requiring ordering between that and the queue state.
>>>
>>> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
>>>
>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>> Cc: "Christian König" <christian.koenig@amd.com>
>>> Cc: Steven Price <steven.price@arm.com>
>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Cc: Lee Jones <lee.jones@linaro.org>
>>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
>>>    1 file changed, 25 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index f7347c284886..89e3f6eaf519 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>                dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
>>>
>>>        dma_fence_put(entity->last_scheduled);
>>> +
>>>        entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>>>
>>> +     /*
>>> +      * If the queue is empty we allow drm_sched_entity_select_rq() to
>>> +      * locklessly access ->last_scheduled. This only works if we set the
>>> +      * pointer before we dequeue and if we a write barrier here.
>>> +      */
>>> +     smp_wmb();
>>> +
>> Again, conceptual those barriers should be part of the spsc_queue
>> container and not externally.
> That would be extremely unusual api. Let's assume that your queue is
> very dumb, and protected by a simple lock. That's about the maximum
> any user could expect.
>
> But then you still need barriers here, because linux locks (spinlock,
> mutex) are defined to be one-way barriers: Stuff that's inside is
> guaranteed to be done insinde, but stuff outside of the locked region
> can leak in. They're load-acquire/store-release barriers. So not good
> enough.
>
> You really need to have barriers here, and they really all need to be
> documented properly. And yes that's a shit-ton of work in drm/sched,
> because it's full of yolo lockless stuff.
>
> The other case you could make is that this works like a wakeup queue,
> or similar. The rules there are:
> - wake_up (i.e. pushing something into the queue) is a store-release barrier
> - the waked up (i.e. popping an entry) is a load acquire barrier
> Which is obviuosly needed because otherwise you don't have coherency
> for the data queued up. And again not the barriers you're locking for
> here.

Exactly that was the idea, yes.

> Either way, we'd still need the comments, because it's still lockless
> trickery, and every single one of that needs to have a comment on both
> sides to explain what's going on.
>
> Essentially replace spsc_queue with an llist underneath, and that's
> the amount of barriers a data structure should provide. Anything else
> is asking your datastructure to paper over bugs in your users.
>
> This is similar to how atomic_t is by default completely unordered,
> and users need to add barriers as needed, with comments.

My main problem is, as always, that kernel atomics work differently than
userspace atomics.

> I think this is all to make sure people don't just write lockless algorithms
> because it's a cool idea, but are forced to think this all through.
> Which seems to not have happened very consistently for drm/sched, so I
> guess needs to be fixed.

Well at least initially that was all perfectly thought through. The 
problem is nobody is really maintaining that stuff.

> I'm definitely not going to hide all that by making the spsc_queue
> stuff provide random unjustified barriers just because that would
> paper over drm/sched bugs. We need to fix the actual bugs, and
> preferrable all of them. I've found a few, but I wasn't involved in
> drm/sched thus far, so best I can do is discover them as we go.

I don't think that those are random unjustified barriers at all, and it 
sounds like you didn't grasp what I said here.

See, the spsc queue must have the following semantics:

1. When you pop a job, all changes made before you push the job must be 
visible.

2. When the queue becomes empty, all changes made before you pop the 
last job must be visible.
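
A sketch of what baking those two guarantees into the container could look
like (illustrative only, simplified to a single-slot queue with made-up
names, not the actual include/drm/spsc_queue.h code):

#include <linux/atomic.h>

struct my_node { struct my_node *next; };
struct my_spsc { struct my_node *head; atomic_t count; };

static inline void my_spsc_push(struct my_spsc *q, struct my_node *n)
{
        /* rule 1: everything the producer wrote before the push becomes
         * visible to the consumer that pops n; pairs with the
         * load-acquire in my_spsc_pop()
         */
        smp_store_release(&q->head, n);
        atomic_inc(&q->count);
}

static inline struct my_node *my_spsc_pop(struct my_spsc *q)
{
        struct my_node *n = smp_load_acquire(&q->head); /* pairs with push */

        if (!n)
                return NULL;
        WRITE_ONCE(q->head, NULL);
        /* rule 2: everything the consumer wrote before this pop becomes
         * visible to anybody who then sees the queue as empty; pairs
         * with the smp_rmb() in my_spsc_empty()
         */
        smp_mb__before_atomic();
        atomic_dec(&q->count);
        return n;
}

static inline bool my_spsc_empty(struct my_spsc *q)
{
        bool empty = !atomic_read(&q->count);

        smp_rmb();      /* pairs with the barrier before the decrement in pop */
        return empty;
}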

Otherwise I completely agree with you that the whole scheduler doesn't 
work at all and we need to add tons of external barriers.

Regards,
Christian.

> -Daniel
>
>
>> Regards,
>> Christian.
>>
>>>        spsc_queue_pop(&entity->job_queue);
>>>        return sched_job;
>>>    }
>>> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>        struct drm_gpu_scheduler *sched;
>>>        struct drm_sched_rq *rq;
>>>
>>> -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
>>> +     /* single possible engine and already selected */
>>> +     if (!entity->sched_list)
>>> +             return;
>>> +
>>> +     /* queue non-empty, stay on the same engine */
>>> +     if (spsc_queue_count(&entity->job_queue))
>>>                return;
>>>
>>> -     fence = READ_ONCE(entity->last_scheduled);
>>> +     /*
>>> +      * Only when the queue is empty are we guaranteed that the scheduler
>>> +      * thread cannot change ->last_scheduled. To enforce ordering we need
>>> +      * a read barrier here. See drm_sched_entity_pop_job() for the other
>>> +      * side.
>>> +      */
>>> +     smp_rmb();
>>> +
>>> +     fence = entity->last_scheduled;
>>> +
>>> +     /* stay on the same engine if the previous job hasn't finished */
>>>        if (fence && !dma_fence_is_signaled(fence))
>>>                return;
>>>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter July 13, 2021, 9:10 a.m. UTC | #4
On Tue, Jul 13, 2021 at 9:25 AM Christian König
<christian.koenig@amd.com> wrote:
> Am 13.07.21 um 08:50 schrieb Daniel Vetter:
> > On Tue, Jul 13, 2021 at 8:35 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
> >>> It might be good enough on x86 with just READ_ONCE, but the write side
> >>> should then at least be WRITE_ONCE because x86 has total store order.
> >>>
> >>> It's definitely not enough on arm.
> >>>
> >>> Fix this proplery, which means
> >>> - explain the need for the barrier in both places
> >>> - point at the other side in each comment
> >>>
> >>> Also pull out the !sched_list case as the first check, so that the
> >>> code flow is clearer.
> >>>
> >>> While at it sprinkle some comments around because it was very
> >>> non-obvious to me what's actually going on here and why.
> >>>
> >>> Note that we really need full barriers here, at first I thought
> >>> store-release and load-acquire on ->last_scheduled would be enough,
> >>> but we actually requiring ordering between that and the queue state.
> >>>
> >>> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
> >>>
> >>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >>> Cc: "Christian König" <christian.koenig@amd.com>
> >>> Cc: Steven Price <steven.price@arm.com>
> >>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>> Cc: Lee Jones <lee.jones@linaro.org>
> >>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> >>> ---
> >>>    drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
> >>>    1 file changed, 25 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> >>> index f7347c284886..89e3f6eaf519 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> >>> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> >>>                dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
> >>>
> >>>        dma_fence_put(entity->last_scheduled);
> >>> +
> >>>        entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
> >>>
> >>> +     /*
> >>> +      * If the queue is empty we allow drm_sched_entity_select_rq() to
> >>> +      * locklessly access ->last_scheduled. This only works if we set the
> >>> +      * pointer before we dequeue and if we a write barrier here.
> >>> +      */
> >>> +     smp_wmb();
> >>> +
> >> Again, conceptual those barriers should be part of the spsc_queue
> >> container and not externally.
> > That would be extremely unusual api. Let's assume that your queue is
> > very dumb, and protected by a simple lock. That's about the maximum
> > any user could expect.
> >
> > But then you still need barriers here, because linux locks (spinlock,
> > mutex) are defined to be one-way barriers: Stuff that's inside is
> > guaranteed to be done insinde, but stuff outside of the locked region
> > can leak in. They're load-acquire/store-release barriers. So not good
> > enough.
> >
> > You really need to have barriers here, and they really all need to be
> > documented properly. And yes that's a shit-ton of work in drm/sched,
> > because it's full of yolo lockless stuff.
> >
> > The other case you could make is that this works like a wakeup queue,
> > or similar. The rules there are:
> > - wake_up (i.e. pushing something into the queue) is a store-release barrier
> > - the waked up (i.e. popping an entry) is a load acquire barrier
> > Which is obviuosly needed because otherwise you don't have coherency
> > for the data queued up. And again not the barriers you're locking for
> > here.
>
> Exactly that was the idea, yes.
>
> > Either way, we'd still need the comments, because it's still lockless
> > trickery, and every single one of that needs to have a comment on both
> > sides to explain what's going on.
> >
> > Essentially replace spsc_queue with an llist underneath, and that's
> > the amount of barriers a data structure should provide. Anything else
> > is asking your datastructure to paper over bugs in your users.
> >
> > This is similar to how atomic_t is by default completely unordered,
> > and users need to add barriers as needed, with comments.
>
> My main problem is as always that kernel atomics work different than
> userspace atomics.
>
> > I think this is all to make sure people don't just write lockless algorithms
> > because it's a cool idea, but are forced to think this all through.
> > Which seems to not have happened very consistently for drm/sched, so I
> > guess needs to be fixed.
>
> Well at least initially that was all perfectly thought through. The
> problem is nobody is really maintaining that stuff.
>
> > I'm definitely not going to hide all that by making the spsc_queue
> > stuff provide random unjustified barriers just because that would
> > paper over drm/sched bugs. We need to fix the actual bugs, and
> > preferrable all of them. I've found a few, but I wasn't involved in
> > drm/sched thus far, so best I can do is discover them as we go.
>
> I don't think that those are random unjustified barriers at all and it
> sounds like you didn't grip what I said here.
>
> See the spsc queue must have the following semantics:
>
> 1. When you pop a job all changes made before you push the job must be
> visible.

These are the standard barriers that wake-up queues also have; it's just
store-release + load-acquire.

> 2. When the queue becomes empty all the changes made before you pop the
> last job must be visible.

This is very much non-standard for a queue. I guess you could make
that part of the spsc_queue API between pop and is_empty (really we
shouldn't expose the _count() function for this), but that's all very
clever.

I think having explicit barriers in the code, with comments, is much
more robust, because it forces you to think about all this and
document it properly. There's also lockless stuff like
drm_sched.ready, which doesn't look like it's ordered at all.

E.g. there's also an rmb(); in drm_sched_entity_is_idle(), which
- probably should be an smp_rmb()
- really should document what it actually synchronizes against, and
the lack of an smp_wmb() somewhere else indicates it's probably
busted. You always need two barriers.

> Otherwise I completely agree with you that the whole scheduler doesn't
> work at all and we need to add tons of external barriers.

Imo that's what we need to do. And the most important part for
maintainability is to properly document things with comments, and the
most important part in that comment is pointing at the other side of a
barrier (since a barrier on one side only orders nothing).

Also, on x86 almost nothing here matters, because both rmb() and wmb()
are no-ops, aside from the compiler barrier, which tends to not be the
biggest issue. Only mb() does anything, because x86 is only allowed to
reorder reads ahead of writes.

So in practice it's not quite as big a disaster; imo the big thing
here is maintainability, with all these tricks just not being documented.
-Daniel

> Regards,
> Christian.
>
> > -Daniel
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>>        spsc_queue_pop(&entity->job_queue);
> >>>        return sched_job;
> >>>    }
> >>> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> >>>        struct drm_gpu_scheduler *sched;
> >>>        struct drm_sched_rq *rq;
> >>>
> >>> -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
> >>> +     /* single possible engine and already selected */
> >>> +     if (!entity->sched_list)
> >>> +             return;
> >>> +
> >>> +     /* queue non-empty, stay on the same engine */
> >>> +     if (spsc_queue_count(&entity->job_queue))
> >>>                return;
> >>>
> >>> -     fence = READ_ONCE(entity->last_scheduled);
> >>> +     /*
> >>> +      * Only when the queue is empty are we guaranteed that the scheduler
> >>> +      * thread cannot change ->last_scheduled. To enforce ordering we need
> >>> +      * a read barrier here. See drm_sched_entity_pop_job() for the other
> >>> +      * side.
> >>> +      */
> >>> +     smp_rmb();
> >>> +
> >>> +     fence = entity->last_scheduled;
> >>> +
> >>> +     /* stay on the same engine if the previous job hasn't finished */
> >>>        if (fence && !dma_fence_is_signaled(fence))
> >>>                return;
> >>>
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
Christian König July 13, 2021, 11:20 a.m. UTC | #5
Am 13.07.21 um 11:10 schrieb Daniel Vetter:
> On Tue, Jul 13, 2021 at 9:25 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 13.07.21 um 08:50 schrieb Daniel Vetter:
>>> On Tue, Jul 13, 2021 at 8:35 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
>>>>> It might be good enough on x86 with just READ_ONCE, but the write side
>>>>> should then at least be WRITE_ONCE because x86 has total store order.
>>>>>
>>>>> It's definitely not enough on arm.
>>>>>
>>>>> Fix this proplery, which means
>>>>> - explain the need for the barrier in both places
>>>>> - point at the other side in each comment
>>>>>
>>>>> Also pull out the !sched_list case as the first check, so that the
>>>>> code flow is clearer.
>>>>>
>>>>> While at it sprinkle some comments around because it was very
>>>>> non-obvious to me what's actually going on here and why.
>>>>>
>>>>> Note that we really need full barriers here, at first I thought
>>>>> store-release and load-acquire on ->last_scheduled would be enough,
>>>>> but we actually requiring ordering between that and the queue state.
>>>>>
>>>>> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
>>>>>
>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>> Cc: "Christian König" <christian.koenig@amd.com>
>>>>> Cc: Steven Price <steven.price@arm.com>
>>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Cc: Lee Jones <lee.jones@linaro.org>
>>>>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
>>>>>     1 file changed, 25 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index f7347c284886..89e3f6eaf519 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>>>                 dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
>>>>>
>>>>>         dma_fence_put(entity->last_scheduled);
>>>>> +
>>>>>         entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>>>>>
>>>>> +     /*
>>>>> +      * If the queue is empty we allow drm_sched_entity_select_rq() to
>>>>> +      * locklessly access ->last_scheduled. This only works if we set the
>>>>> +      * pointer before we dequeue and if we a write barrier here.
>>>>> +      */
>>>>> +     smp_wmb();
>>>>> +
>>>> Again, conceptual those barriers should be part of the spsc_queue
>>>> container and not externally.
>>> That would be extremely unusual api. Let's assume that your queue is
>>> very dumb, and protected by a simple lock. That's about the maximum
>>> any user could expect.
>>>
>>> But then you still need barriers here, because linux locks (spinlock,
>>> mutex) are defined to be one-way barriers: Stuff that's inside is
>>> guaranteed to be done insinde, but stuff outside of the locked region
>>> can leak in. They're load-acquire/store-release barriers. So not good
>>> enough.
>>>
>>> You really need to have barriers here, and they really all need to be
>>> documented properly. And yes that's a shit-ton of work in drm/sched,
>>> because it's full of yolo lockless stuff.
>>>
>>> The other case you could make is that this works like a wakeup queue,
>>> or similar. The rules there are:
>>> - wake_up (i.e. pushing something into the queue) is a store-release barrier
>>> - the waked up (i.e. popping an entry) is a load acquire barrier
>>> Which is obviuosly needed because otherwise you don't have coherency
>>> for the data queued up. And again not the barriers you're locking for
>>> here.
>> Exactly that was the idea, yes.
>>
>>> Either way, we'd still need the comments, because it's still lockless
>>> trickery, and every single one of that needs to have a comment on both
>>> sides to explain what's going on.
>>>
>>> Essentially replace spsc_queue with an llist underneath, and that's
>>> the amount of barriers a data structure should provide. Anything else
>>> is asking your datastructure to paper over bugs in your users.
>>>
>>> This is similar to how atomic_t is by default completely unordered,
>>> and users need to add barriers as needed, with comments.
>> My main problem is as always that kernel atomics work different than
>> userspace atomics.
>>
>>> I think this is all to make sure people don't just write lockless algorithms
>>> because it's a cool idea, but are forced to think this all through.
>>> Which seems to not have happened very consistently for drm/sched, so I
>>> guess needs to be fixed.
>> Well at least initially that was all perfectly thought through. The
>> problem is nobody is really maintaining that stuff.
>>
>>> I'm definitely not going to hide all that by making the spsc_queue
>>> stuff provide random unjustified barriers just because that would
>>> paper over drm/sched bugs. We need to fix the actual bugs, and
>>> preferrable all of them. I've found a few, but I wasn't involved in
>>> drm/sched thus far, so best I can do is discover them as we go.
>> I don't think that those are random unjustified barriers at all and it
>> sounds like you didn't grip what I said here.
>>
>> See the spsc queue must have the following semantics:
>>
>> 1. When you pop a job all changes made before you push the job must be
>> visible.
> This is the standard barriers that also wake-up queues have, it's just
> store-release+load-acquire.
>
>> 2. When the queue becomes empty all the changes made before you pop the
>> last job must be visible.
> This is very much non-standard for a queue. I guess you could make
> that part of the spsc_queue api between pop and is_empty (really we
> shouldn't expose the _count() function for this), but that's all very
> clever.

Yeah, even having count is superfluous. You can do this much more easily 
by checking whether the pointer is NULL or not.
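
For illustration (reusing the made-up my_spsc names from the sketch further
up, still hypothetical): the pop would then publish the empty state with a
store-release, and the emptiness check becomes just an acquire load of the
pointer:

/* at the end of my_spsc_pop(): everything written before the pop is
 * published together with the empty state
 */
smp_store_release(&q->head, NULL);

/* and the check pairs with that release */
static inline bool my_spsc_empty(struct my_spsc *q)
{
        return !smp_load_acquire(&q->head);
}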

>
> I think having explicit barriers in the code, with comments, is much
> more robust. Because it forces you to think about all this, and
> document it properly. Because there's also lockless stuff like
> drm_sched.ready, which doesn't look at all like it's ordered somehow.

But then you have to fix drm_sched_entity_fini() as well, which also 
relies on the same behavior.

Regards,
Christian.

>
> E.g. there's also an rmb(); in drm_sched_entity_is_idle(), which
> - probably should be an smp_rmb()
> - really should document what it actually synchronizes against, and
> the lack of an smp_wmb() somewhere else indicates it's probably
> busted. You always need two barriers.
>
>> Otherwise I completely agree with you that the whole scheduler doesn't
>> work at all and we need to add tons of external barriers.
> Imo that's what we need to do. And the most important part for
> maintainability is to properly document thing with comments, and the
> most important part in that comment is pointing at the other side of a
> barrier (since a barrier on one side only orders nothing).
>
> Also, on x86 almost nothing here matters, because both rmb() and wmb()
> are no-op. Aside from the compiler barrier, which tends to not be the
> biggest issue. Only mb() does anything, because x86 is only allowed to
> reorder reads ahead of writes.
>
> So in practice it's not quite as big a disaster, imo the big thing
> here is maintainability of all these tricks just not being documented.
> -Daniel
>
>> Regards,
>> Christian.
>>
>>> -Daniel
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>         spsc_queue_pop(&entity->job_queue);
>>>>>         return sched_job;
>>>>>     }
>>>>> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>>>         struct drm_gpu_scheduler *sched;
>>>>>         struct drm_sched_rq *rq;
>>>>>
>>>>> -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
>>>>> +     /* single possible engine and already selected */
>>>>> +     if (!entity->sched_list)
>>>>> +             return;
>>>>> +
>>>>> +     /* queue non-empty, stay on the same engine */
>>>>> +     if (spsc_queue_count(&entity->job_queue))
>>>>>                 return;
>>>>>
>>>>> -     fence = READ_ONCE(entity->last_scheduled);
>>>>> +     /*
>>>>> +      * Only when the queue is empty are we guaranteed that the scheduler
>>>>> +      * thread cannot change ->last_scheduled. To enforce ordering we need
>>>>> +      * a read barrier here. See drm_sched_entity_pop_job() for the other
>>>>> +      * side.
>>>>> +      */
>>>>> +     smp_rmb();
>>>>> +
>>>>> +     fence = entity->last_scheduled;
>>>>> +
>>>>> +     /* stay on the same engine if the previous job hasn't finished */
>>>>>         if (fence && !dma_fence_is_signaled(fence))
>>>>>                 return;
>>>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
>
Andrey Grodzovsky July 13, 2021, 4:11 p.m. UTC | #6
On 2021-07-13 5:10 a.m., Daniel Vetter wrote:
> On Tue, Jul 13, 2021 at 9:25 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 13.07.21 um 08:50 schrieb Daniel Vetter:
>>> On Tue, Jul 13, 2021 at 8:35 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
>>>>> It might be good enough on x86 with just READ_ONCE, but the write side
>>>>> should then at least be WRITE_ONCE because x86 has total store order.
>>>>>
>>>>> It's definitely not enough on arm.
>>>>>
>>>>> Fix this proplery, which means
>>>>> - explain the need for the barrier in both places
>>>>> - point at the other side in each comment
>>>>>
>>>>> Also pull out the !sched_list case as the first check, so that the
>>>>> code flow is clearer.
>>>>>
>>>>> While at it sprinkle some comments around because it was very
>>>>> non-obvious to me what's actually going on here and why.
>>>>>
>>>>> Note that we really need full barriers here, at first I thought
>>>>> store-release and load-acquire on ->last_scheduled would be enough,
>>>>> but we actually requiring ordering between that and the queue state.
>>>>>
>>>>> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
>>>>>
>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>> Cc: "Christian König" <christian.koenig@amd.com>
>>>>> Cc: Steven Price <steven.price@arm.com>
>>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Cc: Lee Jones <lee.jones@linaro.org>
>>>>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
>>>>>     1 file changed, 25 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index f7347c284886..89e3f6eaf519 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>>>                 dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
>>>>>
>>>>>         dma_fence_put(entity->last_scheduled);
>>>>> +
>>>>>         entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>>>>>
>>>>> +     /*
>>>>> +      * If the queue is empty we allow drm_sched_entity_select_rq() to
>>>>> +      * locklessly access ->last_scheduled. This only works if we set the
>>>>> +      * pointer before we dequeue and if we a write barrier here.
>>>>> +      */
>>>>> +     smp_wmb();
>>>>> +
>>>> Again, conceptual those barriers should be part of the spsc_queue
>>>> container and not externally.
>>> That would be extremely unusual api. Let's assume that your queue is
>>> very dumb, and protected by a simple lock. That's about the maximum
>>> any user could expect.
>>>
>>> But then you still need barriers here, because linux locks (spinlock,
>>> mutex) are defined to be one-way barriers: Stuff that's inside is
>>> guaranteed to be done insinde, but stuff outside of the locked region
>>> can leak in. They're load-acquire/store-release barriers. So not good
>>> enough.
>>>
>>> You really need to have barriers here, and they really all need to be
>>> documented properly. And yes that's a shit-ton of work in drm/sched,
>>> because it's full of yolo lockless stuff.
>>>
>>> The other case you could make is that this works like a wakeup queue,
>>> or similar. The rules there are:
>>> - wake_up (i.e. pushing something into the queue) is a store-release barrier
>>> - the waked up (i.e. popping an entry) is a load acquire barrier
>>> Which is obviuosly needed because otherwise you don't have coherency
>>> for the data queued up. And again not the barriers you're locking for
>>> here.
>> Exactly that was the idea, yes.
>>
>>> Either way, we'd still need the comments, because it's still lockless
>>> trickery, and every single one of that needs to have a comment on both
>>> sides to explain what's going on.
>>>
>>> Essentially replace spsc_queue with an llist underneath, and that's
>>> the amount of barriers a data structure should provide. Anything else
>>> is asking your datastructure to paper over bugs in your users.
>>>
>>> This is similar to how atomic_t is by default completely unordered,
>>> and users need to add barriers as needed, with comments.
>> My main problem is as always that kernel atomics work different than
>> userspace atomics.
>>
>>> I think this is all to make sure people don't just write lockless algorithms
>>> because it's a cool idea, but are forced to think this all through.
>>> Which seems to not have happened very consistently for drm/sched, so I
>>> guess needs to be fixed.
>> Well at least initially that was all perfectly thought through. The
>> problem is nobody is really maintaining that stuff.
>>
>>> I'm definitely not going to hide all that by making the spsc_queue
>>> stuff provide random unjustified barriers just because that would
>>> paper over drm/sched bugs. We need to fix the actual bugs, and
>>> preferrable all of them. I've found a few, but I wasn't involved in
>>> drm/sched thus far, so best I can do is discover them as we go.
>> I don't think that those are random unjustified barriers at all and it
>> sounds like you didn't grip what I said here.
>>
>> See the spsc queue must have the following semantics:
>>
>> 1. When you pop a job all changes made before you push the job must be
>> visible.
> This is the standard barriers that also wake-up queues have, it's just
> store-release+load-acquire.
>
>> 2. When the queue becomes empty all the changes made before you pop the
>> last job must be visible.
> This is very much non-standard for a queue. I guess you could make
> that part of the spsc_queue api between pop and is_empty (really we
> shouldn't expose the _count() function for this), but that's all very
> clever.
>
> I think having explicit barriers in the code, with comments, is much
> more robust. Because it forces you to think about all this, and
> document it properly. Because there's also lockless stuff like
> drm_sched.ready, which doesn't look at all like it's ordered somehow.


At least for amdgpu, after drm_sched_fini is called (setting sched.ready
= false) we call amdgpu_fence_wait_empty to ensure all in-progress jobs
are done. It seems to me, at least, that this should guarantee that all
in-flight consumers of sched.ready (those who still see sched.ready == true)
are finished, while all later consumers will see sched.ready == false and
will bail out.

On second thought, there is a gap between checking sched.ready and
inserting the HW fence for the new job, so this might still be a bug...
Looks like we need to check sched.ready again after inserting the HW fence,
and for this we will need a barrier or locking.
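
Something like the classic publish-then-recheck pattern, with full barriers
on both sides (rough sketch; insert_hw_fence() is a hypothetical placeholder,
only sched->ready and amdgpu_fence_wait_empty() are real):

/* submission side */
insert_hw_fence(ring, job);             /* publish the new fence */
smp_mb();                               /* pairs with the smp_mb() below */
if (!READ_ONCE(sched->ready)) {
        /* teardown has started and may not have waited for our fence,
         * so clean up / fail this job ourselves
         */
}

/* teardown side */
WRITE_ONCE(sched->ready, false);
smp_mb();                               /* pairs with the smp_mb() above */
amdgpu_fence_wait_empty(ring);          /* with full barriers on both sides,
                                         * either this wait sees the new fence
                                         * or the submitter sees ready == false */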

Andrey

>
> E.g. there's also an rmb(); in drm_sched_entity_is_idle(), which
> - probably should be an smp_rmb()
> - really should document what it actually synchronizes against, and
> the lack of an smp_wmb() somewhere else indicates it's probably
> busted. You always need two barriers.
>
>> Otherwise I completely agree with you that the whole scheduler doesn't
>> work at all and we need to add tons of external barriers.
> Imo that's what we need to do. And the most important part for
> maintainability is to properly document thing with comments, and the
> most important part in that comment is pointing at the other side of a
> barrier (since a barrier on one side only orders nothing).
>
> Also, on x86 almost nothing here matters, because both rmb() and wmb()
> are no-op. Aside from the compiler barrier, which tends to not be the
> biggest issue. Only mb() does anything, because x86 is only allowed to
> reorder reads ahead of writes.
>
> So in practice it's not quite as big a disaster, imo the big thing
> here is maintainability of all these tricks just not being documented.
> -Daniel
>
>> Regards,
>> Christian.
>>
>>> -Daniel
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>         spsc_queue_pop(&entity->job_queue);
>>>>>         return sched_job;
>>>>>     }
>>>>> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>>>         struct drm_gpu_scheduler *sched;
>>>>>         struct drm_sched_rq *rq;
>>>>>
>>>>> -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
>>>>> +     /* single possible engine and already selected */
>>>>> +     if (!entity->sched_list)
>>>>> +             return;
>>>>> +
>>>>> +     /* queue non-empty, stay on the same engine */
>>>>> +     if (spsc_queue_count(&entity->job_queue))
>>>>>                 return;
>>>>>
>>>>> -     fence = READ_ONCE(entity->last_scheduled);
>>>>> +     /*
>>>>> +      * Only when the queue is empty are we guaranteed that the scheduler
>>>>> +      * thread cannot change ->last_scheduled. To enforce ordering we need
>>>>> +      * a read barrier here. See drm_sched_entity_pop_job() for the other
>>>>> +      * side.
>>>>> +      */
>>>>> +     smp_rmb();
>>>>> +
>>>>> +     fence = entity->last_scheduled;
>>>>> +
>>>>> +     /* stay on the same engine if the previous job hasn't finished */
>>>>>         if (fence && !dma_fence_is_signaled(fence))
>>>>>                 return;
>>>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
> >>> http://blog.ffwll.ch
>
Daniel Vetter July 13, 2021, 4:45 p.m. UTC | #7
On Tue, Jul 13, 2021 at 6:11 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
> On 2021-07-13 5:10 a.m., Daniel Vetter wrote:
> > On Tue, Jul 13, 2021 at 9:25 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 13.07.21 um 08:50 schrieb Daniel Vetter:
> >>> On Tue, Jul 13, 2021 at 8:35 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
> >>>>> It might be good enough on x86 with just READ_ONCE, but the write side
> >>>>> should then at least be WRITE_ONCE because x86 has total store order.
> >>>>>
> >>>>> It's definitely not enough on arm.
> >>>>>
> >>>>> Fix this proplery, which means
> >>>>> - explain the need for the barrier in both places
> >>>>> - point at the other side in each comment
> >>>>>
> >>>>> Also pull out the !sched_list case as the first check, so that the
> >>>>> code flow is clearer.
> >>>>>
> >>>>> While at it sprinkle some comments around because it was very
> >>>>> non-obvious to me what's actually going on here and why.
> >>>>>
> >>>>> Note that we really need full barriers here, at first I thought
> >>>>> store-release and load-acquire on ->last_scheduled would be enough,
> >>>>> but we actually requiring ordering between that and the queue state.
> >>>>>
> >>>>> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
> >>>>>
> >>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >>>>> Cc: "Christian König" <christian.koenig@amd.com>
> >>>>> Cc: Steven Price <steven.price@arm.com>
> >>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>>>> Cc: Lee Jones <lee.jones@linaro.org>
> >>>>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> >>>>> ---
> >>>>>     drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
> >>>>>     1 file changed, 25 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> >>>>> index f7347c284886..89e3f6eaf519 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> >>>>> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> >>>>>                 dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
> >>>>>
> >>>>>         dma_fence_put(entity->last_scheduled);
> >>>>> +
> >>>>>         entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
> >>>>>
> >>>>> +     /*
> >>>>> +      * If the queue is empty we allow drm_sched_entity_select_rq() to
> >>>>> +      * locklessly access ->last_scheduled. This only works if we set the
> >>>>> +      * pointer before we dequeue and if we a write barrier here.
> >>>>> +      */
> >>>>> +     smp_wmb();
> >>>>> +
> >>>> Again, conceptual those barriers should be part of the spsc_queue
> >>>> container and not externally.
> >>> That would be extremely unusual api. Let's assume that your queue is
> >>> very dumb, and protected by a simple lock. That's about the maximum
> >>> any user could expect.
> >>>
> >>> But then you still need barriers here, because linux locks (spinlock,
> >>> mutex) are defined to be one-way barriers: Stuff that's inside is
> >>> guaranteed to be done insinde, but stuff outside of the locked region
> >>> can leak in. They're load-acquire/store-release barriers. So not good
> >>> enough.
> >>>
> >>> You really need to have barriers here, and they really all need to be
> >>> documented properly. And yes that's a shit-ton of work in drm/sched,
> >>> because it's full of yolo lockless stuff.
> >>>
> >>> The other case you could make is that this works like a wakeup queue,
> >>> or similar. The rules there are:
> >>> - wake_up (i.e. pushing something into the queue) is a store-release barrier
> >>> - the waked up (i.e. popping an entry) is a load acquire barrier
> >>> Which is obviuosly needed because otherwise you don't have coherency
> >>> for the data queued up. And again not the barriers you're locking for
> >>> here.
> >> Exactly that was the idea, yes.
> >>
> >>> Either way, we'd still need the comments, because it's still lockless
> >>> trickery, and every single one of that needs to have a comment on both
> >>> sides to explain what's going on.
> >>>
> >>> Essentially replace spsc_queue with an llist underneath, and that's
> >>> the amount of barriers a data structure should provide. Anything else
> >>> is asking your datastructure to paper over bugs in your users.
> >>>
> >>> This is similar to how atomic_t is by default completely unordered,
> >>> and users need to add barriers as needed, with comments.
> >> My main problem is as always that kernel atomics work different than
> >> userspace atomics.
> >>
> >>> I think this is all to make sure people don't just write lockless algorithms
> >>> because it's a cool idea, but are forced to think this all through.
> >>> Which seems to not have happened very consistently for drm/sched, so I
> >>> guess needs to be fixed.
> >> Well at least initially that was all perfectly thought through. The
> >> problem is nobody is really maintaining that stuff.
> >>
> >>> I'm definitely not going to hide all that by making the spsc_queue
> >>> stuff provide random unjustified barriers just because that would
> >>> paper over drm/sched bugs. We need to fix the actual bugs, and
> >>> preferrable all of them. I've found a few, but I wasn't involved in
> >>> drm/sched thus far, so best I can do is discover them as we go.
> >> I don't think that those are random unjustified barriers at all and it
> >> sounds like you didn't grip what I said here.
> >>
> >> See the spsc queue must have the following semantics:
> >>
> >> 1. When you pop a job all changes made before you push the job must be
> >> visible.
> > This is the standard barriers that also wake-up queues have, it's just
> > store-release+load-acquire.
> >
> >> 2. When the queue becomes empty all the changes made before you pop the
> >> last job must be visible.
> > This is very much non-standard for a queue. I guess you could make
> > that part of the spsc_queue api between pop and is_empty (really we
> > shouldn't expose the _count() function for this), but that's all very
> > clever.
> >
> > I think having explicit barriers in the code, with comments, is much
> > more robust. Because it forces you to think about all this, and
> > document it properly. Because there's also lockless stuff like
> > drm_sched.ready, which doesn't look at all like it's ordered somehow.
>
>
> At least for amdgpu, after drm_sched_fini is called (setting sched.ready
> = false)
> we call amdgpu_fence_wait_empty to ensure all in progress jobs are done.
> Seems to me at least, this should guarantee that all in flight consumers
> of sched.ready (those who still see sched.ready == true) are finished while
> all later consumers will see sched.ready == fakle and will bail out.
>
> On second thought there is a gap between checking for sched.ready and
> inserting
> the HW fence for the new job so this might still be a bug... Looks like
> we need to check for
> sched.ready after inserting the HW fence  and for this we will need
> barrier or locking.

Yeah, and at that point I think it's good to split drm_sched.ready
from a new thing for when the hw died, like drm_sched.wedged or
.hw_death or similar, so that we can tell them apart. Trying to submit
a job to a non-ready scheduler is a driver bug and should WARN, while
submitting a job to a dead scheduler should probably result in -EIO
being returned to userspace (instead of the current -ENOENT, assuming
I haven't missed an errno remapping somewhere in amdgpu).

Also, then you could do a drm_sched_die() or similar function which
combines setting the hw_died with the right barriers and cleaning up
all the jobs.
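
Rough sketch of what such a helper could look like (hypothetical: the
hw_died flag and the job-iteration helper are made up, only the dma_fence
calls and s_fence->finished exist as shown):

void drm_sched_die(struct drm_gpu_scheduler *sched)
{
        struct drm_sched_job *job;

        WRITE_ONCE(sched->hw_died, true);       /* hypothetical new flag */
        /* pairs with a barrier/re-check in the submission path, so that a
         * job which raced past its hw_died check is either seen below or
         * fails its own re-check
         */
        smp_mb();

        /* fail everything still queued without calling into the driver */
        while ((job = sched_fetch_queued_job(sched))) { /* hypothetical helper */
                dma_fence_set_error(&job->s_fence->finished, -EIO);
                dma_fence_signal(&job->s_fence->finished);
        }
}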

Wrt the fundamental race: I think that's not easily fixable, so maybe
the scheduler thread also needs to handle this and immediately fail
these jobs by setting all fences to -EIO and completing them, without
even calling into the driver. If you try to catch this synchronously I
think it would require some kind of locking in push_job, plus failure
handling, which would be a) slow and b) really ugly in the driver code.
Just accepting that some jobs can slip through and letting the
scheduler thread clean them up is I think cleaner.

If userspace then goes ahead and closes the ctx before all the jobs
are cleaned up we can handle that with the normal drm_sched_entity
cleanup logic. Which would be another reason to split normal cleanup
from hw death.
-Daniel

> Andrey
>
> >
> > E.g. there's also an rmb(); in drm_sched_entity_is_idle(), which
> > - probably should be an smp_rmb()
> > - really should document what it actually synchronizes against, and
> > the lack of an smp_wmb() somewhere else indicates it's probably
> > busted. You always need two barriers.
> >
> >> Otherwise I completely agree with you that the whole scheduler doesn't
> >> work at all and we need to add tons of external barriers.
> > Imo that's what we need to do. And the most important part for
> > maintainability is to properly document thing with comments, and the
> > most important part in that comment is pointing at the other side of a
> > barrier (since a barrier on one side only orders nothing).
> >
> > Also, on x86 almost nothing here matters, because both rmb() and wmb()
> > are no-op. Aside from the compiler barrier, which tends to not be the
> > biggest issue. Only mb() does anything, because x86 is only allowed to
> > reorder reads ahead of writes.
> >
> > So in practice it's not quite as big a disaster, imo the big thing
> > here is maintainability of all these tricks just not being documented.
> > -Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> -Daniel
> >>>
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>         spsc_queue_pop(&entity->job_queue);
> >>>>>         return sched_job;
> >>>>>     }
> >>>>> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> >>>>>         struct drm_gpu_scheduler *sched;
> >>>>>         struct drm_sched_rq *rq;
> >>>>>
> >>>>> -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
> >>>>> +     /* single possible engine and already selected */
> >>>>> +     if (!entity->sched_list)
> >>>>> +             return;
> >>>>> +
> >>>>> +     /* queue non-empty, stay on the same engine */
> >>>>> +     if (spsc_queue_count(&entity->job_queue))
> >>>>>                 return;
> >>>>>
> >>>>> -     fence = READ_ONCE(entity->last_scheduled);
> >>>>> +     /*
> >>>>> +      * Only when the queue is empty are we guaranteed that the scheduler
> >>>>> +      * thread cannot change ->last_scheduled. To enforce ordering we need
> >>>>> +      * a read barrier here. See drm_sched_entity_pop_job() for the other
> >>>>> +      * side.
> >>>>> +      */
> >>>>> +     smp_rmb();
> >>>>> +
> >>>>> +     fence = entity->last_scheduled;
> >>>>> +
> >>>>> +     /* stay on the same engine if the previous job hasn't finished */
> >>>>>         if (fence && !dma_fence_is_signaled(fence))
> >>>>>                 return;
> >>>>>
> >>> --
> >>> Daniel Vetter
> >>> Software Engineer, Intel Corporation
> >>> http://blog.ffwll.ch/
> >
Andrey Grodzovsky July 14, 2021, 10:12 p.m. UTC | #8
On 2021-07-13 12:45 p.m., Daniel Vetter wrote:
> On Tue, Jul 13, 2021 at 6:11 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
>> On 2021-07-13 5:10 a.m., Daniel Vetter wrote:
>>> On Tue, Jul 13, 2021 at 9:25 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 13.07.21 um 08:50 schrieb Daniel Vetter:
>>>>> On Tue, Jul 13, 2021 at 8:35 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 12.07.21 um 19:53 schrieb Daniel Vetter:
>>>>>>> It might be good enough on x86 with just READ_ONCE, but the write side
>>>>>>> should then at least be WRITE_ONCE because x86 has total store order.
>>>>>>>
>>>>>>> It's definitely not enough on arm.
>>>>>>>
>>>>>>> Fix this proplery, which means
>>>>>>> - explain the need for the barrier in both places
>>>>>>> - point at the other side in each comment
>>>>>>>
>>>>>>> Also pull out the !sched_list case as the first check, so that the
>>>>>>> code flow is clearer.
>>>>>>>
>>>>>>> While at it sprinkle some comments around because it was very
>>>>>>> non-obvious to me what's actually going on here and why.
>>>>>>>
>>>>>>> Note that we really need full barriers here, at first I thought
>>>>>>> store-release and load-acquire on ->last_scheduled would be enough,
>>>>>>> but we actually requiring ordering between that and the queue state.
>>>>>>>
>>>>>>> v2: Put smp_rmp() in the right place and fix up comment (Andrey)
>>>>>>>
>>>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>> Cc: "Christian König" <christian.koenig@amd.com>
>>>>>>> Cc: Steven Price <steven.price@arm.com>
>>>>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Cc: Lee Jones <lee.jones@linaro.org>
>>>>>>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
>>>>>>>      1 file changed, 25 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> index f7347c284886..89e3f6eaf519 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>> @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>>>>>                  dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
>>>>>>>
>>>>>>>          dma_fence_put(entity->last_scheduled);
>>>>>>> +
>>>>>>>          entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
>>>>>>>
>>>>>>> +     /*
>>>>>>> +      * If the queue is empty we allow drm_sched_entity_select_rq() to
>>>>>>> +      * locklessly access ->last_scheduled. This only works if we set the
>>>>>>> +      * pointer before we dequeue and if we a write barrier here.
>>>>>>> +      */
>>>>>>> +     smp_wmb();
>>>>>>> +
>>>>>> Again, conceptual those barriers should be part of the spsc_queue
>>>>>> container and not externally.
>>>>> That would be extremely unusual api. Let's assume that your queue is
>>>>> very dumb, and protected by a simple lock. That's about the maximum
>>>>> any user could expect.
>>>>>
>>>>> But then you still need barriers here, because linux locks (spinlock,
>>>>> mutex) are defined to be one-way barriers: Stuff that's inside is
>>>>> guaranteed to be done insinde, but stuff outside of the locked region
>>>>> can leak in. They're load-acquire/store-release barriers. So not good
>>>>> enough.
>>>>>
>>>>> You really need to have barriers here, and they really all need to be
>>>>> documented properly. And yes that's a shit-ton of work in drm/sched,
>>>>> because it's full of yolo lockless stuff.
>>>>>
>>>>> The other case you could make is that this works like a wakeup queue,
>>>>> or similar. The rules there are:
>>>>> - wake_up (i.e. pushing something into the queue) is a store-release barrier
>>>>> - the waked up (i.e. popping an entry) is a load acquire barrier
>>>>> Which is obviuosly needed because otherwise you don't have coherency
>>>>> for the data queued up. And again not the barriers you're locking for
>>>>> here.
>>>> Exactly that was the idea, yes.
>>>>
>>>>> Either way, we'd still need the comments, because it's still lockless
>>>>> trickery, and every single one of that needs to have a comment on both
>>>>> sides to explain what's going on.
>>>>>
>>>>> Essentially replace spsc_queue with an llist underneath, and that's
>>>>> the amount of barriers a data structure should provide. Anything else
>>>>> is asking your datastructure to paper over bugs in your users.
>>>>>
>>>>> This is similar to how atomic_t is by default completely unordered,
>>>>> and users need to add barriers as needed, with comments.
>>>> My main problem is as always that kernel atomics work different than
>>>> userspace atomics.
>>>>
>>>>> I think this is all to make sure people don't just write lockless algorithms
>>>>> because it's a cool idea, but are forced to think this all through.
>>>>> Which seems to not have happened very consistently for drm/sched, so I
>>>>> guess needs to be fixed.
>>>> Well at least initially that was all perfectly thought through. The
>>>> problem is nobody is really maintaining that stuff.
>>>>
>>>>> I'm definitely not going to hide all that by making the spsc_queue
>>>>> stuff provide random unjustified barriers just because that would
>>>>> paper over drm/sched bugs. We need to fix the actual bugs, and
>>>>> preferrable all of them. I've found a few, but I wasn't involved in
>>>>> drm/sched thus far, so best I can do is discover them as we go.
>>>> I don't think that those are random unjustified barriers at all and it
>>>> sounds like you didn't grip what I said here.
>>>>
>>>> See the spsc queue must have the following semantics:
>>>>
>>>> 1. When you pop a job all changes made before you push the job must be
>>>> visible.
>>> This is the standard barriers that also wake-up queues have, it's just
>>> store-release+load-acquire.
>>>
>>>> 2. When the queue becomes empty all the changes made before you pop the
>>>> last job must be visible.
>>> This is very much non-standard for a queue. I guess you could make
>>> that part of the spsc_queue api between pop and is_empty (really we
>>> shouldn't expose the _count() function for this), but that's all very
>>> clever.
>>>
>>> I think having explicit barriers in the code, with comments, is much
>>> more robust. Because it forces you to think about all this, and
>>> document it properly. Because there's also lockless stuff like
>>> drm_sched.ready, which doesn't look at all like it's ordered somehow.
>>
>> At least for amdgpu, after drm_sched_fini is called (setting sched.ready
>> = false)
>> we call amdgpu_fence_wait_empty to ensure all in progress jobs are done.
>> Seems to me at least, this should guarantee that all in flight consumers
>> of sched.ready (those who still see sched.ready == true) are finished while
>> all later consumers will see sched.ready == fakle and will bail out.
>>
>> On second thought there is a gap between checking for sched.ready and
>> inserting
>> the HW fence for the new job so this might still be a bug... Looks like
>> we need to check for
>> sched.ready after inserting the HW fence  and for this we will need
>> barrier or locking.
> Yeah, and at that point I think it's good to split up drm_sched.ready
> from a new thing for when the hw died, like drm_sched.wedged or
> .hw_death or similar, so that we can tell them apart. Trying to submit
> a job to a non-ready scheduler is a driver bug and should WARN, while
> submitting a job to a dead scheduler should probably result in -EIO
> being returned to userspace (instead of the current -ENOENT, assuming
> I haven't missed a errno remapping code somewhere in amdgpu).
>
> Also, then you could do a drm_sched_die() or similar function which
> combines setting the hw_died with the right barriers and cleaning up
> all the jobs.
>
> Wrt the fundamental race: I think that's not fixeable easily, so maybe
> the scheduler thread also needs to handle this and immediately fail
> these jobs by setting all fences to -EIO and completing them, without
> even calling into the driver. If you try to catch this synchronously I
> think it would require some kind of locking in push_job, plus failure
> handling, which would be a) slow and b) real ugly in the driver code.
> Just accepting that some jobs can slip through and letting the
> scheduler thread clean them up is I think cleaner.


I agree about moving this check to the scheduler thread. I also don't
quite understand why some places that clearly run after the job has
already been picked up by its scheduler thread, such as
amdgpu_ib_schedule, still check sched.ready... what's the point? Also,
there are direct submission cases where IB insertion into the HW ring
is done without any scheduler involvement at all, and in that case it
is even less clear why we care that the scheduler is not ready.

Andrey


>
> If userspace then goes ahead and closes the ctx before all the jobs
> are cleaned up we can handle that with the normal drm_sched_entity
> cleanup logic. Which would be another reason to split normal cleanup
> from hw death.
> -Daniel
>
>> Andrey
>>
>>> E.g. there's also an rmb(); in drm_sched_entity_is_idle(), which
>>> - probably should be an smp_rmb()
>>> - really should document what it actually synchronizes against, and
>>> the lack of an smp_wmb() somewhere else indicates it's probably
>>> busted. You always need two barriers.
>>>
>>>> Otherwise I completely agree with you that the whole scheduler doesn't
>>>> work at all and we need to add tons of external barriers.
>>> Imo that's what we need to do. And the most important part for
>>> maintainability is to properly document thing with comments, and the
>>> most important part in that comment is pointing at the other side of a
>>> barrier (since a barrier on one side only orders nothing).
>>>
>>> Also, on x86 almost nothing here matters, because both rmb() and wmb()
>>> are no-op. Aside from the compiler barrier, which tends to not be the
>>> biggest issue. Only mb() does anything, because x86 is only allowed to
>>> reorder reads ahead of writes.
>>>
>>> So in practice it's not quite as big a disaster, imo the big thing
>>> here is maintainability of all these tricks just not being documented.
>>> -Daniel
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> -Daniel
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>          spsc_queue_pop(&entity->job_queue);
>>>>>>>          return sched_job;
>>>>>>>      }
>>>>>>> @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>>>>>>          struct drm_gpu_scheduler *sched;
>>>>>>>          struct drm_sched_rq *rq;
>>>>>>>
>>>>>>> -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
>>>>>>> +     /* single possible engine and already selected */
>>>>>>> +     if (!entity->sched_list)
>>>>>>> +             return;
>>>>>>> +
>>>>>>> +     /* queue non-empty, stay on the same engine */
>>>>>>> +     if (spsc_queue_count(&entity->job_queue))
>>>>>>>                  return;
>>>>>>>
>>>>>>> -     fence = READ_ONCE(entity->last_scheduled);
>>>>>>> +     /*
>>>>>>> +      * Only when the queue is empty are we guaranteed that the scheduler
>>>>>>> +      * thread cannot change ->last_scheduled. To enforce ordering we need
>>>>>>> +      * a read barrier here. See drm_sched_entity_pop_job() for the other
>>>>>>> +      * side.
>>>>>>> +      */
>>>>>>> +     smp_rmb();
>>>>>>> +
>>>>>>> +     fence = entity->last_scheduled;
>>>>>>> +
>>>>>>> +     /* stay on the same engine if the previous job hasn't finished */
>>>>>>>          if (fence && !dma_fence_is_signaled(fence))
>>>>>>>                  return;
>>>>>>>
>>>>> --
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
> >>>>> http://blog.ffwll.ch/
>
>
Daniel Vetter July 15, 2021, 10:16 a.m. UTC | #9
On Wed, Jul 14, 2021 at 06:12:54PM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-07-13 12:45 p.m., Daniel Vetter wrote:
> > On Tue, Jul 13, 2021 at 6:11 PM Andrey Grodzovsky
> > <andrey.grodzovsky@amd.com> wrote:
> > > On 2021-07-13 5:10 a.m., Daniel Vetter wrote:
> > > > On Tue, Jul 13, 2021 at 9:25 AM Christian König
> > > > <christian.koenig@amd.com> wrote:
> > > > > Am 13.07.21 um 08:50 schrieb Daniel Vetter:
> > > > > > On Tue, Jul 13, 2021 at 8:35 AM Christian König
> > > > > > <christian.koenig@amd.com> wrote:
> > > > > > > Am 12.07.21 um 19:53 schrieb Daniel Vetter:
> > > > > > > > It might be good enough on x86 with just READ_ONCE, but the write side
> > > > > > > > should then at least be WRITE_ONCE because x86 has total store order.
> > > > > > > > 
> > > > > > > > It's definitely not enough on arm.
> > > > > > > > 
> > > > > > > > Fix this proplery, which means
> > > > > > > > - explain the need for the barrier in both places
> > > > > > > > - point at the other side in each comment
> > > > > > > > 
> > > > > > > > Also pull out the !sched_list case as the first check, so that the
> > > > > > > > code flow is clearer.
> > > > > > > > 
> > > > > > > > While at it sprinkle some comments around because it was very
> > > > > > > > non-obvious to me what's actually going on here and why.
> > > > > > > > 
> > > > > > > > Note that we really need full barriers here, at first I thought
> > > > > > > > store-release and load-acquire on ->last_scheduled would be enough,
> > > > > > > > but we actually requiring ordering between that and the queue state.
> > > > > > > > 
> > > > > > > > v2: Put smp_rmp() in the right place and fix up comment (Andrey)
> > > > > > > > 
> > > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > > > Cc: "Christian König" <christian.koenig@amd.com>
> > > > > > > > Cc: Steven Price <steven.price@arm.com>
> > > > > > > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > > > > > > Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > > > > > > Cc: Lee Jones <lee.jones@linaro.org>
> > > > > > > > Cc: Boris Brezillon <boris.brezillon@collabora.com>
> > > > > > > > ---
> > > > > > > >      drivers/gpu/drm/scheduler/sched_entity.c | 27 ++++++++++++++++++++++--
> > > > > > > >      1 file changed, 25 insertions(+), 2 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > index f7347c284886..89e3f6eaf519 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > @@ -439,8 +439,16 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> > > > > > > >                  dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
> > > > > > > > 
> > > > > > > >          dma_fence_put(entity->last_scheduled);
> > > > > > > > +
> > > > > > > >          entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
> > > > > > > > 
> > > > > > > > +     /*
> > > > > > > > +      * If the queue is empty we allow drm_sched_entity_select_rq() to
> > > > > > > > +      * locklessly access ->last_scheduled. This only works if we set the
> > > > > > > > +      * pointer before we dequeue and if we a write barrier here.
> > > > > > > > +      */
> > > > > > > > +     smp_wmb();
> > > > > > > > +
> > > > > > > Again, conceptual those barriers should be part of the spsc_queue
> > > > > > > container and not externally.
> > > > > > That would be extremely unusual api. Let's assume that your queue is
> > > > > > very dumb, and protected by a simple lock. That's about the maximum
> > > > > > any user could expect.
> > > > > > 
> > > > > > But then you still need barriers here, because linux locks (spinlock,
> > > > > > mutex) are defined to be one-way barriers: Stuff that's inside is
> > > > > > guaranteed to be done insinde, but stuff outside of the locked region
> > > > > > can leak in. They're load-acquire/store-release barriers. So not good
> > > > > > enough.
> > > > > > 
> > > > > > You really need to have barriers here, and they really all need to be
> > > > > > documented properly. And yes that's a shit-ton of work in drm/sched,
> > > > > > because it's full of yolo lockless stuff.
> > > > > > 
> > > > > > The other case you could make is that this works like a wakeup queue,
> > > > > > or similar. The rules there are:
> > > > > > - wake_up (i.e. pushing something into the queue) is a store-release barrier
> > > > > > - the waked up (i.e. popping an entry) is a load acquire barrier
> > > > > > Which is obviuosly needed because otherwise you don't have coherency
> > > > > > for the data queued up. And again not the barriers you're locking for
> > > > > > here.
> > > > > Exactly that was the idea, yes.
> > > > > 
> > > > > > Either way, we'd still need the comments, because it's still lockless
> > > > > > trickery, and every single one of that needs to have a comment on both
> > > > > > sides to explain what's going on.
> > > > > > 
> > > > > > Essentially replace spsc_queue with an llist underneath, and that's
> > > > > > the amount of barriers a data structure should provide. Anything else
> > > > > > is asking your datastructure to paper over bugs in your users.
> > > > > > 
> > > > > > This is similar to how atomic_t is by default completely unordered,
> > > > > > and users need to add barriers as needed, with comments.
> > > > > My main problem is as always that kernel atomics work different than
> > > > > userspace atomics.
> > > > > 
> > > > > > I think this is all to make sure people don't just write lockless algorithms
> > > > > > because it's a cool idea, but are forced to think this all through.
> > > > > > Which seems to not have happened very consistently for drm/sched, so I
> > > > > > guess needs to be fixed.
> > > > > Well at least initially that was all perfectly thought through. The
> > > > > problem is nobody is really maintaining that stuff.
> > > > > 
> > > > > > I'm definitely not going to hide all that by making the spsc_queue
> > > > > > stuff provide random unjustified barriers just because that would
> > > > > > paper over drm/sched bugs. We need to fix the actual bugs, and
> > > > > > preferrable all of them. I've found a few, but I wasn't involved in
> > > > > > drm/sched thus far, so best I can do is discover them as we go.
> > > > > I don't think that those are random unjustified barriers at all and it
> > > > > sounds like you didn't grip what I said here.
> > > > > 
> > > > > See the spsc queue must have the following semantics:
> > > > > 
> > > > > 1. When you pop a job all changes made before you push the job must be
> > > > > visible.
> > > > This is the standard barriers that also wake-up queues have, it's just
> > > > store-release+load-acquire.
> > > > 
> > > > > 2. When the queue becomes empty all the changes made before you pop the
> > > > > last job must be visible.
> > > > This is very much non-standard for a queue. I guess you could make
> > > > that part of the spsc_queue api between pop and is_empty (really we
> > > > shouldn't expose the _count() function for this), but that's all very
> > > > clever.
> > > > 
> > > > I think having explicit barriers in the code, with comments, is much
> > > > more robust. Because it forces you to think about all this, and
> > > > document it properly. Because there's also lockless stuff like
> > > > drm_sched.ready, which doesn't look at all like it's ordered somehow.
> > > 
> > > At least for amdgpu, after drm_sched_fini is called (setting sched.ready
> > > = false)
> > > we call amdgpu_fence_wait_empty to ensure all in progress jobs are done.
> > > Seems to me at least, this should guarantee that all in flight consumers
> > > of sched.ready (those who still see sched.ready == true) are finished while
> > > all later consumers will see sched.ready == fakle and will bail out.
> > > 
> > > On second thought there is a gap between checking for sched.ready and
> > > inserting
> > > the HW fence for the new job so this might still be a bug... Looks like
> > > we need to check for
> > > sched.ready after inserting the HW fence  and for this we will need
> > > barrier or locking.
> > Yeah, and at that point I think it's good to split up drm_sched.ready
> > from a new thing for when the hw died, like drm_sched.wedged or
> > .hw_death or similar, so that we can tell them apart. Trying to submit
> > a job to a non-ready scheduler is a driver bug and should WARN, while
> > submitting a job to a dead scheduler should probably result in -EIO
> > being returned to userspace (instead of the current -ENOENT, assuming
> > I haven't missed a errno remapping code somewhere in amdgpu).
> > 
> > Also, then you could do a drm_sched_die() or similar function which
> > combines setting the hw_died with the right barriers and cleaning up
> > all the jobs.
> > 
> > Wrt the fundamental race: I think that's not fixeable easily, so maybe
> > the scheduler thread also needs to handle this and immediately fail
> > these jobs by setting all fences to -EIO and completing them, without
> > even calling into the driver. If you try to catch this synchronously I
> > think it would require some kind of locking in push_job, plus failure
> > handling, which would be a) slow and b) real ugly in the driver code.
> > Just accepting that some jobs can slip through and letting the
> > scheduler thread clean them up is I think cleaner.
> 
> 
> I agree about moving this check to scheduler thread, I also not quite
> understand why in some places which are clearly post the job being
> pick-up by it's scheduler thread such as amdgpu_ib_schedule, still
> check for sched.ready... What's the point ? Also there are direct submission
> cases where IB insertion into HW ring is done without any scheduler
> involvement
> and even more in that case why we care that scheduler is not ready.

I think (but I haven't checked the code in full detail) that this is
because there's a mixup of what ->ready means:
- Setup/teardown ordering, where we sometimes try to submit stuff without
  the scheduler actually being ready yet (or maybe the hw isn't ready yet)
  and want to transparently fall back to something else.

- The actual "the hw died irrecoverably and reset couldn't resurrect it"
  case.

That's why I want to tear these two apart, so it's clear why we check
things. Also in general I think solving the former problem with checks
littered all over is bad style, but sometimes unavoidable (like when
you're deep in a callchain through ttm to evict buffers for suspend).
Usually it's better to order the code such that you never try to submit to
hw when it's not ready.
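
As a sketch, at submit time the two cases could then look roughly like
this (sched->wedged is a made-up field here, and where exactly such a
check would live is left open):

int drm_sched_job_check_sched(struct drm_sched_job *job)
{
	struct drm_gpu_scheduler *sched = job->sched;

	/* submitting before setup / after teardown is a driver bug */
	if (WARN_ON_ONCE(!sched->ready))
		return -EINVAL;

	/* hw death is a runtime condition, report it as -EIO */
	if (READ_ONCE(sched->wedged))
		return -EIO;

	return 0;
}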

Ofc the hw death is a different beast and can happen at any time,
hence needs to be treated differently - there are actual races
possible with that, whereas the code ordering issues around
suspend/resume and driver load/unload are all single-threaded and so
can't race. Ok, maybe hotunplug is more like hw death since it can
happen while we use it.
-Daniel

> 
> Andrey
> 
> 
> > 
> > If userspace then goes ahead and closes the ctx before all the jobs
> > are cleaned up we can handle that with the normal drm_sched_entity
> > cleanup logic. Which would be another reason to split normal cleanup
> > from hw death.
> > -Daniel
> > 
> > > Andrey
> > > 
> > > > E.g. there's also an rmb(); in drm_sched_entity_is_idle(), which
> > > > - probably should be an smp_rmb()
> > > > - really should document what it actually synchronizes against, and
> > > > the lack of an smp_wmb() somewhere else indicates it's probably
> > > > busted. You always need two barriers.
> > > > 
> > > > > Otherwise I completely agree with you that the whole scheduler doesn't
> > > > > work at all and we need to add tons of external barriers.
> > > > Imo that's what we need to do. And the most important part for
> > > > maintainability is to properly document thing with comments, and the
> > > > most important part in that comment is pointing at the other side of a
> > > > barrier (since a barrier on one side only orders nothing).
> > > > 
> > > > Also, on x86 almost nothing here matters, because both rmb() and wmb()
> > > > are no-op. Aside from the compiler barrier, which tends to not be the
> > > > biggest issue. Only mb() does anything, because x86 is only allowed to
> > > > reorder reads ahead of writes.
> > > > 
> > > > So in practice it's not quite as big a disaster, imo the big thing
> > > > here is maintainability of all these tricks just not being documented.
> > > > -Daniel
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > -Daniel
> > > > > > 
> > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > >          spsc_queue_pop(&entity->job_queue);
> > > > > > > >          return sched_job;
> > > > > > > >      }
> > > > > > > > @@ -459,10 +467,25 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > > > > > > >          struct drm_gpu_scheduler *sched;
> > > > > > > >          struct drm_sched_rq *rq;
> > > > > > > > 
> > > > > > > > -     if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
> > > > > > > > +     /* single possible engine and already selected */
> > > > > > > > +     if (!entity->sched_list)
> > > > > > > > +             return;
> > > > > > > > +
> > > > > > > > +     /* queue non-empty, stay on the same engine */
> > > > > > > > +     if (spsc_queue_count(&entity->job_queue))
> > > > > > > >                  return;
> > > > > > > > 
> > > > > > > > -     fence = READ_ONCE(entity->last_scheduled);
> > > > > > > > +     /*
> > > > > > > > +      * Only when the queue is empty are we guaranteed that the scheduler
> > > > > > > > +      * thread cannot change ->last_scheduled. To enforce ordering we need
> > > > > > > > +      * a read barrier here. See drm_sched_entity_pop_job() for the other
> > > > > > > > +      * side.
> > > > > > > > +      */
> > > > > > > > +     smp_rmb();
> > > > > > > > +
> > > > > > > > +     fence = entity->last_scheduled;
> > > > > > > > +
> > > > > > > > +     /* stay on the same engine if the previous job hasn't finished */
> > > > > > > >          if (fence && !dma_fence_is_signaled(fence))
> > > > > > > >                  return;
> > > > > > > > 
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch/
> > 
> >
diff mbox series

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index f7347c284886..89e3f6eaf519 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -439,8 +439,16 @@  struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 		dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
 
 	dma_fence_put(entity->last_scheduled);
+
 	entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
 
+	/*
+	 * If the queue is empty we allow drm_sched_entity_select_rq() to
+	 * locklessly access ->last_scheduled. This only works if we set the
+	 * pointer before we dequeue and if we have a write barrier here.
+	 */
+	smp_wmb();
+
 	spsc_queue_pop(&entity->job_queue);
 	return sched_job;
 }
@@ -459,10 +467,25 @@  void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_rq *rq;
 
-	if (spsc_queue_count(&entity->job_queue) || !entity->sched_list)
+	/* single possible engine and already selected */
+	if (!entity->sched_list)
+		return;
+
+	/* queue non-empty, stay on the same engine */
+	if (spsc_queue_count(&entity->job_queue))
 		return;
 
-	fence = READ_ONCE(entity->last_scheduled);
+	/*
+	 * Only when the queue is empty are we guaranteed that the scheduler
+	 * thread cannot change ->last_scheduled. To enforce ordering we need
+	 * a read barrier here. See drm_sched_entity_pop_job() for the other
+	 * side.
+	 */
+	smp_rmb();
+
+	fence = entity->last_scheduled;
+
+	/* stay on the same engine if the previous job hasn't finished */
 	if (fence && !dma_fence_is_signaled(fence))
 		return;