| Message ID | 20240913160559.49054-4-tursulin@igalia.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | DRM scheduler fixes and improvements |
On 13/09/2024 17:05, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Entities run queue can change during drm_sched_entity_push_job() so make
> sure to update the score consistently.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Fixes: d41a39dda140 ("drm/scheduler: improve job distribution with multiple queues")
> Cc: Nirmoy Das <nirmoy.das@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Luben Tuikov <ltuikov89@gmail.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> Cc: <stable@vger.kernel.org> # v5.9+
> Reviewed-by: Christian König <christian.koenig@amd.com>
> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 76e422548d40..6645a8524699 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -586,7 +586,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  	ktime_t submit_ts;
>
>  	trace_drm_sched_job(sched_job, entity);
> -	atomic_inc(entity->rq->sched->score);
>  	WRITE_ONCE(entity->last_user, current->group_leader);
>
>  	/*
> @@ -614,6 +613,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		rq = entity->rq;
>  		sched = rq->sched;
>
> +		atomic_inc(sched->score);

Ugh, this is wrong. :(

I was working on some further consolidation and realised this.

It will create an imbalance in the score, since the score is currently supposed to be accounted twice:

1. +/- 1 for each entity (de-)queued
2. +/- 1 for each job queued/completed

By moving the increment into the "if (first)" branch it unbalances the two.

But it is still true that the original placement is racy. It looks like what is required is an unconditional entity->lock section after spsc_queue_push. AFAICT that is the only way to be sure entity->rq is set for the submission at hand.

The question also is: why +/- the score on entity add/remove, and not just for jobs?

In the meantime the patch will need to be reverted.

Regards,

Tvrtko

> 		drm_sched_rq_add_entity(rq, entity);
> 		spin_unlock(&entity->rq_lock);
Am 30.09.24 um 15:01 schrieb Tvrtko Ursulin:
>
> On 13/09/2024 17:05, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> Entities run queue can change during drm_sched_entity_push_job() so make
>> sure to update the score consistently.
>>
>> [snip]
>>
>> @@ -614,6 +613,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>  		rq = entity->rq;
>>  		sched = rq->sched;
>>
>> +		atomic_inc(sched->score);
>
> Ugh this is wrong. :(
>
> I was working on some further consolidation and realised this.
>
> It will create an imbalance in score since score is currently supposed
> to be accounted twice:
>
> 1. +/- 1 for each entity (de-)queued
> 2. +/- 1 for each job queued/completed
>
> By moving it into the "if (first)" branch it unbalances it.
>
> But it is still true the original placement is racy. It looks like
> what is required is an unconditional entity->lock section after
> spsc_queue_push. AFAICT that's the only way to be sure entity->rq is
> set for the submission at hand.
>
> Question also is, why +/- score in entity add/remove and not just for
> jobs?
>
> In the meantime patch will need to get reverted.

Ok, going to revert that.

I also just realized that we don't need to change anything. The rq can't change as soon as there is a job armed for it.

So having the increment right before pushing the armed job to the entity was actually correct in the first place.

Regards,
Christian.

> Regards,
>
> Tvrtko
>
>> 		drm_sched_rq_add_entity(rq, entity);
>> 		spin_unlock(&entity->rq_lock);
On 30/09/2024 14:07, Christian König wrote:
> Am 30.09.24 um 15:01 schrieb Tvrtko Ursulin:
>> On 13/09/2024 17:05, Tvrtko Ursulin wrote:
>>> [snip]
>>
>> Ugh this is wrong. :(
>>
>> [snip]
>>
>> In the meantime patch will need to get reverted.
>
> Ok going to revert that.

Thank you, and sorry for the trouble!

> I also just realized that we don't need to change anything. The rq can't
> change as soon as there is a job armed for it.
>
> So having the increment right before pushing the armed job to the entity
> was actually correct in the first place.

Are you sure? Consider two threads racing to arm and push on the same entity:

  T1                                      T2

  arm job
    rq1 selected
  ..
  push job                                arm job
    inc score on rq1                        spsc_queue_count check passes
           --- just before T1 spsc_queue_push ---
                                            entity->rq changed to rq2
    spsc_queue_push
    if (first)
      resamples entity->rq
      queues work on rq2

Where rq1 and rq2 belong to different schedulers.

Regards,

Tvrtko

> Regards,
> Christian.
>
>> 		drm_sched_rq_add_entity(rq, entity);
>> 		spin_unlock(&entity->rq_lock);
Am 30.09.24 um 15:22 schrieb Tvrtko Ursulin:
>
> On 30/09/2024 14:07, Christian König wrote:
>> [snip]
>>
>> I also just realized that we don't need to change anything. The rq
>> can't change as soon as there is a job armed for it.
>>
>> So having the increment right before pushing the armed job to the
>> entity was actually correct in the first place.
>
> Are you sure? Consider two threads racing to arm and push on the same entity:
>
>   T1                                      T2
>
>   arm job
>     rq1 selected
>   ..
>   push job                                arm job
>     inc score on rq1                        spsc_queue_count check passes
>            --- just before T1 spsc_queue_push ---
>                                             entity->rq changed to rq2
>     spsc_queue_push
>     if (first)
>       resamples entity->rq
>       queues work on rq2
>
> Where rq1 and rq2 belong to different schedulers.

arm/push must be protected by an external lock which prevents two threads from pushing into the same entity at the same time. That is what this misleading comment from Sima, which we already discussed, should have meant.

Regards,
Christian.

> Regards,
>
> Tvrtko
>
>>> 		drm_sched_rq_add_entity(rq, entity);
>>> 		spin_unlock(&entity->rq_lock);
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 76e422548d40..6645a8524699 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -586,7 +586,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 	ktime_t submit_ts;

 	trace_drm_sched_job(sched_job, entity);
-	atomic_inc(entity->rq->sched->score);
 	WRITE_ONCE(entity->last_user, current->group_leader);

 	/*
@@ -614,6 +613,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		rq = entity->rq;
 		sched = rq->sched;

+		atomic_inc(sched->score);
 		drm_sched_rq_add_entity(rq, entity);
 		spin_unlock(&entity->rq_lock);