[v5,02/20] drm/msm: Fix drm/sched point of no return rules

Message ID	20210805104705.862416-3-daniel.vetter@ffwll.ch (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-media-owner@kernel.org> From: Daniel Vetter <daniel.vetter@ffwll.ch> To: DRI Development <dri-devel@lists.freedesktop.org> Cc: Intel Graphics Development <intel-gfx@lists.freedesktop.org>, Daniel Vetter <daniel.vetter@ffwll.ch>, Rob Clark <robdclark@chromium.org>, Rob Clark <robdclark@gmail.com>, Sean Paul <sean@poorly.run>, Sumit Semwal <sumit.semwal@linaro.org>, =?utf-8?q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>, linux-arm-msm@vger.kernel.org, freedreno@lists.freedesktop.org, linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org, Daniel Vetter <daniel.vetter@intel.com> Subject: [PATCH v5 02/20] drm/msm: Fix drm/sched point of no return rules Date: Thu, 5 Aug 2021 12:46:47 +0200 Message-Id: <20210805104705.862416-3-daniel.vetter@ffwll.ch> In-Reply-To: <20210805104705.862416-1-daniel.vetter@ffwll.ch> References: <20210805104705.862416-1-daniel.vetter@ffwll.ch> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v5,01/20] drm/sched: Split drm_sched_job_init \| expand [v5,01/20] drm/sched: Split drm_sched_job_init [v5,02/20] drm/msm: Fix drm/sched point of no return rules [v5,04/20] drm/sched: Add dependency tracking [v5,05/20] drm/sched: drop entity parameter from drm_sched_push_job [v5,07/20] drm/panfrost: use scheduler dependency tracking [v5,08/20] drm/lima: use scheduler dependency tracking [v5,11/20] drm/etnaviv: Use scheduler dependency handling [v5,12/20] drm/msm: Use scheduler dependency handling [v5,13/20] drm/gem: Delete gem array fencing helpers [v5,20/20] dma-resv: Give the docs a do-over

Daniel Vetter Aug. 5, 2021, 10:46 a.m. UTC

Originally drm_sched_job_init was the point of no return, after which
drivers must submit a job. I've split that up, which allows us to fix
this issue pretty easily.

Only thing we have to take care of is to not skip to error paths after
that. Other drivers do this the same for out-fence and similar things.

Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
Cc: Rob Clark <robdclark@chromium.org>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: linux-arm-msm@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: freedreno@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

Rob Clark Aug. 5, 2021, 11:02 p.m. UTC | #1

On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Originally drm_sched_job_init was the point of no return, after which
> drivers must submit a job. I've split that up, which allows us to fix
> this issue pretty easily.
>
> Only thing we have to take care of is to not skip to error paths after
> that. Other drivers do this the same for out-fence and similar things.
>
> Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Rob Clark <robdclark@gmail.com>
> Cc: Sean Paul <sean@poorly.run>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: linux-arm-msm@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: freedreno@lists.freedesktop.org
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 6d6c44f0e1f3..d0ed4ddc509e 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>                 return ERR_PTR(ret);
>         }
>
> -       /* FIXME: this is way too early */
> -       drm_sched_job_arm(&job->base);
> -
>         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
>
>         kref_init(&submit->ref);
> @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>
>         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
>
> +       /* point of no return, we _have_ to submit no matter what */
> +       drm_sched_job_arm(&submit->base);
> +
>         /*
>          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
>          * to the underlying fence.
> @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>         if (submit->fence_id < 0) {
>                 ret = submit->fence_id = 0;
>                 submit->fence_id = 0;
> -               goto out;
>         }
>
> -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
>                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
>                 if (!sync_file) {
>                         ret = -ENOMEM;
> -                       goto out;
> +               } else {
> +                       fd_install(out_fence_fd, sync_file->file);
> +                       args->fence_fd = out_fence_fd;
>                 }
> -               fd_install(out_fence_fd, sync_file->file);
> -               args->fence_fd = out_fence_fd;

I wonder if instead we should (approximately) undo "drm/msm/submit:
Simplify out-fence-fd handling" so that the point that it could fail
is moved up ahead of the drm_sched_job_arm()?

Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
a quick look, it looks like it won't, but I'm still playing catchup
and haven't had a chance to look at your entire series.  If it doesn't
work before drm_sched_job_arm(), then there is really no way to
prevent a error path between the fence-init and job-submit.

But, prior to your series, wouldn't a failure after
drm_sched_job_init() but before the job is submitted just burn a
fence-id, and otherwise carry on it's merry way?

BR,
-R

>         }
>
>         submit_attach_object_fences(submit);
> --
> 2.32.0
>

Daniel Vetter Aug. 6, 2021, 4:41 p.m. UTC | #2

On Fri, Aug 6, 2021 at 12:58 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > Originally drm_sched_job_init was the point of no return, after which
> > drivers must submit a job. I've split that up, which allows us to fix
> > this issue pretty easily.
> >
> > Only thing we have to take care of is to not skip to error paths after
> > that. Other drivers do this the same for out-fence and similar things.
> >
> > Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Rob Clark <robdclark@gmail.com>
> > Cc: Sean Paul <sean@poorly.run>
> > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > Cc: "Christian König" <christian.koenig@amd.com>
> > Cc: linux-arm-msm@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: freedreno@lists.freedesktop.org
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > index 6d6c44f0e1f3..d0ed4ddc509e 100644
> > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >                 return ERR_PTR(ret);
> >         }
> >
> > -       /* FIXME: this is way too early */
> > -       drm_sched_job_arm(&job->base);
> > -
> >         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
> >
> >         kref_init(&submit->ref);
> > @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >
> >         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
> >
> > +       /* point of no return, we _have_ to submit no matter what */
> > +       drm_sched_job_arm(&submit->base);
> > +
> >         /*
> >          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> >          * to the underlying fence.
> > @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >         if (submit->fence_id < 0) {
> >                 ret = submit->fence_id = 0;
> >                 submit->fence_id = 0;
> > -               goto out;
> >         }
> >
> > -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> >                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
> >                 if (!sync_file) {
> >                         ret = -ENOMEM;
> > -                       goto out;
> > +               } else {
> > +                       fd_install(out_fence_fd, sync_file->file);
> > +                       args->fence_fd = out_fence_fd;
> >                 }
> > -               fd_install(out_fence_fd, sync_file->file);
> > -               args->fence_fd = out_fence_fd;
>
> I wonder if instead we should (approximately) undo "drm/msm/submit:
> Simplify out-fence-fd handling" so that the point that it could fail
> is moved up ahead of the drm_sched_job_arm()?

Hm yeah. Up to you how you want to paint this shed, I think either is fine.

> Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
> a quick look, it looks like it won't, but I'm still playing catchup
> and haven't had a chance to look at your entire series.  If it doesn't
> work before drm_sched_job_arm(), then there is really no way to
> prevent a error path between the fence-init and job-submit.

Yes. I thought I've checked that I put the _arm() in the right spot,
but I guess I screwed up and you need the fence before the point where
I've put the job_arm()? And yes the error path cannot be avoided for
out-fences, that's what I tried to explain in the commit message.

> But, prior to your series, wouldn't a failure after
> drm_sched_job_init() but before the job is submitted just burn a
> fence-id, and otherwise carry on it's merry way?

Maybe? I'm not sure whether the scheduler gets confused about the gap
and freak out abou that. I'm fairly new to that code and learning
(which is part why I'm working on it). Since you look up in
fences/syncobj after job_init() it should be pretty easy to whip up a
testcase and see what happens. Also as long as nothing fails you won't
see an issue, that's for sure.
-Daniel

> BR,
> -R
>
> >         }
> >
> >         submit_attach_object_fences(submit);
> > --
> > 2.32.0
> >

Rob Clark Aug. 6, 2021, 5:19 p.m. UTC | #3

On Fri, Aug 6, 2021 at 9:42 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Aug 6, 2021 at 12:58 AM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > Originally drm_sched_job_init was the point of no return, after which
> > > drivers must submit a job. I've split that up, which allows us to fix
> > > this issue pretty easily.
> > >
> > > Only thing we have to take care of is to not skip to error paths after
> > > that. Other drivers do this the same for out-fence and similar things.
> > >
> > > Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> > > Cc: Rob Clark <robdclark@chromium.org>
> > > Cc: Rob Clark <robdclark@gmail.com>
> > > Cc: Sean Paul <sean@poorly.run>
> > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > Cc: "Christian König" <christian.koenig@amd.com>
> > > Cc: linux-arm-msm@vger.kernel.org
> > > Cc: dri-devel@lists.freedesktop.org
> > > Cc: freedreno@lists.freedesktop.org
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
> > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > index 6d6c44f0e1f3..d0ed4ddc509e 100644
> > > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> > >                 return ERR_PTR(ret);
> > >         }
> > >
> > > -       /* FIXME: this is way too early */
> > > -       drm_sched_job_arm(&job->base);
> > > -
> > >         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
> > >
> > >         kref_init(&submit->ref);
> > > @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > >
> > >         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
> > >
> > > +       /* point of no return, we _have_ to submit no matter what */
> > > +       drm_sched_job_arm(&submit->base);
> > > +
> > >         /*
> > >          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> > >          * to the underlying fence.
> > > @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > >         if (submit->fence_id < 0) {
> > >                 ret = submit->fence_id = 0;
> > >                 submit->fence_id = 0;
> > > -               goto out;
> > >         }
> > >
> > > -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > >                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
> > >                 if (!sync_file) {
> > >                         ret = -ENOMEM;
> > > -                       goto out;
> > > +               } else {
> > > +                       fd_install(out_fence_fd, sync_file->file);
> > > +                       args->fence_fd = out_fence_fd;
> > >                 }
> > > -               fd_install(out_fence_fd, sync_file->file);
> > > -               args->fence_fd = out_fence_fd;
> >
> > I wonder if instead we should (approximately) undo "drm/msm/submit:
> > Simplify out-fence-fd handling" so that the point that it could fail
> > is moved up ahead of the drm_sched_job_arm()?
>
> Hm yeah. Up to you how you want to paint this shed, I think either is fine.
>
> > Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
> > a quick look, it looks like it won't, but I'm still playing catchup
> > and haven't had a chance to look at your entire series.  If it doesn't
> > work before drm_sched_job_arm(), then there is really no way to
> > prevent a error path between the fence-init and job-submit.
>
> Yes. I thought I've checked that I put the _arm() in the right spot,
> but I guess I screwed up and you need the fence before the point where
> I've put the job_arm()? And yes the error path cannot be avoided for
> out-fences, that's what I tried to explain in the commit message.
>
> > But, prior to your series, wouldn't a failure after
> > drm_sched_job_init() but before the job is submitted just burn a
> > fence-id, and otherwise carry on it's merry way?
>
> Maybe? I'm not sure whether the scheduler gets confused about the gap
> and freak out abou that. I'm fairly new to that code and learning
> (which is part why I'm working on it). Since you look up in
> fences/syncobj after job_init() it should be pretty easy to whip up a
> testcase and see what happens. Also as long as nothing fails you won't
> see an issue, that's for sure.

fair.. I'll try to come up with a test case.. pre-scheduler-conversion
it wasn't a problem to fail after the fence seqno was allocated (well,
I guess you might have problems if you had 2^31 failures in a row)

BR,
-R

> -Daniel
>
> > BR,
> > -R
> >
> > >         }
> > >
> > >         submit_attach_object_fences(submit);
> > > --
> > > 2.32.0
> > >
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Aug. 6, 2021, 6:41 p.m. UTC | #4

On Fri, Aug 6, 2021 at 7:15 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Aug 6, 2021 at 9:42 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Fri, Aug 6, 2021 at 12:58 AM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > Originally drm_sched_job_init was the point of no return, after which
> > > > drivers must submit a job. I've split that up, which allows us to fix
> > > > this issue pretty easily.
> > > >
> > > > Only thing we have to take care of is to not skip to error paths after
> > > > that. Other drivers do this the same for out-fence and similar things.
> > > >
> > > > Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > Cc: Rob Clark <robdclark@gmail.com>
> > > > Cc: Sean Paul <sean@poorly.run>
> > > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > > Cc: "Christian König" <christian.koenig@amd.com>
> > > > Cc: linux-arm-msm@vger.kernel.org
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: freedreno@lists.freedesktop.org
> > > > Cc: linux-media@vger.kernel.org
> > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
> > > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > index 6d6c44f0e1f3..d0ed4ddc509e 100644
> > > > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> > > >                 return ERR_PTR(ret);
> > > >         }
> > > >
> > > > -       /* FIXME: this is way too early */
> > > > -       drm_sched_job_arm(&job->base);
> > > > -
> > > >         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
> > > >
> > > >         kref_init(&submit->ref);
> > > > @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > >
> > > >         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
> > > >
> > > > +       /* point of no return, we _have_ to submit no matter what */
> > > > +       drm_sched_job_arm(&submit->base);
> > > > +
> > > >         /*
> > > >          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> > > >          * to the underlying fence.
> > > > @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > >         if (submit->fence_id < 0) {
> > > >                 ret = submit->fence_id = 0;
> > > >                 submit->fence_id = 0;
> > > > -               goto out;
> > > >         }
> > > >
> > > > -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > >                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
> > > >                 if (!sync_file) {
> > > >                         ret = -ENOMEM;
> > > > -                       goto out;
> > > > +               } else {
> > > > +                       fd_install(out_fence_fd, sync_file->file);
> > > > +                       args->fence_fd = out_fence_fd;
> > > >                 }
> > > > -               fd_install(out_fence_fd, sync_file->file);
> > > > -               args->fence_fd = out_fence_fd;
> > >
> > > I wonder if instead we should (approximately) undo "drm/msm/submit:
> > > Simplify out-fence-fd handling" so that the point that it could fail
> > > is moved up ahead of the drm_sched_job_arm()?
> >
> > Hm yeah. Up to you how you want to paint this shed, I think either is fine.
> >
> > > Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
> > > a quick look, it looks like it won't, but I'm still playing catchup
> > > and haven't had a chance to look at your entire series.  If it doesn't
> > > work before drm_sched_job_arm(), then there is really no way to
> > > prevent a error path between the fence-init and job-submit.
> >
> > Yes. I thought I've checked that I put the _arm() in the right spot,
> > but I guess I screwed up and you need the fence before the point where
> > I've put the job_arm()? And yes the error path cannot be avoided for
> > out-fences, that's what I tried to explain in the commit message.
> >
> > > But, prior to your series, wouldn't a failure after
> > > drm_sched_job_init() but before the job is submitted just burn a
> > > fence-id, and otherwise carry on it's merry way?
> >
> > Maybe? I'm not sure whether the scheduler gets confused about the gap
> > and freak out abou that. I'm fairly new to that code and learning
> > (which is part why I'm working on it). Since you look up in
> > fences/syncobj after job_init() it should be pretty easy to whip up a
> > testcase and see what happens. Also as long as nothing fails you won't
> > see an issue, that's for sure.
>
> fair.. I'll try to come up with a test case.. pre-scheduler-conversion
> it wasn't a problem to fail after the fence seqno was allocated (well,
> I guess you might have problems if you had 2^31 failures in a row)

Yeah one thing drm/sched forces you to do is have a very clear notion
about the point of no return in your submit ioctl. Which I think is a
Very Good Thing, at least looking at i915 execbuf where the point of
no return is a multi-stage thing with such interesting intermediate
points like "we submit the ruquest but without actually running the
batchbuffer". The downside is that the submit ioctl isn't perfectly
transaction anymore, but I don't think that matters for tha tail
stuff, which is generally just some out-fence installing. That
generally never fails.
-Daniel

>
> BR,
> -R
>
> > -Daniel
> >
> > > BR,
> > > -R
> > >
> > > >         }
> > > >
> > > >         submit_attach_object_fences(submit);
> > > > --
> > > > 2.32.0
> > > >
> >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Rob Clark Aug. 6, 2021, 7:01 p.m. UTC | #5

On Fri, Aug 6, 2021 at 11:41 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Aug 6, 2021 at 7:15 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Aug 6, 2021 at 9:42 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Fri, Aug 6, 2021 at 12:58 AM Rob Clark <robdclark@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > >
> > > > > Originally drm_sched_job_init was the point of no return, after which
> > > > > drivers must submit a job. I've split that up, which allows us to fix
> > > > > this issue pretty easily.
> > > > >
> > > > > Only thing we have to take care of is to not skip to error paths after
> > > > > that. Other drivers do this the same for out-fence and similar things.
> > > > >
> > > > > Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > Cc: Rob Clark <robdclark@gmail.com>
> > > > > Cc: Sean Paul <sean@poorly.run>
> > > > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > > > Cc: "Christian König" <christian.koenig@amd.com>
> > > > > Cc: linux-arm-msm@vger.kernel.org
> > > > > Cc: dri-devel@lists.freedesktop.org
> > > > > Cc: freedreno@lists.freedesktop.org
> > > > > Cc: linux-media@vger.kernel.org
> > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
> > > > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > index 6d6c44f0e1f3..d0ed4ddc509e 100644
> > > > > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> > > > >                 return ERR_PTR(ret);
> > > > >         }
> > > > >
> > > > > -       /* FIXME: this is way too early */
> > > > > -       drm_sched_job_arm(&job->base);
> > > > > -
> > > > >         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
> > > > >
> > > > >         kref_init(&submit->ref);
> > > > > @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > > >
> > > > >         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
> > > > >
> > > > > +       /* point of no return, we _have_ to submit no matter what */
> > > > > +       drm_sched_job_arm(&submit->base);
> > > > > +
> > > > >         /*
> > > > >          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> > > > >          * to the underlying fence.
> > > > > @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > > >         if (submit->fence_id < 0) {
> > > > >                 ret = submit->fence_id = 0;
> > > > >                 submit->fence_id = 0;
> > > > > -               goto out;
> > > > >         }
> > > > >
> > > > > -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > > +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > >                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
> > > > >                 if (!sync_file) {
> > > > >                         ret = -ENOMEM;
> > > > > -                       goto out;
> > > > > +               } else {
> > > > > +                       fd_install(out_fence_fd, sync_file->file);
> > > > > +                       args->fence_fd = out_fence_fd;
> > > > >                 }
> > > > > -               fd_install(out_fence_fd, sync_file->file);
> > > > > -               args->fence_fd = out_fence_fd;
> > > >
> > > > I wonder if instead we should (approximately) undo "drm/msm/submit:
> > > > Simplify out-fence-fd handling" so that the point that it could fail
> > > > is moved up ahead of the drm_sched_job_arm()?
> > >
> > > Hm yeah. Up to you how you want to paint this shed, I think either is fine.
> > >
> > > > Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
> > > > a quick look, it looks like it won't, but I'm still playing catchup
> > > > and haven't had a chance to look at your entire series.  If it doesn't
> > > > work before drm_sched_job_arm(), then there is really no way to
> > > > prevent a error path between the fence-init and job-submit.
> > >
> > > Yes. I thought I've checked that I put the _arm() in the right spot,
> > > but I guess I screwed up and you need the fence before the point where
> > > I've put the job_arm()? And yes the error path cannot be avoided for
> > > out-fences, that's what I tried to explain in the commit message.
> > >
> > > > But, prior to your series, wouldn't a failure after
> > > > drm_sched_job_init() but before the job is submitted just burn a
> > > > fence-id, and otherwise carry on it's merry way?
> > >
> > > Maybe? I'm not sure whether the scheduler gets confused about the gap
> > > and freak out abou that. I'm fairly new to that code and learning
> > > (which is part why I'm working on it). Since you look up in
> > > fences/syncobj after job_init() it should be pretty easy to whip up a
> > > testcase and see what happens. Also as long as nothing fails you won't
> > > see an issue, that's for sure.
> >
> > fair.. I'll try to come up with a test case.. pre-scheduler-conversion
> > it wasn't a problem to fail after the fence seqno was allocated (well,
> > I guess you might have problems if you had 2^31 failures in a row)
>
> Yeah one thing drm/sched forces you to do is have a very clear notion
> about the point of no return in your submit ioctl. Which I think is a
> Very Good Thing, at least looking at i915 execbuf where the point of
> no return is a multi-stage thing with such interesting intermediate
> points like "we submit the ruquest but without actually running the
> batchbuffer". The downside is that the submit ioctl isn't perfectly
> transaction anymore, but I don't think that matters for tha tail
> stuff, which is generally just some out-fence installing. That
> generally never fails.

So I hacked up:

------
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
b/drivers/gpu/drm/scheduler/sched_fence.c
index 3aa6351d2101..88e66dbc9515 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -176,6 +176,7 @@ struct drm_sched_fence
*drm_sched_fence_create(struct drm_sched_entity *entity,
        fence->sched = entity->rq->sched;
        spin_lock_init(&fence->lock);

+       seq = atomic_inc_return(&entity->fence_seq);
        seq = atomic_inc_return(&entity->fence_seq);
        dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
                       &fence->lock, entity->fence_context, seq);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index fcc601962e92..583e85adbbe0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -593,6 +593,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
        if (!job->s_fence)
                return -ENOMEM;
        job->id = atomic64_inc_return(&sched->job_id_count);
+       job->id = atomic64_inc_return(&sched->job_id_count);

        INIT_LIST_HEAD(&job->list);

------

(I guess the job->id part shouldn't really be needed, that looks like
it is only used by amdgpu)

This didn't cause any problems that I could see.  So I don't *think* a
failure after drm_sched_job_init() is really problematic, as long as
things are serialized between drm_sched_job_init() and
drm_sched_entity_push_job().

I also noticed that in the atomic commit path, the out-fences are
initialized before atomic-check.. so there should be plenty of
precedent for skipping fence seqno's.

BR,
-R

Daniel Vetter Aug. 6, 2021, 7:10 p.m. UTC | #6

On Fri, Aug 6, 2021 at 8:57 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Aug 6, 2021 at 11:41 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Fri, Aug 6, 2021 at 7:15 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Aug 6, 2021 at 9:42 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > On Fri, Aug 6, 2021 at 12:58 AM Rob Clark <robdclark@gmail.com> wrote:
> > > > >
> > > > > On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > > >
> > > > > > Originally drm_sched_job_init was the point of no return, after which
> > > > > > drivers must submit a job. I've split that up, which allows us to fix
> > > > > > this issue pretty easily.
> > > > > >
> > > > > > Only thing we have to take care of is to not skip to error paths after
> > > > > > that. Other drivers do this the same for out-fence and similar things.
> > > > > >
> > > > > > Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> > > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > > Cc: Rob Clark <robdclark@gmail.com>
> > > > > > Cc: Sean Paul <sean@poorly.run>
> > > > > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > > > > Cc: "Christian König" <christian.koenig@amd.com>
> > > > > > Cc: linux-arm-msm@vger.kernel.org
> > > > > > Cc: dri-devel@lists.freedesktop.org
> > > > > > Cc: freedreno@lists.freedesktop.org
> > > > > > Cc: linux-media@vger.kernel.org
> > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
> > > > > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > > index 6d6c44f0e1f3..d0ed4ddc509e 100644
> > > > > > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > > @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> > > > > >                 return ERR_PTR(ret);
> > > > > >         }
> > > > > >
> > > > > > -       /* FIXME: this is way too early */
> > > > > > -       drm_sched_job_arm(&job->base);
> > > > > > -
> > > > > >         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
> > > > > >
> > > > > >         kref_init(&submit->ref);
> > > > > > @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > > > >
> > > > > >         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
> > > > > >
> > > > > > +       /* point of no return, we _have_ to submit no matter what */
> > > > > > +       drm_sched_job_arm(&submit->base);
> > > > > > +
> > > > > >         /*
> > > > > >          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> > > > > >          * to the underlying fence.
> > > > > > @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > > > >         if (submit->fence_id < 0) {
> > > > > >                 ret = submit->fence_id = 0;
> > > > > >                 submit->fence_id = 0;
> > > > > > -               goto out;
> > > > > >         }
> > > > > >
> > > > > > -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > > > +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > > >                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
> > > > > >                 if (!sync_file) {
> > > > > >                         ret = -ENOMEM;
> > > > > > -                       goto out;
> > > > > > +               } else {
> > > > > > +                       fd_install(out_fence_fd, sync_file->file);
> > > > > > +                       args->fence_fd = out_fence_fd;
> > > > > >                 }
> > > > > > -               fd_install(out_fence_fd, sync_file->file);
> > > > > > -               args->fence_fd = out_fence_fd;
> > > > >
> > > > > I wonder if instead we should (approximately) undo "drm/msm/submit:
> > > > > Simplify out-fence-fd handling" so that the point that it could fail
> > > > > is moved up ahead of the drm_sched_job_arm()?
> > > >
> > > > Hm yeah. Up to you how you want to paint this shed, I think either is fine.
> > > >
> > > > > Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
> > > > > a quick look, it looks like it won't, but I'm still playing catchup
> > > > > and haven't had a chance to look at your entire series.  If it doesn't
> > > > > work before drm_sched_job_arm(), then there is really no way to
> > > > > prevent a error path between the fence-init and job-submit.
> > > >
> > > > Yes. I thought I've checked that I put the _arm() in the right spot,
> > > > but I guess I screwed up and you need the fence before the point where
> > > > I've put the job_arm()? And yes the error path cannot be avoided for
> > > > out-fences, that's what I tried to explain in the commit message.
> > > >
> > > > > But, prior to your series, wouldn't a failure after
> > > > > drm_sched_job_init() but before the job is submitted just burn a
> > > > > fence-id, and otherwise carry on it's merry way?
> > > >
> > > > Maybe? I'm not sure whether the scheduler gets confused about the gap
> > > > and freak out abou that. I'm fairly new to that code and learning
> > > > (which is part why I'm working on it). Since you look up in
> > > > fences/syncobj after job_init() it should be pretty easy to whip up a
> > > > testcase and see what happens. Also as long as nothing fails you won't
> > > > see an issue, that's for sure.
> > >
> > > fair.. I'll try to come up with a test case.. pre-scheduler-conversion
> > > it wasn't a problem to fail after the fence seqno was allocated (well,
> > > I guess you might have problems if you had 2^31 failures in a row)
> >
> > Yeah one thing drm/sched forces you to do is have a very clear notion
> > about the point of no return in your submit ioctl. Which I think is a
> > Very Good Thing, at least looking at i915 execbuf where the point of
> > no return is a multi-stage thing with such interesting intermediate
> > points like "we submit the ruquest but without actually running the
> > batchbuffer". The downside is that the submit ioctl isn't perfectly
> > transaction anymore, but I don't think that matters for tha tail
> > stuff, which is generally just some out-fence installing. That
> > generally never fails.
>
> So I hacked up:
>
> ------
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
> b/drivers/gpu/drm/scheduler/sched_fence.c
> index 3aa6351d2101..88e66dbc9515 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -176,6 +176,7 @@ struct drm_sched_fence
> *drm_sched_fence_create(struct drm_sched_entity *entity,
>         fence->sched = entity->rq->sched;
>         spin_lock_init(&fence->lock);
>
> +       seq = atomic_inc_return(&entity->fence_seq);
>         seq = atomic_inc_return(&entity->fence_seq);
>         dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>                        &fence->lock, entity->fence_context, seq);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index fcc601962e92..583e85adbbe0 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -593,6 +593,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>         if (!job->s_fence)
>                 return -ENOMEM;
>         job->id = atomic64_inc_return(&sched->job_id_count);
> +       job->id = atomic64_inc_return(&sched->job_id_count);
>
>         INIT_LIST_HEAD(&job->list);
>
> ------
>
> (I guess the job->id part shouldn't really be needed, that looks like
> it is only used by amdgpu)
>
> This didn't cause any problems that I could see.  So I don't *think* a
> failure after drm_sched_job_init() is really problematic, as long as
> things are serialized between drm_sched_job_init() and
> drm_sched_entity_push_job().
>
> I also noticed that in the atomic commit path, the out-fences are
> initialized before atomic-check.. so there should be plenty of
> precedent for skipping fence seqno's.

Oh I think I remember now. The reason why the split into init/arm is
so that you can keep your critical section only around job_arm() and
push_job(). My very first version just pulled the jobs_init() of that
for most drivers to where I needed it, and that would result in a bit
chaos because the fences would signal out of order potentially. But
yeah I guess bailing out is fine with the scheduler.

Do you want me to tune down the commit message a bit, it's not a must
to submit the job, but just makes a bit more sense than bailing out
with a fence seqno reserved?
-Daniel

Rob Clark Aug. 6, 2021, 7:59 p.m. UTC | #7

On Fri, Aug 6, 2021 at 12:11 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Aug 6, 2021 at 8:57 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Aug 6, 2021 at 11:41 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Fri, Aug 6, 2021 at 7:15 PM Rob Clark <robdclark@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 6, 2021 at 9:42 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > >
> > > > > On Fri, Aug 6, 2021 at 12:58 AM Rob Clark <robdclark@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 5, 2021 at 3:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > > > >
> > > > > > > Originally drm_sched_job_init was the point of no return, after which
> > > > > > > drivers must submit a job. I've split that up, which allows us to fix
> > > > > > > this issue pretty easily.
> > > > > > >
> > > > > > > Only thing we have to take care of is to not skip to error paths after
> > > > > > > that. Other drivers do this the same for out-fence and similar things.
> > > > > > >
> > > > > > > Fixes: 1d8a5ca436ee ("drm/msm: Conversion to drm scheduler")
> > > > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > > > Cc: Rob Clark <robdclark@gmail.com>
> > > > > > > Cc: Sean Paul <sean@poorly.run>
> > > > > > > Cc: Sumit Semwal <sumit.semwal@linaro.org>
> > > > > > > Cc: "Christian König" <christian.koenig@amd.com>
> > > > > > > Cc: linux-arm-msm@vger.kernel.org
> > > > > > > Cc: dri-devel@lists.freedesktop.org
> > > > > > > Cc: freedreno@lists.freedesktop.org
> > > > > > > Cc: linux-media@vger.kernel.org
> > > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/msm/msm_gem_submit.c | 15 +++++++--------
> > > > > > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > > > index 6d6c44f0e1f3..d0ed4ddc509e 100644
> > > > > > > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > > > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > > > > > > @@ -52,9 +52,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> > > > > > >                 return ERR_PTR(ret);
> > > > > > >         }
> > > > > > >
> > > > > > > -       /* FIXME: this is way too early */
> > > > > > > -       drm_sched_job_arm(&job->base);
> > > > > > > -
> > > > > > >         xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
> > > > > > >
> > > > > > >         kref_init(&submit->ref);
> > > > > > > @@ -883,6 +880,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > > > > >
> > > > > > >         submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);
> > > > > > >
> > > > > > > +       /* point of no return, we _have_ to submit no matter what */
> > > > > > > +       drm_sched_job_arm(&submit->base);
> > > > > > > +
> > > > > > >         /*
> > > > > > >          * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> > > > > > >          * to the underlying fence.
> > > > > > > @@ -892,17 +892,16 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > > > > > >         if (submit->fence_id < 0) {
> > > > > > >                 ret = submit->fence_id = 0;
> > > > > > >                 submit->fence_id = 0;
> > > > > > > -               goto out;
> > > > > > >         }
> > > > > > >
> > > > > > > -       if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > > > > +       if (ret == 0 && args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > > > > > >                 struct sync_file *sync_file = sync_file_create(submit->user_fence);
> > > > > > >                 if (!sync_file) {
> > > > > > >                         ret = -ENOMEM;
> > > > > > > -                       goto out;
> > > > > > > +               } else {
> > > > > > > +                       fd_install(out_fence_fd, sync_file->file);
> > > > > > > +                       args->fence_fd = out_fence_fd;
> > > > > > >                 }
> > > > > > > -               fd_install(out_fence_fd, sync_file->file);
> > > > > > > -               args->fence_fd = out_fence_fd;
> > > > > >
> > > > > > I wonder if instead we should (approximately) undo "drm/msm/submit:
> > > > > > Simplify out-fence-fd handling" so that the point that it could fail
> > > > > > is moved up ahead of the drm_sched_job_arm()?
> > > > >
> > > > > Hm yeah. Up to you how you want to paint this shed, I think either is fine.
> > > > >
> > > > > > Also, does the dma_fence_get() work before drm_sched_job_arm()?  From
> > > > > > a quick look, it looks like it won't, but I'm still playing catchup
> > > > > > and haven't had a chance to look at your entire series.  If it doesn't
> > > > > > work before drm_sched_job_arm(), then there is really no way to
> > > > > > prevent a error path between the fence-init and job-submit.
> > > > >
> > > > > Yes. I thought I've checked that I put the _arm() in the right spot,
> > > > > but I guess I screwed up and you need the fence before the point where
> > > > > I've put the job_arm()? And yes the error path cannot be avoided for
> > > > > out-fences, that's what I tried to explain in the commit message.
> > > > >
> > > > > > But, prior to your series, wouldn't a failure after
> > > > > > drm_sched_job_init() but before the job is submitted just burn a
> > > > > > fence-id, and otherwise carry on it's merry way?
> > > > >
> > > > > Maybe? I'm not sure whether the scheduler gets confused about the gap
> > > > > and freak out abou that. I'm fairly new to that code and learning
> > > > > (which is part why I'm working on it). Since you look up in
> > > > > fences/syncobj after job_init() it should be pretty easy to whip up a
> > > > > testcase and see what happens. Also as long as nothing fails you won't
> > > > > see an issue, that's for sure.
> > > >
> > > > fair.. I'll try to come up with a test case.. pre-scheduler-conversion
> > > > it wasn't a problem to fail after the fence seqno was allocated (well,
> > > > I guess you might have problems if you had 2^31 failures in a row)
> > >
> > > Yeah one thing drm/sched forces you to do is have a very clear notion
> > > about the point of no return in your submit ioctl. Which I think is a
> > > Very Good Thing, at least looking at i915 execbuf where the point of
> > > no return is a multi-stage thing with such interesting intermediate
> > > points like "we submit the ruquest but without actually running the
> > > batchbuffer". The downside is that the submit ioctl isn't perfectly
> > > transaction anymore, but I don't think that matters for tha tail
> > > stuff, which is generally just some out-fence installing. That
> > > generally never fails.
> >
> > So I hacked up:
> >
> > ------
> > diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
> > b/drivers/gpu/drm/scheduler/sched_fence.c
> > index 3aa6351d2101..88e66dbc9515 100644
> > --- a/drivers/gpu/drm/scheduler/sched_fence.c
> > +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> > @@ -176,6 +176,7 @@ struct drm_sched_fence
> > *drm_sched_fence_create(struct drm_sched_entity *entity,
> >         fence->sched = entity->rq->sched;
> >         spin_lock_init(&fence->lock);
> >
> > +       seq = atomic_inc_return(&entity->fence_seq);
> >         seq = atomic_inc_return(&entity->fence_seq);
> >         dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
> >                        &fence->lock, entity->fence_context, seq);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index fcc601962e92..583e85adbbe0 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -593,6 +593,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
> >         if (!job->s_fence)
> >                 return -ENOMEM;
> >         job->id = atomic64_inc_return(&sched->job_id_count);
> > +       job->id = atomic64_inc_return(&sched->job_id_count);
> >
> >         INIT_LIST_HEAD(&job->list);
> >
> > ------
> >
> > (I guess the job->id part shouldn't really be needed, that looks like
> > it is only used by amdgpu)
> >
> > This didn't cause any problems that I could see.  So I don't *think* a
> > failure after drm_sched_job_init() is really problematic, as long as
> > things are serialized between drm_sched_job_init() and
> > drm_sched_entity_push_job().
> >
> > I also noticed that in the atomic commit path, the out-fences are
> > initialized before atomic-check.. so there should be plenty of
> > precedent for skipping fence seqno's.
>
> Oh I think I remember now. The reason why the split into init/arm is
> so that you can keep your critical section only around job_arm() and
> push_job(). My very first version just pulled the jobs_init() of that
> for most drivers to where I needed it, and that would result in a bit
> chaos because the fences would signal out of order potentially. But
> yeah I guess bailing out is fine with the scheduler.

ahh, that makes more sense

> Do you want me to tune down the commit message a bit, it's not a must
> to submit the job, but just makes a bit more sense than bailing out
> with a fence seqno reserved?

yeah, and I guess drop the fixme comment in the drm/msm patch..

I wonder if it would make sense to split drm_fence_init().. most of
the things we need the out-fence for prior to
drm_sched_entity_push_job() is, I think, just to have a dma_fence
reference.. which doesn't require having a seqno assigned yet.  Which
would let us move the critical section into
drm_sched_entity_push_job() itself.  (OTOH I suppose that opens up an
exciting new class of bugs, to have fences floating around without an
initialized seqno.)

BR,
-R


> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

[v5,02/20] drm/msm: Fix drm/sched point of no return rules

Commit Message

Comments

Patch