Message ID | 20240828094133.17402-2-pstanner@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2,1/2] drm/sched: memset() 'job' in drm_sched_job_init() | expand |
Luben? Christian? On Wed, 2024-08-28 at 11:41 +0200, Philipp Stanner wrote: > drm_sched_job_init() has no control over how users allocate struct > drm_sched_job. Unfortunately, the function can also not set some > struct > members such as job->sched. > > This could theoretically lead to UB by users dereferencing the > struct's > pointer members too early. > > It is easier to debug such issues if these pointers are initialized > to > NULL, so dereferencing them causes a NULL pointer exception. > Accordingly, drm_sched_entity_init() does precisely that and > initializes > its struct with memset(). > > Initialize parameter "job" to 0 in drm_sched_job_init(). > > Signed-off-by: Philipp Stanner <pstanner@redhat.com> > --- > No changes in v2. > --- > drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > b/drivers/gpu/drm/scheduler/sched_main.c > index 356c30fa24a8..b0c8ad10b419 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job > *job, > return -EINVAL; > } > > + /* > + * We don't know for sure how the user has allocated. Thus, > zero the > + * struct so that unallowed (i.e., too early) usage of > pointers that > + * this function does not set is guaranteed to lead to a > NULL pointer > + * exception instead of UB. > + */ > + memset(job, 0, sizeof(*job)); > + > job->entity = entity; > job->credits = credits; > job->s_fence = drm_sched_fence_alloc(entity, owner);
Hi, On 28/08/2024 10:41, Philipp Stanner wrote: > drm_sched_job_init() has no control over how users allocate struct > drm_sched_job. Unfortunately, the function can also not set some struct > members such as job->sched. job->sched usage from within looks like a bug. But not related to the memset you add. For this one something like this looks easiest for a start: diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index ab53ab486fe6..877113b01af2 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -788,7 +788,7 @@ int drm_sched_job_init(struct drm_sched_job *job, * or worse--a blank screen--leave a trail in the * logs, so this can be debugged easier. */ - drm_err(job->sched, "%s: entity has no rq!\n", __func__); + pr_err("%s: entity has no rq!\n", __func__); return -ENOENT; } Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable number of run-queues") Cc: <stable@vger.kernel.org> # v6.7+ > This could theoretically lead to UB by users dereferencing the struct's > pointer members too early. Hmm if drm_sched_job_init returned an error callers should not dereference anything. What was actually the issue you were debugging? Adding a memset is I think not the best solution since it is very likely redundant to someone doing a kzalloc in the first place. Regards, Tvrtko > It is easier to debug such issues if these pointers are initialized to > NULL, so dereferencing them causes a NULL pointer exception. > Accordingly, drm_sched_entity_init() does precisely that and initializes > its struct with memset(). > > Initialize parameter "job" to 0 in drm_sched_job_init(). > > Signed-off-by: Philipp Stanner <pstanner@redhat.com> > --- > No changes in v2. > --- > drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 356c30fa24a8..b0c8ad10b419 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job *job, > return -EINVAL; > } > > + /* > + * We don't know for sure how the user has allocated. Thus, zero the > + * struct so that unallowed (i.e., too early) usage of pointers that > + * this function does not set is guaranteed to lead to a NULL pointer > + * exception instead of UB. > + */ > + memset(job, 0, sizeof(*job)); > + > job->entity = entity; > job->credits = credits; > job->s_fence = drm_sched_fence_alloc(entity, owner);
On Fri, 2024-09-13 at 12:56 +0100, Tvrtko Ursulin wrote: > > Hi, > > On 28/08/2024 10:41, Philipp Stanner wrote: > > drm_sched_job_init() has no control over how users allocate struct > > drm_sched_job. Unfortunately, the function can also not set some > > struct > > members such as job->sched. > > job->sched usage from within looks like a bug. But not related to the > memset you add. > > For this one something like this looks easiest for a start: > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > b/drivers/gpu/drm/scheduler/sched_main.c > index ab53ab486fe6..877113b01af2 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -788,7 +788,7 @@ int drm_sched_job_init(struct drm_sched_job *job, > * or worse--a blank screen--leave a trail in the > * logs, so this can be debugged easier. > */ > - drm_err(job->sched, "%s: entity has no rq!\n", > __func__); > + pr_err("%s: entity has no rq!\n", __func__); > return -ENOENT; > } > > Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to > variable > number of run-queues") > Cc: <stable@vger.kernel.org> # v6.7+ Danilo and I already solved that: https://lore.kernel.org/all/20240827074521.12828-2-pstanner@redhat.com/ > > > This could theoretically lead to UB by users dereferencing the > > struct's > > pointer members too early. > > Hmm if drm_sched_job_init returned an error callers should not > dereference anything. What was actually the issue you were debugging? I was learning about the scheduler, wrote a dummy driver and had awkward behavior. Turned out it was this pointer not being initialized. I would have seen it immediately if it were NULL. The actual issue was and is IMO that a function called drm_sched_job_init() initializes the job. But it doesn't, it only partially initializes it. Only after drm_sched_job_arm() ran you're actually ready to go. > > Adding a memset is I think not the best solution since it is very > likely > redundant to someone doing a kzalloc in the first place. It is redundant in most cases, but it is effectively for free. I measured the runtime with 1e6 jobs with and without memset and there was no difference. P. > > Regards, > > Tvrtko > > > It is easier to debug such issues if these pointers are initialized > > to > > NULL, so dereferencing them causes a NULL pointer exception. > > Accordingly, drm_sched_entity_init() does precisely that and > > initializes > > its struct with memset(). > > > > Initialize parameter "job" to 0 in drm_sched_job_init(). > > > > Signed-off-by: Philipp Stanner <pstanner@redhat.com> > > --- > > No changes in v2. > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > b/drivers/gpu/drm/scheduler/sched_main.c > > index 356c30fa24a8..b0c8ad10b419 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job > > *job, > > return -EINVAL; > > } > > > > + /* > > + * We don't know for sure how the user has allocated. > > Thus, zero the > > + * struct so that unallowed (i.e., too early) usage of > > pointers that > > + * this function does not set is guaranteed to lead to a > > NULL pointer > > + * exception instead of UB. > > + */ > > + memset(job, 0, sizeof(*job)); > > + > > job->entity = entity; > > job->credits = credits; > > job->s_fence = drm_sched_fence_alloc(entity, owner); >
On 13/09/2024 13:30, Philipp Stanner wrote: > On Fri, 2024-09-13 at 12:56 +0100, Tvrtko Ursulin wrote: >> >> Hi, >> >> On 28/08/2024 10:41, Philipp Stanner wrote: >>> drm_sched_job_init() has no control over how users allocate struct >>> drm_sched_job. Unfortunately, the function can also not set some >>> struct >>> members such as job->sched. >> >> job->sched usage from within looks like a bug. But not related to the >> memset you add. >> >> For this one something like this looks easiest for a start: >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >> b/drivers/gpu/drm/scheduler/sched_main.c >> index ab53ab486fe6..877113b01af2 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -788,7 +788,7 @@ int drm_sched_job_init(struct drm_sched_job *job, >> * or worse--a blank screen--leave a trail in the >> * logs, so this can be debugged easier. >> */ >> - drm_err(job->sched, "%s: entity has no rq!\n", >> __func__); >> + pr_err("%s: entity has no rq!\n", __func__); >> return -ENOENT; >> } >> >> Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to >> variable >> number of run-queues") >> Cc: <stable@vger.kernel.org> # v6.7+ > > Danilo and I already solved that: > > https://lore.kernel.org/all/20240827074521.12828-2-pstanner@redhat.com/ Ah.. I saw the link to this in your maintainership thread and superficially assumed it is among the pending stuff. All good. >> >>> This could theoretically lead to UB by users dereferencing the >>> struct's >>> pointer members too early. >> >> Hmm if drm_sched_job_init returned an error callers should not >> dereference anything. What was actually the issue you were debugging? > > I was learning about the scheduler, wrote a dummy driver and had > awkward behavior. Turned out it was this pointer not being initialized. > I would have seen it immediately if it were NULL. > > The actual issue was and is IMO that a function called > drm_sched_job_init() initializes the job. But it doesn't, it only > partially initializes it. Only after drm_sched_job_arm() ran you're > actually ready to go. In my experience one good approach when developing stuff is to have the various kernel debugging aids enabled. Lockdep, SLAB debugging, memory poisoning, kfence.. Then if you were allocating your job without GFP_ZERO, _and_ dereferencing something too early out of misunderstanding of the API, you would get something obvious in the oops and not a random pointer. Which also applies to various CI systems, such as the Intel's one which already runs a debug kernel and a lot of these mistakes are caught instantly. >> Adding a memset is I think not the best solution since it is very >> likely >> redundant to someone doing a kzalloc in the first place. > > It is redundant in most cases, but it is effectively for free. I > measured the runtime with 1e6 jobs with and without memset and there > was no difference. I guess if kzalloc and drm_sched_job_init() are close enough in time so that cachelines stays put, and depending how you measure, it may be hard to see but cost if still there. For instance https://lore.kernel.org/amd-gfx/20240813140310.82706-1-tursulin@igalia.com/ I can see with perf that both memsets are hotspots even when testing with glxgears and vsync off. But I don't feel too strongly about this and there definitely is sense in initializing everything. Perhaps even instead of a memset we should use correct methods per field? Since in there we have spcs_node, atomic_t, ktime_t, dma_fence_cb (even in an annoying union), drm_sched_priority.. In an ideal world all those would have their initializers. But some don't so meh. Regards, Tvrtko >>> It is easier to debug such issues if these pointers are initialized >>> to >>> NULL, so dereferencing them causes a NULL pointer exception. >>> Accordingly, drm_sched_entity_init() does precisely that and >>> initializes >>> its struct with memset(). >>> >>> Initialize parameter "job" to 0 in drm_sched_job_init(). >>> >>> Signed-off-by: Philipp Stanner <pstanner@redhat.com> >>> --- >>> No changes in v2. >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++ >>> 1 file changed, 8 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >>> b/drivers/gpu/drm/scheduler/sched_main.c >>> index 356c30fa24a8..b0c8ad10b419 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job >>> *job, >>> return -EINVAL; >>> } >>> >>> + /* >>> + * We don't know for sure how the user has allocated. >>> Thus, zero the >>> + * struct so that unallowed (i.e., too early) usage of >>> pointers that >>> + * this function does not set is guaranteed to lead to a >>> NULL pointer >>> + * exception instead of UB. >>> + */ >>> + memset(job, 0, sizeof(*job)); >>> + >>> job->entity = entity; >>> job->credits = credits; >>> job->s_fence = drm_sched_fence_alloc(entity, owner); >> >
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 356c30fa24a8..b0c8ad10b419 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job *job, return -EINVAL; } + /* + * We don't know for sure how the user has allocated. Thus, zero the + * struct so that unallowed (i.e., too early) usage of pointers that + * this function does not set is guaranteed to lead to a NULL pointer + * exception instead of UB. + */ + memset(job, 0, sizeof(*job)); + job->entity = entity; job->credits = credits; job->s_fence = drm_sched_fence_alloc(entity, owner);
drm_sched_job_init() has no control over how users allocate struct drm_sched_job. Unfortunately, the function can also not set some struct members such as job->sched. This could theoretically lead to UB by users dereferencing the struct's pointer members too early. It is easier to debug such issues if these pointers are initialized to NULL, so dereferencing them causes a NULL pointer exception. Accordingly, drm_sched_entity_init() does precisely that and initializes its struct with memset(). Initialize parameter "job" to 0 in drm_sched_job_init(). Signed-off-by: Philipp Stanner <pstanner@redhat.com> --- No changes in v2. --- drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++ 1 file changed, 8 insertions(+)