Message ID | 1605936082-3099-11-git-send-email-andrey.grodzovsky@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | RFC Support hot device unplug in amdgpu | expand |
On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: > Avoids NULL ptr due to kobj->sd being unset on device removal. > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- > drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- > 2 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index caf828a..812e592 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -27,6 +27,7 @@ > #include <linux/uaccess.h> > #include <linux/reboot.h> > #include <linux/syscalls.h> > +#include <drm/drm_drv.h> > > #include "amdgpu.h" > #include "amdgpu_ras.h" > @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) > .attrs = attrs, > }; > > - sysfs_remove_group(&adev->dev->kobj, &group); > + if (!drm_dev_is_unplugged(&adev->ddev)) > + sysfs_remove_group(&adev->dev->kobj, &group); This looks wrong. sysfs, like any other interface, should be unconditionally thrown out when we do the drm_dev_unregister. Whether hotunplugged or not should matter at all. Either this isn't needed at all, or something is wrong with the ordering here. But definitely fishy. -Daniel > > return 0; > } > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > index 2b7c90b..54331fc 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > @@ -24,6 +24,7 @@ > #include <linux/firmware.h> > #include <linux/slab.h> > #include <linux/module.h> > +#include <drm/drm_drv.h> > > #include "amdgpu.h" > #include "amdgpu_ucode.h" > @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) > > void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) > { > - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > + if (!drm_dev_is_unplugged(&adev->ddev)) > + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > } > > static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, > -- > 2.7.4 >
On 11/24/20 9:49 AM, Daniel Vetter wrote: > On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: >> Avoids NULL ptr due to kobj->sd being unset on device removal. >> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- >> 2 files changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >> index caf828a..812e592 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >> @@ -27,6 +27,7 @@ >> #include <linux/uaccess.h> >> #include <linux/reboot.h> >> #include <linux/syscalls.h> >> +#include <drm/drm_drv.h> >> >> #include "amdgpu.h" >> #include "amdgpu_ras.h" >> @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) >> .attrs = attrs, >> }; >> >> - sysfs_remove_group(&adev->dev->kobj, &group); >> + if (!drm_dev_is_unplugged(&adev->ddev)) >> + sysfs_remove_group(&adev->dev->kobj, &group); > This looks wrong. sysfs, like any other interface, should be > unconditionally thrown out when we do the drm_dev_unregister. Whether > hotunplugged or not should matter at all. Either this isn't needed at all, > or something is wrong with the ordering here. But definitely fishy. > -Daniel So technically this is needed because kobejct's sysfs directory entry kobj->sd is set to NULL on device removal (from sysfs_remove_dir) but because we don't finalize the device until last reference to drm file is dropped (which can happen later) we end up calling sysfs_remove_file/dir after this pointer is NULL. sysfs_remove_file checks for NULL and aborts while sysfs_remove_dir is not and that why I guard against calls to sysfs_remove_dir. But indeed the whole approach in the driver is incorrect, as Greg pointed out - we should use default groups attributes instead of explicit calls to sysfs interface and this would save those troubles. But again. the issue here of scope of work, converting all of amdgpu to default groups attributes is somewhat lengthy process with extra testing as the entire driver is papered with sysfs references and seems to me more of a standalone cleanup, just like switching to devm_ and drmm_ work. To me at least it seems that it makes more sense to finalize and push the hot unplug patches so that this new functionality can be part of the driver sooner and then incrementally improve it by working on those other topics. Just as devm_/drmm_ I also added sysfs cleanup to my TODO list in the RFC patch. Andrey > >> >> return 0; >> } >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >> index 2b7c90b..54331fc 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >> @@ -24,6 +24,7 @@ >> #include <linux/firmware.h> >> #include <linux/slab.h> >> #include <linux/module.h> >> +#include <drm/drm_drv.h> >> >> #include "amdgpu.h" >> #include "amdgpu_ucode.h" >> @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) >> >> void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) >> { >> - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); >> + if (!drm_dev_is_unplugged(&adev->ddev)) >> + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); >> } >> >> static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, >> -- >> 2.7.4 >>
On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> wrote: > > > On 11/24/20 9:49 AM, Daniel Vetter wrote: > > On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: > >> Avoids NULL ptr due to kobj->sd being unset on device removal. > >> > >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- > >> 2 files changed, 6 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >> index caf828a..812e592 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >> @@ -27,6 +27,7 @@ > >> #include <linux/uaccess.h> > >> #include <linux/reboot.h> > >> #include <linux/syscalls.h> > >> +#include <drm/drm_drv.h> > >> > >> #include "amdgpu.h" > >> #include "amdgpu_ras.h" > >> @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) > >> .attrs = attrs, > >> }; > >> > >> - sysfs_remove_group(&adev->dev->kobj, &group); > >> + if (!drm_dev_is_unplugged(&adev->ddev)) > >> + sysfs_remove_group(&adev->dev->kobj, &group); > > This looks wrong. sysfs, like any other interface, should be > > unconditionally thrown out when we do the drm_dev_unregister. Whether > > hotunplugged or not should matter at all. Either this isn't needed at all, > > or something is wrong with the ordering here. But definitely fishy. > > -Daniel > > > So technically this is needed because kobejct's sysfs directory entry kobj->sd > is set to NULL > on device removal (from sysfs_remove_dir) but because we don't finalize the device > until last reference to drm file is dropped (which can happen later) we end up > calling sysfs_remove_file/dir after > this pointer is NULL. sysfs_remove_file checks for NULL and aborts while > sysfs_remove_dir > is not and that why I guard against calls to sysfs_remove_dir. > But indeed the whole approach in the driver is incorrect, as Greg pointed out - > we should use > default groups attributes instead of explicit calls to sysfs interface and this > would save those troubles. > But again. the issue here of scope of work, converting all of amdgpu to default > groups attributes is somewhat > lengthy process with extra testing as the entire driver is papered with sysfs > references and seems to me more of a standalone > cleanup, just like switching to devm_ and drmm_ work. To me at least it seems > that it makes more sense > to finalize and push the hot unplug patches so that this new functionality can > be part of the driver sooner > and then incrementally improve it by working on those other topics. Just as > devm_/drmm_ I also added sysfs cleanup > to my TODO list in the RFC patch. Hm, whether you solve this with the default group stuff to auto-remove, or remove explicitly at the right time doesn't matter much. The underlying problem you have here is that it's done way too late. sysfs removal (like all uapi interfaces) need to be removed as part of drm_dev_unregister. I guess aside from the split into fini_hw and fini_sw, you also need an unregister_late callback (like we have already for drm_connector, so that e.g. backlight and similar stuff can be unregistered). Papering over the underlying bug like this doesn't really fix much, the lifetimes are still wrong. -Daniel > > Andrey > > > > > >> > >> return 0; > >> } > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > >> index 2b7c90b..54331fc 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > >> @@ -24,6 +24,7 @@ > >> #include <linux/firmware.h> > >> #include <linux/slab.h> > >> #include <linux/module.h> > >> +#include <drm/drm_drv.h> > >> > >> #include "amdgpu.h" > >> #include "amdgpu_ucode.h" > >> @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) > >> > >> void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) > >> { > >> - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > >> + if (!drm_dev_is_unplugged(&adev->ddev)) > >> + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > >> } > >> > >> static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, > >> -- > >> 2.7.4 > >>
On 11/25/20 4:04 AM, Daniel Vetter wrote: > On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky > <Andrey.Grodzovsky@amd.com> wrote: >> >> On 11/24/20 9:49 AM, Daniel Vetter wrote: >>> On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: >>>> Avoids NULL ptr due to kobj->sd being unset on device removal. >>>> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- >>>> 2 files changed, 6 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >>>> index caf828a..812e592 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >>>> @@ -27,6 +27,7 @@ >>>> #include <linux/uaccess.h> >>>> #include <linux/reboot.h> >>>> #include <linux/syscalls.h> >>>> +#include <drm/drm_drv.h> >>>> >>>> #include "amdgpu.h" >>>> #include "amdgpu_ras.h" >>>> @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) >>>> .attrs = attrs, >>>> }; >>>> >>>> - sysfs_remove_group(&adev->dev->kobj, &group); >>>> + if (!drm_dev_is_unplugged(&adev->ddev)) >>>> + sysfs_remove_group(&adev->dev->kobj, &group); >>> This looks wrong. sysfs, like any other interface, should be >>> unconditionally thrown out when we do the drm_dev_unregister. Whether >>> hotunplugged or not should matter at all. Either this isn't needed at all, >>> or something is wrong with the ordering here. But definitely fishy. >>> -Daniel >> >> So technically this is needed because kobejct's sysfs directory entry kobj->sd >> is set to NULL >> on device removal (from sysfs_remove_dir) but because we don't finalize the device >> until last reference to drm file is dropped (which can happen later) we end up >> calling sysfs_remove_file/dir after >> this pointer is NULL. sysfs_remove_file checks for NULL and aborts while >> sysfs_remove_dir >> is not and that why I guard against calls to sysfs_remove_dir. >> But indeed the whole approach in the driver is incorrect, as Greg pointed out - >> we should use >> default groups attributes instead of explicit calls to sysfs interface and this >> would save those troubles. >> But again. the issue here of scope of work, converting all of amdgpu to default >> groups attributes is somewhat >> lengthy process with extra testing as the entire driver is papered with sysfs >> references and seems to me more of a standalone >> cleanup, just like switching to devm_ and drmm_ work. To me at least it seems >> that it makes more sense >> to finalize and push the hot unplug patches so that this new functionality can >> be part of the driver sooner >> and then incrementally improve it by working on those other topics. Just as >> devm_/drmm_ I also added sysfs cleanup >> to my TODO list in the RFC patch. > Hm, whether you solve this with the default group stuff to > auto-remove, or remove explicitly at the right time doesn't matter > much. The underlying problem you have here is that it's done way too > late. As far as I understood correctly the default group attrs by reading this article by Greg - https://www.linux.com/news/how-create-sysfs-file-correctly/ it will be removed together with the device and not too late like now and I quote from the last paragraph there: "By setting this value, you don’t have to do anything in your probe() or release() functions at all in order for the sysfs files to be properly created and destroyed whenever your device is added or removed from the system. And you will, most importantly, do it in a race-free manner, which is always a good thing." To me this seems like the best solution to the late remove issue. What do you think ? > sysfs removal (like all uapi interfaces) need to be removed as > part of drm_dev_unregister. Do you mean we need to trace and aggregate all sysfs files creation within the low level drivers and then call some sysfs release function inside drm_dev_unregister to iterate and release them all ? > I guess aside from the split into fini_hw > and fini_sw, you also need an unregister_late callback (like we have > already for drm_connector, so that e.g. backlight and similar stuff > can be unregistered). Is this the callback you suggest to call from within drm_dev_unregister and it will be responsible to release all sysfs files created within the driver ? Andrey > > Papering over the underlying bug like this doesn't really fix much, > the lifetimes are still wrong. > -Daniel > >> Andrey >> >> >>>> return 0; >>>> } >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >>>> index 2b7c90b..54331fc 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >>>> @@ -24,6 +24,7 @@ >>>> #include <linux/firmware.h> >>>> #include <linux/slab.h> >>>> #include <linux/module.h> >>>> +#include <drm/drm_drv.h> >>>> >>>> #include "amdgpu.h" >>>> #include "amdgpu_ucode.h" >>>> @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) >>>> >>>> void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) >>>> { >>>> - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); >>>> + if (!drm_dev_is_unplugged(&adev->ddev)) >>>> + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); >>>> } >>>> >>>> static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, >>>> -- >>>> 2.7.4 >>>> > >
Hey, just a ping on my comments/question bellow. Andrey
On Wed, Nov 25, 2020 at 12:39:47PM -0500, Andrey Grodzovsky wrote: > > On 11/25/20 4:04 AM, Daniel Vetter wrote: > > On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky > > <Andrey.Grodzovsky@amd.com> wrote: > > > > > > On 11/24/20 9:49 AM, Daniel Vetter wrote: > > > > On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: > > > > > Avoids NULL ptr due to kobj->sd being unset on device removal. > > > > > > > > > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> > > > > > --- > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- > > > > > 2 files changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > > > index caf828a..812e592 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > > > @@ -27,6 +27,7 @@ > > > > > #include <linux/uaccess.h> > > > > > #include <linux/reboot.h> > > > > > #include <linux/syscalls.h> > > > > > +#include <drm/drm_drv.h> > > > > > > > > > > #include "amdgpu.h" > > > > > #include "amdgpu_ras.h" > > > > > @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) > > > > > .attrs = attrs, > > > > > }; > > > > > > > > > > - sysfs_remove_group(&adev->dev->kobj, &group); > > > > > + if (!drm_dev_is_unplugged(&adev->ddev)) > > > > > + sysfs_remove_group(&adev->dev->kobj, &group); > > > > This looks wrong. sysfs, like any other interface, should be > > > > unconditionally thrown out when we do the drm_dev_unregister. Whether > > > > hotunplugged or not should matter at all. Either this isn't needed at all, > > > > or something is wrong with the ordering here. But definitely fishy. > > > > -Daniel > > > > > > So technically this is needed because kobejct's sysfs directory entry kobj->sd > > > is set to NULL > > > on device removal (from sysfs_remove_dir) but because we don't finalize the device > > > until last reference to drm file is dropped (which can happen later) we end up > > > calling sysfs_remove_file/dir after > > > this pointer is NULL. sysfs_remove_file checks for NULL and aborts while > > > sysfs_remove_dir > > > is not and that why I guard against calls to sysfs_remove_dir. > > > But indeed the whole approach in the driver is incorrect, as Greg pointed out - > > > we should use > > > default groups attributes instead of explicit calls to sysfs interface and this > > > would save those troubles. > > > But again. the issue here of scope of work, converting all of amdgpu to default > > > groups attributes is somewhat > > > lengthy process with extra testing as the entire driver is papered with sysfs > > > references and seems to me more of a standalone > > > cleanup, just like switching to devm_ and drmm_ work. To me at least it seems > > > that it makes more sense > > > to finalize and push the hot unplug patches so that this new functionality can > > > be part of the driver sooner > > > and then incrementally improve it by working on those other topics. Just as > > > devm_/drmm_ I also added sysfs cleanup > > > to my TODO list in the RFC patch. > > Hm, whether you solve this with the default group stuff to > > auto-remove, or remove explicitly at the right time doesn't matter > > much. The underlying problem you have here is that it's done way too > > late. > > As far as I understood correctly the default group attrs by reading this > article by Greg - https://www.linux.com/news/how-create-sysfs-file-correctly/ > it will be removed together with the device and not too late like now and I quote > from the last paragraph there: > > "By setting this value, you don’t have to do anything in your > probe() or release() functions at all in order for the > sysfs files to be properly created and destroyed whenever your > device is added or removed from the system. And you will, most > importantly, do it in a race-free manner, which is always a good thing." > > To me this seems like the best solution to the late remove issue. What do > you think ? > > > > sysfs removal (like all uapi interfaces) need to be removed as > > part of drm_dev_unregister. > > > Do you mean we need to trace and aggregate all sysfs files creation within > the low level drivers and then call some sysfs release function inside > drm_dev_unregister > to iterate and release them all ? That would just reinvent the proper solution Greg explained above. For now I think you just need some driver callback that you call right after drm_dev_unplug (or drm_dev_unregister) to clean up these sysfs interfaces. Afaiui the important part is to clean up your additional interfaces from the ->remove callback, since at that point the core sysfs stuff still exists. Maybe you want to do another loop over all IP blocks and a ->unregister callback, or maybe it's just 1-2 cases you call directly. > > I guess aside from the split into fini_hw > > and fini_sw, you also need an unregister_late callback (like we have > > already for drm_connector, so that e.g. backlight and similar stuff > > can be unregistered). > > > Is this the callback you suggest to call from within drm_dev_unregister and > it will be responsible to release all sysfs files created within the driver ? Nah that would be an amdgpu ip block callback (forgot what it's called, too comfy to fire up an editor right now and look it up, but you have a bunch of these loops all over). I think the core solution we want is what Greg already laid out. This idea here was just an amdgpu interim plan, if the core solution is a bit too invasive to implement right away. -Daniel > > Andrey > > > > > > Papering over the underlying bug like this doesn't really fix much, > > the lifetimes are still wrong. > > -Daniel > > > > > Andrey > > > > > > > > > > > return 0; > > > > > } > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > > > > > index 2b7c90b..54331fc 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > > > > > @@ -24,6 +24,7 @@ > > > > > #include <linux/firmware.h> > > > > > #include <linux/slab.h> > > > > > #include <linux/module.h> > > > > > +#include <drm/drm_drv.h> > > > > > > > > > > #include "amdgpu.h" > > > > > #include "amdgpu_ucode.h" > > > > > @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) > > > > > > > > > > void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) > > > > > { > > > > > - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > > > > > + if (!drm_dev_is_unplugged(&adev->ddev)) > > > > > + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > > > > > } > > > > > > > > > > static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, > > > > > -- > > > > > 2.7.4 > > > > > > > > >
On 11/27/20 10:04 AM, Daniel Vetter wrote: > On Wed, Nov 25, 2020 at 12:39:47PM -0500, Andrey Grodzovsky wrote: >> On 11/25/20 4:04 AM, Daniel Vetter wrote: >>> On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky >>> <Andrey.Grodzovsky@amd.com> wrote: >>>> On 11/24/20 9:49 AM, Daniel Vetter wrote: >>>>> On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: >>>>>> Avoids NULL ptr due to kobj->sd being unset on device removal. >>>>>> >>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- >>>>>> 2 files changed, 6 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >>>>>> index caf828a..812e592 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c >>>>>> @@ -27,6 +27,7 @@ >>>>>> #include <linux/uaccess.h> >>>>>> #include <linux/reboot.h> >>>>>> #include <linux/syscalls.h> >>>>>> +#include <drm/drm_drv.h> >>>>>> >>>>>> #include "amdgpu.h" >>>>>> #include "amdgpu_ras.h" >>>>>> @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) >>>>>> .attrs = attrs, >>>>>> }; >>>>>> >>>>>> - sysfs_remove_group(&adev->dev->kobj, &group); >>>>>> + if (!drm_dev_is_unplugged(&adev->ddev)) >>>>>> + sysfs_remove_group(&adev->dev->kobj, &group); >>>>> This looks wrong. sysfs, like any other interface, should be >>>>> unconditionally thrown out when we do the drm_dev_unregister. Whether >>>>> hotunplugged or not should matter at all. Either this isn't needed at all, >>>>> or something is wrong with the ordering here. But definitely fishy. >>>>> -Daniel >>>> So technically this is needed because kobejct's sysfs directory entry kobj->sd >>>> is set to NULL >>>> on device removal (from sysfs_remove_dir) but because we don't finalize the device >>>> until last reference to drm file is dropped (which can happen later) we end up >>>> calling sysfs_remove_file/dir after >>>> this pointer is NULL. sysfs_remove_file checks for NULL and aborts while >>>> sysfs_remove_dir >>>> is not and that why I guard against calls to sysfs_remove_dir. >>>> But indeed the whole approach in the driver is incorrect, as Greg pointed out - >>>> we should use >>>> default groups attributes instead of explicit calls to sysfs interface and this >>>> would save those troubles. >>>> But again. the issue here of scope of work, converting all of amdgpu to default >>>> groups attributes is somewhat >>>> lengthy process with extra testing as the entire driver is papered with sysfs >>>> references and seems to me more of a standalone >>>> cleanup, just like switching to devm_ and drmm_ work. To me at least it seems >>>> that it makes more sense >>>> to finalize and push the hot unplug patches so that this new functionality can >>>> be part of the driver sooner >>>> and then incrementally improve it by working on those other topics. Just as >>>> devm_/drmm_ I also added sysfs cleanup >>>> to my TODO list in the RFC patch. >>> Hm, whether you solve this with the default group stuff to >>> auto-remove, or remove explicitly at the right time doesn't matter >>> much. The underlying problem you have here is that it's done way too >>> late. >> As far as I understood correctly the default group attrs by reading this >> article by Greg - https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linux.com%2Fnews%2Fhow-create-sysfs-file-correctly%2F&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C3e993d1dfad746ffff2608d892e5bbb9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637420862696611997%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HAlEqI6CYR3k1n9FFAibpjBlK7I7x9W23yd5CWJVYgM%3D&reserved=0 >> it will be removed together with the device and not too late like now and I quote >> from the last paragraph there: >> >> "By setting this value, you don’t have to do anything in your >> probe() or release() functions at all in order for the >> sysfs files to be properly created and destroyed whenever your >> device is added or removed from the system. And you will, most >> importantly, do it in a race-free manner, which is always a good thing." >> >> To me this seems like the best solution to the late remove issue. What do >> you think ? >> >> >>> sysfs removal (like all uapi interfaces) need to be removed as >>> part of drm_dev_unregister. >> >> Do you mean we need to trace and aggregate all sysfs files creation within >> the low level drivers and then call some sysfs release function inside >> drm_dev_unregister >> to iterate and release them all ? > That would just reinvent the proper solution Greg explained above. For now > I think you just need some driver callback that you call right after > drm_dev_unplug (or drm_dev_unregister) to clean up these sysfs interfaces. > Afaiui the important part is to clean up your additional interfaces from > the ->remove callback, since at that point the core sysfs stuff still > exists. > > Maybe you want to do another loop over all IP blocks and a ->unregister > callback, or maybe it's just 1-2 cases you call directly. Most of them are barried within non ip block entites (e.g amdgpu_device_fini->amdgpu_atombios_fini->amdgpu_atombios_fini->device_remove_file->sysfs_remove_file) or much longer chain in kfd, like amdgpu_device_fini->.....kfd_remove_sysfs_node_entry->kfd_remove_sysfs_file->sysfs_remove_file and so they will will need to be accessed explicitly by creating some accessors functions in their public API in multiple layers. > >>> I guess aside from the split into fini_hw >>> and fini_sw, you also need an unregister_late callback (like we have >>> already for drm_connector, so that e.g. backlight and similar stuff >>> can be unregistered). >> >> Is this the callback you suggest to call from within drm_dev_unregister and >> it will be responsible to release all sysfs files created within the driver ? > Nah that would be an amdgpu ip block callback (forgot what it's called, > too comfy to fire up an editor right now and look it up, but you have a > bunch of these loops all over). > > I think the core solution we want is what Greg already laid out. This idea > here was just an amdgpu interim plan, if the core solution is a bit too > invasive to implement right away. > -Daniel By what I showed above to me it looks that the interim solution might be not really less invasive then the right solution by Greg and so if you feel that this is a blocker for the entire patch set and we absolutely can't live with the temporary band aid which this patch represents then I will just do the real solution as a standalone patch set because I think this one is a big enough change on it's own to combine it with the hot device unplug topic. Andrey > >> Andrey >> >> >>> Papering over the underlying bug like this doesn't really fix much, >>> the lifetimes are still wrong. >>> -Daniel >>> >>>> Andrey >>>> >>>> >>>>>> return 0; >>>>>> } >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >>>>>> index 2b7c90b..54331fc 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c >>>>>> @@ -24,6 +24,7 @@ >>>>>> #include <linux/firmware.h> >>>>>> #include <linux/slab.h> >>>>>> #include <linux/module.h> >>>>>> +#include <drm/drm_drv.h> >>>>>> >>>>>> #include "amdgpu.h" >>>>>> #include "amdgpu_ucode.h" >>>>>> @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) >>>>>> >>>>>> void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) >>>>>> { >>>>>> - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); >>>>>> + if (!drm_dev_is_unplugged(&adev->ddev)) >>>>>> + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); >>>>>> } >>>>>> >>>>>> static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, >>>>>> -- >>>>>> 2.7.4 >>>>>> >>>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index caf828a..812e592 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -27,6 +27,7 @@ #include <linux/uaccess.h> #include <linux/reboot.h> #include <linux/syscalls.h> +#include <drm/drm_drv.h> #include "amdgpu.h" #include "amdgpu_ras.h" @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) .attrs = attrs, }; - sysfs_remove_group(&adev->dev->kobj, &group); + if (!drm_dev_is_unplugged(&adev->ddev)) + sysfs_remove_group(&adev->dev->kobj, &group); return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c index 2b7c90b..54331fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c @@ -24,6 +24,7 @@ #include <linux/firmware.h> #include <linux/slab.h> #include <linux/module.h> +#include <drm/drm_drv.h> #include "amdgpu.h" #include "amdgpu_ucode.h" @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) { - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); + if (!drm_dev_is_unplugged(&adev->ddev)) + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); } static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev,
Avoids NULL ptr due to kobj->sd being unset on device removal. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- 2 files changed, 6 insertions(+), 2 deletions(-)