Message ID | 169602898991.904193.3059334392093961032.stgit@dwillia2-xfh.jf.intel.com |
---|---|
State | Superseded |
Headers | show |
Series | cxl/mem: Fix shutdown order | expand |
Dan Williams wrote: > Ira reports that removing cxl_mock_mem causes a crash with the following > trace: > > BUG: kernel NULL pointer dereference, address: 0000000000000044 > [..] > RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] > [..] > Call Trace: > <TASK> > cxl_region_detach+0xe8/0x210 [cxl_core] > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > cxld_unregister+0x29/0x40 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > device_unregister+0x13/0x60 > devm_release_action+0x4d/0x90 > ? __pfx_unregister_port+0x10/0x10 [cxl_core] > delete_endpoint+0x121/0x130 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > ? lock_release+0x142/0x290 > cdev_device_del+0x15/0x50 > cxl_memdev_unregister+0x54/0x70 [cxl_core] > > This crash is due to the clearing out the cxl_memdev's driver context > (@cxlds) before the subsystem is done with it. This is ultimately due to > the region(s), that this memdev is a member, being torn down and expecting > to be able to de-reference @cxlds, like here: > > static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > ... > if (cxlds->rcd) > goto endpoint_reset; > ... > > Fix it by keeping the driver context valid until memdev-device > unregistration, and subsequently the entire stack of related > dependencies, unwinds. > > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") > Reported-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com>
On Fri, 29 Sep 2023 16:52:06 -0700 Ira Weiny <ira.weiny@intel.com> wrote: > Dan Williams wrote: > > Ira reports that removing cxl_mock_mem causes a crash with the following > > trace: > > > > BUG: kernel NULL pointer dereference, address: 0000000000000044 > > [..] > > RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] > > [..] > > Call Trace: > > <TASK> > > cxl_region_detach+0xe8/0x210 [cxl_core] > > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > > cxld_unregister+0x29/0x40 [cxl_core] > > devres_release_all+0xb8/0x110 > > device_unbind_cleanup+0xe/0x70 > > device_release_driver_internal+0x1d2/0x210 > > bus_remove_device+0xd7/0x150 > > device_del+0x155/0x3e0 > > device_unregister+0x13/0x60 > > devm_release_action+0x4d/0x90 > > ? __pfx_unregister_port+0x10/0x10 [cxl_core] > > delete_endpoint+0x121/0x130 [cxl_core] > > devres_release_all+0xb8/0x110 > > device_unbind_cleanup+0xe/0x70 > > device_release_driver_internal+0x1d2/0x210 > > bus_remove_device+0xd7/0x150 > > device_del+0x155/0x3e0 > > ? lock_release+0x142/0x290 > > cdev_device_del+0x15/0x50 > > cxl_memdev_unregister+0x54/0x70 [cxl_core] > > > > This crash is due to the clearing out the cxl_memdev's driver context > > (@cxlds) before the subsystem is done with it. This is ultimately due to > > the region(s), that this memdev is a member, being torn down and expecting > > to be able to de-reference @cxlds, like here: > > > > static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > > ... > > if (cxlds->rcd) > > goto endpoint_reset; > > ... > > > > Fix it by keeping the driver context valid until memdev-device > > unregistration, and subsequently the entire stack of related > > dependencies, unwinds. > > > > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") > > Reported-by: Ira Weiny <ira.weiny@intel.com> > > Reviewed-by: Ira Weiny <ira.weiny@intel.com> > Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Thanks for cleaning up the whole area whilst fixing this. Jonathan
On 9/29/23 16:09, Dan Williams wrote: > Ira reports that removing cxl_mock_mem causes a crash with the following > trace: > > BUG: kernel NULL pointer dereference, address: 0000000000000044 > [..] > RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] > [..] > Call Trace: > <TASK> > cxl_region_detach+0xe8/0x210 [cxl_core] > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > cxld_unregister+0x29/0x40 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > device_unregister+0x13/0x60 > devm_release_action+0x4d/0x90 > ? __pfx_unregister_port+0x10/0x10 [cxl_core] > delete_endpoint+0x121/0x130 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > ? lock_release+0x142/0x290 > cdev_device_del+0x15/0x50 > cxl_memdev_unregister+0x54/0x70 [cxl_core] > > This crash is due to the clearing out the cxl_memdev's driver context > (@cxlds) before the subsystem is done with it. This is ultimately due to > the region(s), that this memdev is a member, being torn down and expecting > to be able to de-reference @cxlds, like here: > > static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > ... > if (cxlds->rcd) > goto endpoint_reset; > ... > > Fix it by keeping the driver context valid until memdev-device > unregistration, and subsequently the entire stack of related > dependencies, unwinds. > > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") > Reported-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > --- > drivers/cxl/core/memdev.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c > index a950091e5640..e78b5ead14fa 100644 > --- a/drivers/cxl/core/memdev.c > +++ b/drivers/cxl/core/memdev.c > @@ -570,8 +570,8 @@ static void cxl_memdev_unregister(void *_cxlmd) > struct cxl_memdev *cxlmd = _cxlmd; > struct device *dev = &cxlmd->dev; > > - cxl_memdev_shutdown(dev); > cdev_device_del(&cxlmd->cdev, dev); > + cxl_memdev_shutdown(dev); > put_device(dev); > } > >
On Fri, 29 Sep 2023, Dan Williams wrote: >Ira reports that removing cxl_mock_mem causes a crash with the following >trace: > > BUG: kernel NULL pointer dereference, address: 0000000000000044 > [..] > RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] > [..] > Call Trace: > <TASK> > cxl_region_detach+0xe8/0x210 [cxl_core] > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > cxld_unregister+0x29/0x40 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > device_unregister+0x13/0x60 > devm_release_action+0x4d/0x90 > ? __pfx_unregister_port+0x10/0x10 [cxl_core] > delete_endpoint+0x121/0x130 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > ? lock_release+0x142/0x290 > cdev_device_del+0x15/0x50 > cxl_memdev_unregister+0x54/0x70 [cxl_core] > >This crash is due to the clearing out the cxl_memdev's driver context >(@cxlds) before the subsystem is done with it. This is ultimately due to >the region(s), that this memdev is a member, being torn down and expecting >to be able to de-reference @cxlds, like here: > >static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) >... > if (cxlds->rcd) > goto endpoint_reset; >... > >Fix it by keeping the driver context valid until memdev-device >unregistration, and subsequently the entire stack of related >dependencies, unwinds. > >Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") >Reported-by: Ira Weiny <ira.weiny@intel.com> >Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c index a950091e5640..e78b5ead14fa 100644 --- a/drivers/cxl/core/memdev.c +++ b/drivers/cxl/core/memdev.c @@ -570,8 +570,8 @@ static void cxl_memdev_unregister(void *_cxlmd) struct cxl_memdev *cxlmd = _cxlmd; struct device *dev = &cxlmd->dev; - cxl_memdev_shutdown(dev); cdev_device_del(&cxlmd->cdev, dev); + cxl_memdev_shutdown(dev); put_device(dev); }
Ira reports that removing cxl_mock_mem causes a crash with the following trace: BUG: kernel NULL pointer dereference, address: 0000000000000044 [..] RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] [..] Call Trace: <TASK> cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x29/0x40 [cxl_core] devres_release_all+0xb8/0x110 device_unbind_cleanup+0xe/0x70 device_release_driver_internal+0x1d2/0x210 bus_remove_device+0xd7/0x150 device_del+0x155/0x3e0 device_unregister+0x13/0x60 devm_release_action+0x4d/0x90 ? __pfx_unregister_port+0x10/0x10 [cxl_core] delete_endpoint+0x121/0x130 [cxl_core] devres_release_all+0xb8/0x110 device_unbind_cleanup+0xe/0x70 device_release_driver_internal+0x1d2/0x210 bus_remove_device+0xd7/0x150 device_del+0x155/0x3e0 ? lock_release+0x142/0x290 cdev_device_del+0x15/0x50 cxl_memdev_unregister+0x54/0x70 [cxl_core] This crash is due to the clearing out the cxl_memdev's driver context (@cxlds) before the subsystem is done with it. This is ultimately due to the region(s), that this memdev is a member, being torn down and expecting to be able to de-reference @cxlds, like here: static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) ... if (cxlds->rcd) goto endpoint_reset; ... Fix it by keeping the driver context valid until memdev-device unregistration, and subsequently the entire stack of related dependencies, unwinds. Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") Reported-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/cxl/core/memdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)