diff mbox series

[v2,4/4] cxl/mem: Fix shutdown order

Message ID 169602898991.904193.3059334392093961032.stgit@dwillia2-xfh.jf.intel.com
State Superseded
Headers show
Series cxl/mem: Fix shutdown order | expand

Commit Message

Dan Williams Sept. 29, 2023, 11:09 p.m. UTC
Ira reports that removing cxl_mock_mem causes a crash with the following
trace:

 BUG: kernel NULL pointer dereference, address: 0000000000000044
 [..]
 RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
 [..]
 Call Trace:
  <TASK>
  cxl_region_detach+0xe8/0x210 [cxl_core]
  cxl_decoder_kill_region+0x27/0x40 [cxl_core]
  cxld_unregister+0x29/0x40 [cxl_core]
  devres_release_all+0xb8/0x110
  device_unbind_cleanup+0xe/0x70
  device_release_driver_internal+0x1d2/0x210
  bus_remove_device+0xd7/0x150
  device_del+0x155/0x3e0
  device_unregister+0x13/0x60
  devm_release_action+0x4d/0x90
  ? __pfx_unregister_port+0x10/0x10 [cxl_core]
  delete_endpoint+0x121/0x130 [cxl_core]
  devres_release_all+0xb8/0x110
  device_unbind_cleanup+0xe/0x70
  device_release_driver_internal+0x1d2/0x210
  bus_remove_device+0xd7/0x150
  device_del+0x155/0x3e0
  ? lock_release+0x142/0x290
  cdev_device_del+0x15/0x50
  cxl_memdev_unregister+0x54/0x70 [cxl_core]

This crash is due to the clearing out the cxl_memdev's driver context
(@cxlds) before the subsystem is done with it. This is ultimately due to
the region(s), that this memdev is a member, being torn down and expecting
to be able to de-reference @cxlds, like here:

static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
...
                if (cxlds->rcd)
                        goto endpoint_reset;
...

Fix it by keeping the driver context valid until memdev-device
unregistration, and subsequently the entire stack of related
dependencies, unwinds.

Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
Reported-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/memdev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Ira Weiny Sept. 29, 2023, 11:52 p.m. UTC | #1
Dan Williams wrote:
> Ira reports that removing cxl_mock_mem causes a crash with the following
> trace:
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000044
>  [..]
>  RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
>  [..]
>  Call Trace:
>   <TASK>
>   cxl_region_detach+0xe8/0x210 [cxl_core]
>   cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>   cxld_unregister+0x29/0x40 [cxl_core]
>   devres_release_all+0xb8/0x110
>   device_unbind_cleanup+0xe/0x70
>   device_release_driver_internal+0x1d2/0x210
>   bus_remove_device+0xd7/0x150
>   device_del+0x155/0x3e0
>   device_unregister+0x13/0x60
>   devm_release_action+0x4d/0x90
>   ? __pfx_unregister_port+0x10/0x10 [cxl_core]
>   delete_endpoint+0x121/0x130 [cxl_core]
>   devres_release_all+0xb8/0x110
>   device_unbind_cleanup+0xe/0x70
>   device_release_driver_internal+0x1d2/0x210
>   bus_remove_device+0xd7/0x150
>   device_del+0x155/0x3e0
>   ? lock_release+0x142/0x290
>   cdev_device_del+0x15/0x50
>   cxl_memdev_unregister+0x54/0x70 [cxl_core]
> 
> This crash is due to the clearing out the cxl_memdev's driver context
> (@cxlds) before the subsystem is done with it. This is ultimately due to
> the region(s), that this memdev is a member, being torn down and expecting
> to be able to de-reference @cxlds, like here:
> 
> static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> ...
>                 if (cxlds->rcd)
>                         goto endpoint_reset;
> ...
> 
> Fix it by keeping the driver context valid until memdev-device
> unregistration, and subsequently the entire stack of related
> dependencies, unwinds.
> 
> Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
> Reported-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Jonathan Cameron Oct. 2, 2023, 10:11 a.m. UTC | #2
On Fri, 29 Sep 2023 16:52:06 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Dan Williams wrote:
> > Ira reports that removing cxl_mock_mem causes a crash with the following
> > trace:
> > 
> >  BUG: kernel NULL pointer dereference, address: 0000000000000044
> >  [..]
> >  RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
> >  [..]
> >  Call Trace:
> >   <TASK>
> >   cxl_region_detach+0xe8/0x210 [cxl_core]
> >   cxl_decoder_kill_region+0x27/0x40 [cxl_core]
> >   cxld_unregister+0x29/0x40 [cxl_core]
> >   devres_release_all+0xb8/0x110
> >   device_unbind_cleanup+0xe/0x70
> >   device_release_driver_internal+0x1d2/0x210
> >   bus_remove_device+0xd7/0x150
> >   device_del+0x155/0x3e0
> >   device_unregister+0x13/0x60
> >   devm_release_action+0x4d/0x90
> >   ? __pfx_unregister_port+0x10/0x10 [cxl_core]
> >   delete_endpoint+0x121/0x130 [cxl_core]
> >   devres_release_all+0xb8/0x110
> >   device_unbind_cleanup+0xe/0x70
> >   device_release_driver_internal+0x1d2/0x210
> >   bus_remove_device+0xd7/0x150
> >   device_del+0x155/0x3e0
> >   ? lock_release+0x142/0x290
> >   cdev_device_del+0x15/0x50
> >   cxl_memdev_unregister+0x54/0x70 [cxl_core]
> > 
> > This crash is due to the clearing out the cxl_memdev's driver context
> > (@cxlds) before the subsystem is done with it. This is ultimately due to
> > the region(s), that this memdev is a member, being torn down and expecting
> > to be able to de-reference @cxlds, like here:
> > 
> > static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> > ...
> >                 if (cxlds->rcd)
> >                         goto endpoint_reset;
> > ...
> > 
> > Fix it by keeping the driver context valid until memdev-device
> > unregistration, and subsequently the entire stack of related
> > dependencies, unwinds.
> > 
> > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
> > Reported-by: Ira Weiny <ira.weiny@intel.com>  
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks for cleaning up the whole area whilst fixing this.

Jonathan
Dave Jiang Oct. 2, 2023, 4:59 p.m. UTC | #3
On 9/29/23 16:09, Dan Williams wrote:
> Ira reports that removing cxl_mock_mem causes a crash with the following
> trace:
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000044
>  [..]
>  RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
>  [..]
>  Call Trace:
>   <TASK>
>   cxl_region_detach+0xe8/0x210 [cxl_core]
>   cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>   cxld_unregister+0x29/0x40 [cxl_core]
>   devres_release_all+0xb8/0x110
>   device_unbind_cleanup+0xe/0x70
>   device_release_driver_internal+0x1d2/0x210
>   bus_remove_device+0xd7/0x150
>   device_del+0x155/0x3e0
>   device_unregister+0x13/0x60
>   devm_release_action+0x4d/0x90
>   ? __pfx_unregister_port+0x10/0x10 [cxl_core]
>   delete_endpoint+0x121/0x130 [cxl_core]
>   devres_release_all+0xb8/0x110
>   device_unbind_cleanup+0xe/0x70
>   device_release_driver_internal+0x1d2/0x210
>   bus_remove_device+0xd7/0x150
>   device_del+0x155/0x3e0
>   ? lock_release+0x142/0x290
>   cdev_device_del+0x15/0x50
>   cxl_memdev_unregister+0x54/0x70 [cxl_core]
> 
> This crash is due to the clearing out the cxl_memdev's driver context
> (@cxlds) before the subsystem is done with it. This is ultimately due to
> the region(s), that this memdev is a member, being torn down and expecting
> to be able to de-reference @cxlds, like here:
> 
> static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
> ...
>                 if (cxlds->rcd)
>                         goto endpoint_reset;
> ...
> 
> Fix it by keeping the driver context valid until memdev-device
> unregistration, and subsequently the entire stack of related
> dependencies, unwinds.
> 
> Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
> Reported-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/memdev.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index a950091e5640..e78b5ead14fa 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -570,8 +570,8 @@ static void cxl_memdev_unregister(void *_cxlmd)
>  	struct cxl_memdev *cxlmd = _cxlmd;
>  	struct device *dev = &cxlmd->dev;
>  
> -	cxl_memdev_shutdown(dev);
>  	cdev_device_del(&cxlmd->cdev, dev);
> +	cxl_memdev_shutdown(dev);
>  	put_device(dev);
>  }
>  
>
Davidlohr Bueso Oct. 3, 2023, 5:40 p.m. UTC | #4
On Fri, 29 Sep 2023, Dan Williams wrote:

>Ira reports that removing cxl_mock_mem causes a crash with the following
>trace:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000044
> [..]
> RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
> [..]
> Call Trace:
>  <TASK>
>  cxl_region_detach+0xe8/0x210 [cxl_core]
>  cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>  cxld_unregister+0x29/0x40 [cxl_core]
>  devres_release_all+0xb8/0x110
>  device_unbind_cleanup+0xe/0x70
>  device_release_driver_internal+0x1d2/0x210
>  bus_remove_device+0xd7/0x150
>  device_del+0x155/0x3e0
>  device_unregister+0x13/0x60
>  devm_release_action+0x4d/0x90
>  ? __pfx_unregister_port+0x10/0x10 [cxl_core]
>  delete_endpoint+0x121/0x130 [cxl_core]
>  devres_release_all+0xb8/0x110
>  device_unbind_cleanup+0xe/0x70
>  device_release_driver_internal+0x1d2/0x210
>  bus_remove_device+0xd7/0x150
>  device_del+0x155/0x3e0
>  ? lock_release+0x142/0x290
>  cdev_device_del+0x15/0x50
>  cxl_memdev_unregister+0x54/0x70 [cxl_core]
>
>This crash is due to the clearing out the cxl_memdev's driver context
>(@cxlds) before the subsystem is done with it. This is ultimately due to
>the region(s), that this memdev is a member, being torn down and expecting
>to be able to de-reference @cxlds, like here:
>
>static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>...
>                if (cxlds->rcd)
>                        goto endpoint_reset;
>...
>
>Fix it by keeping the driver context valid until memdev-device
>unregistration, and subsequently the entire stack of related
>dependencies, unwinds.
>
>Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
>Reported-by: Ira Weiny <ira.weiny@intel.com>
>Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
diff mbox series

Patch

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index a950091e5640..e78b5ead14fa 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -570,8 +570,8 @@  static void cxl_memdev_unregister(void *_cxlmd)
 	struct cxl_memdev *cxlmd = _cxlmd;
 	struct device *dev = &cxlmd->dev;
 
-	cxl_memdev_shutdown(dev);
 	cdev_device_del(&cxlmd->cdev, dev);
+	cxl_memdev_shutdown(dev);
 	put_device(dev);
 }