diff mbox series

cxl/mem: Fix shutdown order

Message ID 169596542307.790108.11339208844199665348.stgit@dwillia2-xfh.jf.intel.com
State Superseded
Headers show
Series cxl/mem: Fix shutdown order | expand

Commit Message

Dan Williams Sept. 29, 2023, 5:30 a.m. UTC
Ira reports that removing cxl_mock_mem causes a crash with the following
trace:

 BUG: kernel NULL pointer dereference, address: 0000000000000044
 [..]
 RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
 [..]
 Call Trace:
  <TASK>
  cxl_region_detach+0xe8/0x210 [cxl_core]
  cxl_decoder_kill_region+0x27/0x40 [cxl_core]
  cxld_unregister+0x29/0x40 [cxl_core]
  devres_release_all+0xb8/0x110
  device_unbind_cleanup+0xe/0x70
  device_release_driver_internal+0x1d2/0x210
  bus_remove_device+0xd7/0x150
  device_del+0x155/0x3e0
  device_unregister+0x13/0x60
  devm_release_action+0x4d/0x90
  ? __pfx_unregister_port+0x10/0x10 [cxl_core]
  delete_endpoint+0x121/0x130 [cxl_core]
  devres_release_all+0xb8/0x110
  device_unbind_cleanup+0xe/0x70
  device_release_driver_internal+0x1d2/0x210
  bus_remove_device+0xd7/0x150
  device_del+0x155/0x3e0
  ? lock_release+0x142/0x290
  cdev_device_del+0x15/0x50
  cxl_memdev_unregister+0x54/0x70 [cxl_core]

This crash is due to the clearing out the cxl_memdev's driver context
(@cxlds) before the driver is done with it. Fix it by keeping the driver
context valid until device unregistration completes.

Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
Reported-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/memdev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jonathan Cameron Sept. 29, 2023, 9:03 a.m. UTC | #1
On Thu, 28 Sep 2023 22:30:23 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Ira reports that removing cxl_mock_mem causes a crash with the following
> trace:
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000044
>  [..]
>  RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
>  [..]
>  Call Trace:
>   <TASK>
>   cxl_region_detach+0xe8/0x210 [cxl_core]
>   cxl_decoder_kill_region+0x27/0x40 [cxl_core]
>   cxld_unregister+0x29/0x40 [cxl_core]
>   devres_release_all+0xb8/0x110
>   device_unbind_cleanup+0xe/0x70
>   device_release_driver_internal+0x1d2/0x210
>   bus_remove_device+0xd7/0x150
>   device_del+0x155/0x3e0
>   device_unregister+0x13/0x60
>   devm_release_action+0x4d/0x90
>   ? __pfx_unregister_port+0x10/0x10 [cxl_core]
>   delete_endpoint+0x121/0x130 [cxl_core]
>   devres_release_all+0xb8/0x110
>   device_unbind_cleanup+0xe/0x70
>   device_release_driver_internal+0x1d2/0x210
>   bus_remove_device+0xd7/0x150
>   device_del+0x155/0x3e0
>   ? lock_release+0x142/0x290
>   cdev_device_del+0x15/0x50
>   cxl_memdev_unregister+0x54/0x70 [cxl_core]
> 
> This crash is due to the clearing out the cxl_memdev's driver context
> (@cxlds) before the driver is done with it. Fix it by keeping the driver
> context valid until device unregistration completes.
Ideally good to have an explicit statement of which bit of context
is used in the cdev_device_del() call.
> 
> Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
> Reported-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I'm not that keen on the fix, because it makes some incidental ordering
changes and the ordering in here is messy.  What you have feels like
it's fixing on symptom whereas there are other issues here.

Firstly, as a side note given I'm looking at this
code cxl_memdev_security_shutdown() should perhaps take a
struct cxl_memdev rather than the struct device that it then
uses just to get hold of the cxl_memdev available in
cxl_memdev_shutdown().  The equivalent setup code
cxl_memdev_security_init() already does take a struct
cxl_memdev so this would be desirable for symmetry if nothing else.

However, cxl_memdev_security_shutdown() doesn't actually undo
the stuff in cxl_memdev_security_init() which previously sent me
on a wild goose chase.

Offloading that put_santize callback here to devm looks dubious to me:
https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1031

.. for a function mid way through a bunch of stuff that is otherwise cleaned up
by cxl_memdev_unregister() via devm here
https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1066
seems like a recipe for future pain. Could we just move the put_santize()
call directly into cxl_memdev_unregister() and the error path in
cxl_memdev_add_dev() and make sure the order is as expected.

I had a long argument written up on why cxl_memdev_security_shutdown()
was in the wrong place in the new sequence before seeing that it
didn't balance with what I assumed it did.  So probably still in an illogical
place given I'd expect all the 'security stuff' to be initialized at same
point in setup and tear down sequences (reversed obviously.)

Thanks,

Jonathan






> ---
>  drivers/cxl/core/memdev.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 14b547c07f54..92d40c5e7efb 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -580,8 +580,8 @@ static void cxl_memdev_unregister(void *_cxlmd)
>  	struct cxl_memdev *cxlmd = _cxlmd;
>  	struct device *dev = &cxlmd->dev;
>  
> -	cxl_memdev_shutdown(dev);
>  	cdev_device_del(&cxlmd->cdev, dev);
> +	cxl_memdev_shutdown(dev);
>  	put_device(dev);
>  }
>  
>
Dan Williams Sept. 29, 2023, 4:27 p.m. UTC | #2
Jonathan Cameron wrote:
> On Thu, 28 Sep 2023 22:30:23 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Ira reports that removing cxl_mock_mem causes a crash with the following
> > trace:
> > 
> >  BUG: kernel NULL pointer dereference, address: 0000000000000044
> >  [..]
> >  RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core]
> >  [..]
> >  Call Trace:
> >   <TASK>
> >   cxl_region_detach+0xe8/0x210 [cxl_core]
> >   cxl_decoder_kill_region+0x27/0x40 [cxl_core]
> >   cxld_unregister+0x29/0x40 [cxl_core]
> >   devres_release_all+0xb8/0x110
> >   device_unbind_cleanup+0xe/0x70
> >   device_release_driver_internal+0x1d2/0x210
> >   bus_remove_device+0xd7/0x150
> >   device_del+0x155/0x3e0
> >   device_unregister+0x13/0x60
> >   devm_release_action+0x4d/0x90
> >   ? __pfx_unregister_port+0x10/0x10 [cxl_core]
> >   delete_endpoint+0x121/0x130 [cxl_core]
> >   devres_release_all+0xb8/0x110
> >   device_unbind_cleanup+0xe/0x70
> >   device_release_driver_internal+0x1d2/0x210
> >   bus_remove_device+0xd7/0x150
> >   device_del+0x155/0x3e0
> >   ? lock_release+0x142/0x290
> >   cdev_device_del+0x15/0x50
> >   cxl_memdev_unregister+0x54/0x70 [cxl_core]
> > 
> > This crash is due to the clearing out the cxl_memdev's driver context
> > (@cxlds) before the driver is done with it. Fix it by keeping the driver
> > context valid until device unregistration completes.
> Ideally good to have an explicit statement of which bit of context
> is used in the cdev_device_del() call.
> > 
> > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations")
> > Reported-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> I'm not that keen on the fix, because it makes some incidental ordering
> changes and the ordering in here is messy.  What you have feels like
> it's fixing on symptom whereas there are other issues here.
> 
> Firstly, as a side note given I'm looking at this
> code cxl_memdev_security_shutdown() should perhaps take a
> struct cxl_memdev rather than the struct device that it then

cxl_memdev_security_shutdown() is in the wrong place, it should just be
shutdown like any other optional capability.

> uses just to get hold of the cxl_memdev available in
> cxl_memdev_shutdown().  The equivalent setup code
> cxl_memdev_security_init() already does take a struct
> cxl_memdev so this would be desirable for symmetry if nothing else.
> 
> However, cxl_memdev_security_shutdown() doesn't actually undo
> the stuff in cxl_memdev_security_init() which previously sent me
> on a wild goose chase.

Yes, that needs to be cleaned up.

> 
> Offloading that put_santize callback here to devm looks dubious to me:
> https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1031
> 
> .. for a function mid way through a bunch of stuff that is otherwise cleaned up
> by cxl_memdev_unregister() via devm here
> https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1066
> seems like a recipe for future pain. Could we just move the put_santize()
> call directly into cxl_memdev_unregister() and the error path in
> cxl_memdev_add_dev() and make sure the order is as expected.
> 
> I had a long argument written up on why cxl_memdev_security_shutdown()
> was in the wrong place in the new sequence before seeing that it
> didn't balance with what I assumed it did.  So probably still in an illogical
> place given I'd expect all the 'security stuff' to be initialized at same
> point in setup and tear down sequences (reversed obviously.)

Yes, I tripped over that and should have mentioned the grumbling as I
held my nose and fixed the crash.
diff mbox series

Patch

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 14b547c07f54..92d40c5e7efb 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -580,8 +580,8 @@  static void cxl_memdev_unregister(void *_cxlmd)
 	struct cxl_memdev *cxlmd = _cxlmd;
 	struct device *dev = &cxlmd->dev;
 
-	cxl_memdev_shutdown(dev);
 	cdev_device_del(&cxlmd->cdev, dev);
+	cxl_memdev_shutdown(dev);
 	put_device(dev);
 }