
[v2] RDMA/cma: prevent rdma id destroy during cma_iw_handler

Message ID 20230612054237.1855292-1-shinichiro.kawasaki@wdc.com (mailing list archive)
State Accepted
Series [v2] RDMA/cma: prevent rdma id destroy during cma_iw_handler

Commit Message

Shinichiro Kawasaki June 12, 2023, 5:42 a.m. UTC
When rdma_destroy_id() and cma_iw_handler() race, struct rdma_id_private
*id_priv can be destroyed while cma_iw_handler() is still running. This
causes "BUG: KASAN: slab-use-after-free" at mutex_lock() in
cma_iw_handler() [1]. To prevent id_priv from being destroyed, hold a
reference on it by calling cma_id_get() at the start of cma_iw_handler()
and cma_id_put() at the end.

[1]

==================================================================
BUG: KASAN: slab-use-after-free in __mutex_lock+0x1324/0x18f0
Read of size 8 at addr ffff888197b37418 by task kworker/u8:0/9

CPU: 0 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0 #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Workqueue: iw_cm_wq cm_work_handler [iw_cm]
Call Trace:
 <TASK>
 dump_stack_lvl+0x57/0x90
 print_report+0xcf/0x660
 ? __mutex_lock+0x1324/0x18f0
 kasan_report+0xa4/0xe0
 ? __mutex_lock+0x1324/0x18f0
 __mutex_lock+0x1324/0x18f0
 ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
 ? _raw_spin_unlock_irqrestore+0x30/0x60
 ? rcu_is_watching+0x11/0xb0
 ? _raw_spin_unlock_irqrestore+0x30/0x60
 ? trace_hardirqs_on+0x12/0x100
 ? __pfx___mutex_lock+0x10/0x10
 ? __percpu_counter_sum+0x147/0x1e0
 ? domain_dirty_limits+0x246/0x390
 ? wb_over_bg_thresh+0x4d5/0x610
 ? rcu_is_watching+0x11/0xb0
 ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
 cma_iw_handler+0xac/0x4f0 [rdma_cm]
 ? rcu_is_watching+0x11/0xb0
 ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
 ? attach_entity_load_avg+0x4e2/0x920
 ? _raw_spin_unlock_irqrestore+0x30/0x60
 ? rcu_is_watching+0x11/0xb0
 cm_work_handler+0x139e/0x1c50 [iw_cm]
 ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
 ? rcu_is_watching+0x11/0xb0
 ? __pfx_try_to_wake_up+0x10/0x10
 ? __pfx_do_raw_spin_lock+0x10/0x10
 ? __pfx___might_resched+0x10/0x10
 ? _raw_spin_unlock_irq+0x24/0x50
 process_one_work+0x843/0x1350
 ? __pfx_lock_acquire+0x10/0x10
 ? __pfx_process_one_work+0x10/0x10
 ? __pfx_do_raw_spin_lock+0x10/0x10
 worker_thread+0xfc/0x1260
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x29e/0x340
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>

Allocated by task 4225:
 kasan_save_stack+0x2f/0x50
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0xa6/0xb0
 __rdma_create_id+0x5b/0x5d0 [rdma_cm]
 __rdma_create_kernel_id+0x12/0x40 [rdma_cm]
 nvme_rdma_alloc_queue+0x26a/0x5f0 [nvme_rdma]
 nvme_rdma_setup_ctrl+0xb84/0x1d90 [nvme_rdma]
 nvme_rdma_create_ctrl+0x7b5/0xd20 [nvme_rdma]
 nvmf_dev_write+0xddd/0x22b0 [nvme_fabrics]
 vfs_write+0x211/0xd50
 ksys_write+0x100/0x1e0
 do_syscall_64+0x5b/0x80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 4227:
 kasan_save_stack+0x2f/0x50
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x50
 ____kasan_slab_free+0x169/0x1c0
 slab_free_freelist_hook+0xdb/0x1b0
 __kmem_cache_free+0xb8/0x2e0
 nvme_rdma_free_queue+0x4a/0x70 [nvme_rdma]
 nvme_rdma_teardown_io_queues.part.0+0x14a/0x1e0 [nvme_rdma]
 nvme_rdma_delete_ctrl+0x4f/0x100 [nvme_rdma]
 nvme_do_delete_ctrl+0x14e/0x240 [nvme_core]
 nvme_sysfs_delete+0xcb/0x100 [nvme_core]
 kernfs_fop_write_iter+0x359/0x530
 vfs_write+0x58f/0xd50
 ksys_write+0x100/0x1e0
 do_syscall_64+0x5b/0x80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

The buggy address belongs to the object at ffff888197b37000
        which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1048 bytes inside of
        freed 2048-byte region [ffff888197b37000, ffff888197b37800)

The buggy address belongs to the physical page:
page:00000000fbe33a6e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x197b30
head:00000000fbe33a6e order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0010200 ffff888100042f00 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888197b37300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888197b37380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888197b37400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff888197b37480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888197b37500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: de910bd92137 ("RDMA/cma: Simplify locking needed for serialization of callbacks")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org
---
The KASAN BUG was observed with blktests test cases nvme/030 and nvme/031,
using the SIW transport [*]. To reproduce it on my test system, the test
cases need to be repeated 30 to 50 times.

[*] https://lore.kernel.org/linux-block/rsmmxrchy6voi5qhl4irss5sprna3f5owkqtvybxglcv2pnylm@xmrnpfu3tfpe/

Changes from v1:
* Improved the commit message per comments on the list

 drivers/infiniband/core/cma.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Leon Romanovsky June 12, 2023, 7:13 a.m. UTC | #1
On Mon, 12 Jun 2023 14:42:37 +0900, Shin'ichiro Kawasaki wrote:
> When rdma_destroy_id() and cma_iw_handler() race, struct rdma_id_private
> *id_priv can be destroyed during cma_iw_handler call. This causes "BUG:
> KASAN: slab-use-after-free" at mutex_lock() in cma_iw_handler() [1].
> To prevent the destroy of id_priv, keep its reference count by calling
> cma_id_get() and cma_id_put() at start and end of cma_iw_handler().
> 
> [1]
> 
> [...]

Applied, thanks!

[1/1] RDMA/cma: prevent rdma id destroy during cma_iw_handler
      https://git.kernel.org/rdma/rdma/c/fd06a5925e4773

Best regards,
Jason Gunthorpe June 12, 2023, 2:18 p.m. UTC | #2
On Mon, Jun 12, 2023 at 02:42:37PM +0900, Shin'ichiro Kawasaki wrote:
> When rdma_destroy_id() and cma_iw_handler() race, struct rdma_id_private
> *id_priv can be destroyed during cma_iw_handler call. This causes "BUG:
> KASAN: slab-use-after-free" at mutex_lock() in cma_iw_handler() [1].
> To prevent the destroy of id_priv, keep its reference count by calling
> cma_id_get() and cma_id_put() at start and end of cma_iw_handler().
> 
> [1]
> 
> ==================================================================
> BUG: KASAN: slab-use-after-free in __mutex_lock+0x1324/0x18f0
> Read of size 8 at addr ffff888197b37418 by task kworker/u8:0/9
> 
> CPU: 0 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0 #62
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
> Workqueue: iw_cm_wq cm_work_handler [iw_cm]
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x57/0x90
>  print_report+0xcf/0x660
>  ? __mutex_lock+0x1324/0x18f0
>  kasan_report+0xa4/0xe0
>  ? __mutex_lock+0x1324/0x18f0
>  __mutex_lock+0x1324/0x18f0
>  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
>  ? _raw_spin_unlock_irqrestore+0x30/0x60
>  ? rcu_is_watching+0x11/0xb0
>  ? _raw_spin_unlock_irqrestore+0x30/0x60
>  ? trace_hardirqs_on+0x12/0x100
>  ? __pfx___mutex_lock+0x10/0x10
>  ? __percpu_counter_sum+0x147/0x1e0
>  ? domain_dirty_limits+0x246/0x390
>  ? wb_over_bg_thresh+0x4d5/0x610
>  ? rcu_is_watching+0x11/0xb0
>  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
>  cma_iw_handler+0xac/0x4f0 [rdma_cm]

What is the full call chain here, eg with the static functions
un-inlined?
> 
>  drivers/infiniband/core/cma.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 93a1c48d0c32..c5267d9bb184 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -2477,6 +2477,7 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
>  	struct sockaddr *laddr = (struct sockaddr *)&iw_event->local_addr;
>  	struct sockaddr *raddr = (struct sockaddr *)&iw_event->remote_addr;
>  
> +	cma_id_get(id_priv);
>  	mutex_lock(&id_priv->handler_mutex);
>  	if (READ_ONCE(id_priv->state) != RDMA_CM_CONNECT)
>  		goto out;
> @@ -2524,12 +2525,14 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
>  	if (ret) {
>  		/* Destroy the CM ID by returning a non-zero value. */
>  		id_priv->cm_id.iw = NULL;
> +		cma_id_put(id_priv);
>  		destroy_id_handler_unlock(id_priv);
>  		return ret;
>  	}
>  
>  out:
>  	mutex_unlock(&id_priv->handler_mutex);
> +	cma_id_put(id_priv);
>  	return ret;
>  }

cm_work_handler already has a ref on the iwcm_id_private

I think there is likely some much larger issue with the IW CM if the
cm_id can be destroyed while the iwcm_id is in use? It is weird that
there are two id memories for this :\

Jason
Shinichiro Kawasaki June 13, 2023, 1:43 a.m. UTC | #3
On Jun 12, 2023 / 11:18, Jason Gunthorpe wrote:
> On Mon, Jun 12, 2023 at 02:42:37PM +0900, Shin'ichiro Kawasaki wrote:
> > When rdma_destroy_id() and cma_iw_handler() race, struct rdma_id_private
> > *id_priv can be destroyed during cma_iw_handler call. This causes "BUG:
> > KASAN: slab-use-after-free" at mutex_lock() in cma_iw_handler() [1].
> > To prevent the destroy of id_priv, keep its reference count by calling
> > cma_id_get() and cma_id_put() at start and end of cma_iw_handler().
> > 
> > [1]
> > 
> > ==================================================================
> > BUG: KASAN: slab-use-after-free in __mutex_lock+0x1324/0x18f0
> > Read of size 8 at addr ffff888197b37418 by task kworker/u8:0/9
> > 
> > CPU: 0 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0 #62
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
> > Workqueue: iw_cm_wq cm_work_handler [iw_cm]
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x57/0x90
> >  print_report+0xcf/0x660
> >  ? __mutex_lock+0x1324/0x18f0
> >  kasan_report+0xa4/0xe0
> >  ? __mutex_lock+0x1324/0x18f0
> >  __mutex_lock+0x1324/0x18f0
> >  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
> >  ? _raw_spin_unlock_irqrestore+0x30/0x60
> >  ? rcu_is_watching+0x11/0xb0
> >  ? _raw_spin_unlock_irqrestore+0x30/0x60
> >  ? trace_hardirqs_on+0x12/0x100
> >  ? __pfx___mutex_lock+0x10/0x10
> >  ? __percpu_counter_sum+0x147/0x1e0
> >  ? domain_dirty_limits+0x246/0x390
> >  ? wb_over_bg_thresh+0x4d5/0x610
> >  ? rcu_is_watching+0x11/0xb0
> >  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
> >  cma_iw_handler+0xac/0x4f0 [rdma_cm]
> 
> What is the full call chain here, eg with the static functions
> un-inlined?

I checked the inlined function call chain from cm_work_handler to cma_iw_handler
(I recreated the symptom using kernel v6.4-rc5, so the addresses differ):

$ ./scripts/faddr2line ./drivers/infiniband/core/iw_cm.ko cm_work_handler+0xb57/0x1c50
cm_work_handler+0xb57/0x1c50:
cm_close_handler at /home/shin/Linux/linux/drivers/infiniband/core/iwcm.c:974
(inlined by) process_event at /home/shin/Linux/linux/drivers/infiniband/core/iwcm.c:997
(inlined by) cm_work_handler at /home/shin/Linux/linux/drivers/infiniband/core/iwcm.c:1036

With this, my understanding of the full call chain from the NVMe driver to
cma_iw_handler is as follows, including the task switch to cm_work_handler
(a simplified sketch follows the chain):

nvme_rdma_teardown_io_queues
 nvme_rdma_stop_io_queues
  nvme_rdma_stop_queue
   __nvme_rdma_stop_queue
    rdma_disconnect
     iw_cm_disconnect
      iwcm_modify_qp_sqd
       ib_modify_qp
        _ib_modify_qp
         ib_security_modify_qp
          siw_verbs_modify_qp
           siw_qp_modify
            siw_qp_cm_drop
             siw_cm_upcall(IW_CM_EVENT_CLOSE)
               cm_event_handler -> refcount_inc(&cm_id_priv->refcount)
               queue_work
      -> cm_work_handler
          process_event
           cm_close_handler
             cm_id_priv->id.cm_handler
             cma_iw_handler
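
A reduced sketch of the hand-off at the end of this chain (not the actual
iwcm.c code: the structures below are cut down to what matters for the race,
system_wq stands in for iw_cm_wq, and the *_sketch names are made up for
illustration):

#include <linux/container_of.h>
#include <linux/refcount.h>
#include <linux/workqueue.h>
#include <rdma/iw_cm.h>

/* Cut-down model of struct iwcm_id_private from iwcm.c. */
struct iwcm_id_sketch {
        struct iw_cm_id         id;     /* id.context -> rdma_id_private */
        refcount_t              refcount;
        struct work_struct      work;   /* stands in for the iwcm work list */
};

/* siw_cm_upcall() -> cm_event_handler(): pin the iwcm id and defer the event. */
static int cm_event_handler_sketch(struct iwcm_id_sketch *cm_id_priv)
{
        refcount_inc(&cm_id_priv->refcount);    /* protects the iwcm id only */
        queue_work(system_wq, &cm_id_priv->work);
        return 0;
}

/* Later, in the workqueue context. */
static void cm_work_handler_sketch(struct work_struct *work)
{
        struct iwcm_id_sketch *cm_id_priv =
                container_of(work, struct iwcm_id_sketch, work);

        /*
         * process_event() -> cm_close_handler() ends in id.cm_handler(),
         * i.e. cma_iw_handler().  Nothing on this path holds a reference
         * on the rdma_id_private behind id.context, so rdma_destroy_id()
         * on another CPU can free it while the handler is running.
         */
        cm_id_priv->id.cm_handler(&cm_id_priv->id, NULL);
}

This is also why Jason notes above that cm_work_handler already has a
reference -- but only on the iwcm_id_private, not on the rdma_id_private.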

> > 
> >  drivers/infiniband/core/cma.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> > index 93a1c48d0c32..c5267d9bb184 100644
> > --- a/drivers/infiniband/core/cma.c
> > +++ b/drivers/infiniband/core/cma.c
> > @@ -2477,6 +2477,7 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
> >  	struct sockaddr *laddr = (struct sockaddr *)&iw_event->local_addr;
> >  	struct sockaddr *raddr = (struct sockaddr *)&iw_event->remote_addr;
> >  
> > +	cma_id_get(id_priv);
> >  	mutex_lock(&id_priv->handler_mutex);
> >  	if (READ_ONCE(id_priv->state) != RDMA_CM_CONNECT)
> >  		goto out;
> > @@ -2524,12 +2525,14 @@ static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
> >  	if (ret) {
> >  		/* Destroy the CM ID by returning a non-zero value. */
> >  		id_priv->cm_id.iw = NULL;
> > +		cma_id_put(id_priv);
> >  		destroy_id_handler_unlock(id_priv);
> >  		return ret;
> >  	}
> >  
> >  out:
> >  	mutex_unlock(&id_priv->handler_mutex);
> > +	cma_id_put(id_priv);
> >  	return ret;
> >  }
> 
> cm_work_handler already has a ref on the iwcm_id_private
> 
> I think there is likely some much larger issue with the IW CM if the
> cm_id can be destroyed while the iwcm_id is in use? It is weird that
> there are two id memories for this :\

My understanding of the call chain to the rdma id destroy is as follows (a
simplified sketch follows the chain). I guess _destroy_id calls
iw_destroy_cm_id before destroying the rdma id, but I am not sure why it does
not wait for the cm_id deref by cm_work_handler.

nvme_rdma_teardown_io_queues
 nvme_rdma_stop_io_queues -> chained to cma_iw_handler
 nvme_rdma_free_io_queues
  nvme_rdma_free_queue
   rdma_destroy_id
    mutex_lock(&id_priv->handler_mutex)
    destroy_id_handler_unlock
     mutex_unlock(&id_priv->handler_mutex)
      _destroy_id
       iw_destroy_cm_id
        wait_for_completion(&id_priv->comp)
       kfree(id_priv)
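
To make the interaction between the two chains concrete, here is a heavily
simplified model of the refcount/completion mechanics. This is not the actual
cma.c code: the *_sketch names are made up, and only the pieces needed to show
why the extra cma_id_get()/cma_id_put() defers the kfree() are kept.

#include <linux/completion.h>
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/slab.h>

/* Cut-down model of struct rdma_id_private from cma.c. */
struct rdma_id_sketch {
        struct mutex            handler_mutex;
        refcount_t              refcount;
        struct completion       comp;
};

static void cma_id_get_sketch(struct rdma_id_sketch *id_priv)
{
        refcount_inc(&id_priv->refcount);
}

static void cma_id_put_sketch(struct rdma_id_sketch *id_priv)
{
        if (refcount_dec_and_test(&id_priv->refcount))
                complete(&id_priv->comp);       /* lets the destroy path proceed */
}

/* Roughly what the tail of _destroy_id() boils down to. */
static void destroy_id_sketch(struct rdma_id_sketch *id_priv)
{
        /* iw_destroy_cm_id() was already called and returned without
         * waiting for a cm_work_handler that is still in flight. */
        cma_id_put_sketch(id_priv);             /* drop the creation reference */
        wait_for_completion(&id_priv->comp);    /* waits for rdma_cm refs only */
        kfree(id_priv);
}

/* The handler side with the fix applied. */
static void iw_handler_sketch(struct rdma_id_sketch *id_priv)
{
        cma_id_get_sketch(id_priv);             /* added by this patch */
        mutex_lock(&id_priv->handler_mutex);    /* the KASAN hit was here */
        /* ... handle the event ... */
        mutex_unlock(&id_priv->handler_mutex);
        cma_id_put_sketch(id_priv);             /* may be the final put */
}

With the extra reference taken in the handler, wait_for_completion() in the
destroy path cannot return, and kfree() cannot run, until the handler has done
its final put. Without it, nothing in the destroy chain above waits for
cma_iw_handler().
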
Shinichiro Kawasaki June 13, 2023, 6:47 a.m. UTC | #4
On Jun 13, 2023 / 10:43, Shin'ichiro Kawasaki wrote:

[...]

> My understanding about the call chain to rdma id destroy is as follows. I guess
> _destory_id calls iw_destory_cm_id before destroying the rdma id, but not sure
> why it does not wait for cm_id deref by cm_work_handler.

I found that commit 59c68ac31e15 ("iw_cm: free cm_id resources on the last
deref") looks related to the issue. Its commit message says that it "removed
the block in iw_destroy_cm_id()" to avoid a potential deadlock. So it looks
like the cm_id reference no longer guards the rdma id destroy. I'm not sure
how to fix this issue at the IW CM level.
Jason Gunthorpe June 13, 2023, 1:30 p.m. UTC | #5
On Tue, Jun 13, 2023 at 01:43:43AM +0000, Shinichiro Kawasaki wrote:
> > I think there is likely some much larger issue with the IW CM if the
> > cm_id can be destroyed while the iwcm_id is in use? It is weird that
> > there are two id memories for this :\
> 
> My understanding about the call chain to rdma id destroy is as follows. I guess
> _destory_id calls iw_destory_cm_id before destroying the rdma id, but not sure
> why it does not wait for cm_id deref by cm_work_handler.
> 
> nvme_rdma_teardown_io_queueus
>  nvme_rdma_stop_io_queues -> chained to cma_iw_handler
>  nvme_rdma_free_io_queues
>   nvme_rdma_free_queue
>    rdma_destroy_id
>     mutex_lock(&id_priv->handler_mutex)
>     destroy_id_handler_unlock
>      mutex_unlock(&id_priv->handler_mutex)
>      _destory_id
>        iw_destroy_cm_id
>        wait_for_completiion(&id_priv->comp)
>        kfree(id_priv)

Once a destroy_cm_id() has returned, that layer is no longer permitted to
run or be running in its handlers. The iw cm is broken if it allows this,
and that is the cause of the bug.

Taking more refs within handlers that are already not allowed to be
running is just racy.

Jason
Leon Romanovsky June 13, 2023, 6:07 p.m. UTC | #6
On Tue, Jun 13, 2023 at 10:30:37AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 13, 2023 at 01:43:43AM +0000, Shinichiro Kawasaki wrote:
> > > I think there is likely some much larger issue with the IW CM if the
> > > cm_id can be destroyed while the iwcm_id is in use? It is weird that
> > > there are two id memories for this :\
> > 
> > My understanding about the call chain to rdma id destroy is as follows. I guess
> > _destory_id calls iw_destory_cm_id before destroying the rdma id, but not sure
> > why it does not wait for cm_id deref by cm_work_handler.
> > 
> > nvme_rdma_teardown_io_queueus
> >  nvme_rdma_stop_io_queues -> chained to cma_iw_handler
> >  nvme_rdma_free_io_queues
> >   nvme_rdma_free_queue
> >    rdma_destroy_id
> >     mutex_lock(&id_priv->handler_mutex)
> >     destroy_id_handler_unlock
> >      mutex_unlock(&id_priv->handler_mutex)
> >      _destory_id
> >        iw_destroy_cm_id
> >        wait_for_completiion(&id_priv->comp)
> >        kfree(id_priv)
> 
> Once a destroy_cm_id() has returned that layer is no longer
> permitted to run or be running in its handlers. The iw cm is broken if
> it allows this, and that is the cause of the bug.
> 
> Taking more refs within handlers that are already not allowed to be
> running is just racy.

So we need to revert that patch from our rdma-rc.

Thanks

> 
> Jason
Shinichiro Kawasaki June 14, 2023, 7:53 a.m. UTC | #7
On Jun 13, 2023 / 21:07, Leon Romanovsky wrote:
> On Tue, Jun 13, 2023 at 10:30:37AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 13, 2023 at 01:43:43AM +0000, Shinichiro Kawasaki wrote:
> > > > I think there is likely some much larger issue with the IW CM if the
> > > > cm_id can be destroyed while the iwcm_id is in use? It is weird that
> > > > there are two id memories for this :\
> > > 
> > > My understanding about the call chain to rdma id destroy is as follows. I guess
> > > _destory_id calls iw_destory_cm_id before destroying the rdma id, but not sure
> > > why it does not wait for cm_id deref by cm_work_handler.
> > > 
> > > nvme_rdma_teardown_io_queueus
> > >  nvme_rdma_stop_io_queues -> chained to cma_iw_handler
> > >  nvme_rdma_free_io_queues
> > >   nvme_rdma_free_queue
> > >    rdma_destroy_id
> > >     mutex_lock(&id_priv->handler_mutex)
> > >     destroy_id_handler_unlock
> > >      mutex_unlock(&id_priv->handler_mutex)
> > >      _destory_id
> > >        iw_destroy_cm_id
> > >        wait_for_completiion(&id_priv->comp)
> > >        kfree(id_priv)
> > 
> > Once a destroy_cm_id() has returned that layer is no longer
> > permitted to run or be running in its handlers. The iw cm is broken if
> > it allows this, and that is the cause of the bug.
> > 
> > Taking more refs within handlers that are already not allowed to be
> > running is just racy.
> 
> So we need to revert that patch from our rdma-rc.

I see, thanks for the clarifications.

As another fix approach, I reverted the commit 59c68ac31e15 ("iw_cm: free cm_id
resources on the last deref") so that iw_destroy_cm_id() waits for deref of
cm_id. With that revert, the KASAN slab-use-after-free disappeared. Is this
the right fix approach?
Jason Gunthorpe June 14, 2023, 5:36 p.m. UTC | #8
On Wed, Jun 14, 2023 at 07:53:49AM +0000, Shinichiro Kawasaki wrote:
> On Jun 13, 2023 / 21:07, Leon Romanovsky wrote:
> > On Tue, Jun 13, 2023 at 10:30:37AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 13, 2023 at 01:43:43AM +0000, Shinichiro Kawasaki wrote:
> > > > > I think there is likely some much larger issue with the IW CM if the
> > > > > cm_id can be destroyed while the iwcm_id is in use? It is weird that
> > > > > there are two id memories for this :\
> > > > 
> > > > My understanding about the call chain to rdma id destroy is as follows. I guess
> > > > _destory_id calls iw_destory_cm_id before destroying the rdma id, but not sure
> > > > why it does not wait for cm_id deref by cm_work_handler.
> > > > 
> > > > nvme_rdma_teardown_io_queueus
> > > >  nvme_rdma_stop_io_queues -> chained to cma_iw_handler
> > > >  nvme_rdma_free_io_queues
> > > >   nvme_rdma_free_queue
> > > >    rdma_destroy_id
> > > >     mutex_lock(&id_priv->handler_mutex)
> > > >     destroy_id_handler_unlock
> > > >      mutex_unlock(&id_priv->handler_mutex)
> > > >      _destory_id
> > > >        iw_destroy_cm_id
> > > >        wait_for_completiion(&id_priv->comp)
> > > >        kfree(id_priv)
> > > 
> > > Once a destroy_cm_id() has returned that layer is no longer
> > > permitted to run or be running in its handlers. The iw cm is broken if
> > > it allows this, and that is the cause of the bug.
> > > 
> > > Taking more refs within handlers that are already not allowed to be
> > > running is just racy.
> > 
> > So we need to revert that patch from our rdma-rc.
> 
> I see, thanks for the clarifications.
> 
> As another fix approach, I reverted the commit 59c68ac31e15 ("iw_cm: free cm_id
> resources on the last deref") so that iw_destroy_cm_id() waits for deref of
> cm_id. With that revert, the KASAN slab-use-after-free disappeared. Is this
> the right fix approach?

That seems like it would bring back the bug it was fixing, though it
isn't totally clear what that bug is.

There is something wrong with the iwarp cm if it is destroying IDs in
handlers; the IB cm avoids doing that to avoid the deadlock, and the same
solution will be needed for iwarp too.

Also, the code this patch removed is quite ugly; if we are going back
to waiting, it should be written in a more modern way without the test
bit and so on.

Jason
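
For illustration only, one possible shape of the completion-based wait Jason
alludes to -- this is not the actual iw_cm code and not a proposed patch, and
every *_sketch name here is made up:

#include <linux/completion.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct iwcm_id_wait_sketch {
        refcount_t              refcount;
        struct completion       destroy_comp;
        /* ... remaining iwcm state ... */
};

static void iwcm_deref_id_sketch(struct iwcm_id_wait_sketch *cm_id_priv)
{
        if (refcount_dec_and_test(&cm_id_priv->refcount))
                complete(&cm_id_priv->destroy_comp);
}

static void iw_destroy_cm_id_sketch(struct iwcm_id_wait_sketch *cm_id_priv)
{
        /* ... disconnect, cancel or flush queued work items ... */
        iwcm_deref_id_sketch(cm_id_priv);               /* drop the creation ref */
        wait_for_completion(&cm_id_priv->destroy_comp); /* outlive all handlers */
        kfree(cm_id_priv);
}

Whether such a wait can be added without reintroducing the deadlock that
commit 59c68ac31e15 was avoiding is exactly the open question in the rest of
this thread.
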
Shinichiro Kawasaki June 15, 2023, 12:45 a.m. UTC | #9
On Jun 14, 2023 / 14:36, Jason Gunthorpe wrote:
> On Wed, Jun 14, 2023 at 07:53:49AM +0000, Shinichiro Kawasaki wrote:
[...]
> > As another fix approach, I reverted the commit 59c68ac31e15 ("iw_cm: free cm_id
> > resources on the last deref") so that iw_destroy_cm_id() waits for deref of
> > cm_id. With that revert, the KASAN slab-use-after-free disappeared. Is this
> > the right fix approach?
> 
> That seems like it would bring back the bug it was fixing, though it
> isn't totally clear what that is
> 
> There is something wrong with the iwarp cm if it is destroying IDs in
> handlers, IB cm avoids doing that to avoid the deadlock, the same
> solution will be needed for iwarp too.
> 
> Also the code this patch removed is quite ugly, if we are going back
> to waiting it should be written in a more modern way without the test
> bit and so on.

I see, thanks for the clarifications. I took a look at ib_destroy_cm_id() and
found that it works differently from iw_destroy_cm_id(). I guess a certain
amount of change will be required for a proper fix in the iwarp code.

I will stop my fix attempt here, since it looks beyond my bandwidth now. If
anyone provides fix patches, I'm willing to test them.
Shinichiro Kawasaki Sept. 5, 2023, 12:39 a.m. UTC | #10
On Jun 15, 2023 / 09:45, Shin'ichiro Kawasaki wrote:
[...]
> I stop my fix attempt here, since it looks beyond my bandwidth now. If anyone
> provides fix patches, I'm willing to test them.

FYI, I tried to recreate this failure on my test system this week and found
that it is no longer reproducible. In case your TODO list includes a work item
for this issue, please give it lower priority.

Two months ago, the failure had been recreated by repeating the blktests test
cases nvme/030 and nvme/031 with the SIW transport around 30 times. Now I do
not see it even when I repeat them 100 times. I suspected some kernel change
between v6.4 and v6.5 had fixed the issue, but the failure is no longer seen
with either kernel version. I also reverted nvme-cli and libnvme to older
versions, but still do not see the failure. I guess some change on my test
system affected the symptom (Fedora 38 on QEMU).
Daniel Wagner Sept. 6, 2023, 8:54 a.m. UTC | #11
On Tue, Sep 05, 2023 at 12:39:38AM +0000, Shinichiro Kawasaki wrote:
> Two month ago, the failure had been recreated by repeating the blktests test
> cases nvme/030 and nvme/031 with SIW tranposrt around 30 times. Now I do not see
> it when I repeat them 100 times. I suspected some kernel changes between v6.4
> and v6.5 would have fixed the issue, but even when I use both kernel versions,
> the failure is no longer seen. I reverted nvme-cli and libnvme to older version,
> but still do not see the failure. I just guess some change on my test system
> affected the symptom (Fedora 38 on QEMU).

Maybe the refactoring in blktests could also be the reason the tests are
not working? Just an idea.
Shinichiro Kawasaki Sept. 6, 2023, 11:31 a.m. UTC | #12
On Sep 06, 2023 / 10:54, Daniel Wagner wrote:
> On Tue, Sep 05, 2023 at 12:39:38AM +0000, Shinichiro Kawasaki wrote:
> > Two month ago, the failure had been recreated by repeating the blktests test
> > cases nvme/030 and nvme/031 with SIW tranposrt around 30 times. Now I do not see
> > it when I repeat them 100 times. I suspected some kernel changes between v6.4
> > and v6.5 would have fixed the issue, but even when I use both kernel versions,
> > the failure is no longer seen. I reverted nvme-cli and libnvme to older version,
> > but still do not see the failure. I just guess some change on my test system
> > affected the symptom (Fedora 38 on QEMU).
> 
> Maybe the refactoring in blktests could also be the reason the tests are
> not working? Just an idea.

That is an idea. I rolled back blktests to the git revision from two months
ago, but the failure is still not recreated. Hmm.
Yi Zhang Sept. 7, 2023, 7:47 a.m. UTC | #13
It seems I reproduced this issue [3] when I ran [1] on the latest
linux-block/for-next with kernel config [2].

[1]nvme_trtype=rdma ./check nvme/

[2] https://pastebin.com/u9fh7dbX

[3]
[  387.880016] run blktests nvme/003 at 2023-09-07 03:31:15
[  388.173979] (null): rxe_set_mtu: Set mtu to 1024
[  388.310460] infiniband eno1_rxe: set active
[  388.310495] infiniband eno1_rxe: added eno1
[  388.708380] loop0: detected capacity change from 0 to 2097152
[  388.767800] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  388.839397] nvmet_rdma: enabling port 0 (10.16.221.26:4420)
[  389.053251] nvmet: creating discovery controller 1 for subsystem
nqn.2014-08.org.nvmexpress.discovery for NQN
nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[  389.061415] nvme nvme2: new ctrl: NQN
"nqn.2014-08.org.nvmexpress.discovery", addr 10.16.221.26:4420
[  402.202057] nvme nvme2: Removing ctrl: NQN
"nqn.2014-08.org.nvmexpress.discovery"

[  402.251605] ======================================================
[  402.257782] WARNING: possible circular locking dependency detected
[  402.263963] 6.5.0+ #1 Not tainted
[  402.267280] ------------------------------------------------------
[  402.273459] kworker/5:0/48 is trying to acquire lock:
[  402.278512] ffff88831a9b6450 (&id_priv->handler_mutex){+.+.}-{3:3},
at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.288279]
               but task is already holding lock:
[  402.294112] ffffc90000687da0
((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
process_one_work+0x83a/0x1660
[  402.304729]
               which lock already depends on the new lock.

[  402.312903]
               the existing dependency chain (in reverse order) is:
[  402.320380]
               -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[  402.328561]        __lock_acquire+0xbd6/0x1c00
[  402.333015]        lock_acquire+0x1da/0x5e0
[  402.337203]        process_one_work+0x884/0x1660
[  402.341822]        worker_thread+0x5be/0xef0
[  402.346093]        kthread+0x2f4/0x3d0
[  402.349845]        ret_from_fork+0x30/0x70
[  402.353947]        ret_from_fork_asm+0x1b/0x30
[  402.358391]
               -> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
[  402.365186]        __lock_acquire+0xbd6/0x1c00
[  402.369631]        lock_acquire+0x1da/0x5e0
[  402.373818]        __flush_workqueue+0x101/0x1230
[  402.378524]        nvmet_rdma_queue_connect+0x106/0x4b0 [nvmet_rdma]
[  402.384885]        cma_cm_event_handler+0xf2/0x430 [rdma_cm]
[  402.390562]        cma_ib_req_handler+0x93a/0x1cd0 [rdma_cm]
[  402.396238]        cm_process_work+0x48/0x3b0 [ib_cm]
[  402.401307]        cm_req_handler+0xf9b/0x2610 [ib_cm]
[  402.406465]        cm_work_handler+0x80a/0x1380 [ib_cm]
[  402.411708]        process_one_work+0x955/0x1660
[  402.416326]        worker_thread+0x5be/0xef0
[  402.420600]        kthread+0x2f4/0x3d0
[  402.424351]        ret_from_fork+0x30/0x70
[  402.428450]        ret_from_fork_asm+0x1b/0x30
[  402.432897]
               -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
[  402.439874]        __lock_acquire+0xbd6/0x1c00
[  402.444320]        lock_acquire+0x1da/0x5e0
[  402.448504]        __mutex_lock+0x154/0x1490
[  402.452776]        cma_ib_req_handler+0x4f8/0x1cd0 [rdma_cm]
[  402.458455]        cm_process_work+0x48/0x3b0 [ib_cm]
[  402.463525]        cm_req_handler+0xf9b/0x2610 [ib_cm]
[  402.468681]        cm_work_handler+0x80a/0x1380 [ib_cm]
[  402.473925]        process_one_work+0x955/0x1660
[  402.478543]        worker_thread+0x5be/0xef0
[  402.482817]        kthread+0x2f4/0x3d0
[  402.486570]        ret_from_fork+0x30/0x70
[  402.490668]        ret_from_fork_asm+0x1b/0x30
[  402.495113]
               -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
[  402.501908]        check_prev_add+0x1bd/0x23a0
[  402.506352]        validate_chain+0xb02/0xf30
[  402.510713]        __lock_acquire+0xbd6/0x1c00
[  402.515159]        lock_acquire+0x1da/0x5e0
[  402.519344]        __mutex_lock+0x154/0x1490
[  402.523617]        rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.528774]        nvmet_rdma_free_queue+0x7b/0x380 [nvmet_rdma]
[  402.534790]        nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[  402.541408]        process_one_work+0x955/0x1660
[  402.546029]        worker_thread+0x5be/0xef0
[  402.550303]        kthread+0x2f4/0x3d0
[  402.554054]        ret_from_fork+0x30/0x70
[  402.558153]        ret_from_fork_asm+0x1b/0x30
[  402.562599]
               other info that might help us debug this:

[  402.570599] Chain exists of:
                 &id_priv->handler_mutex --> (wq_completion)nvmet-wq
--> (work_completion)(&queue->release_work)

[  402.584854]  Possible unsafe locking scenario:

[  402.590775]        CPU0                    CPU1
[  402.595307]        ----                    ----
[  402.599839]   lock((work_completion)(&queue->release_work));
[  402.605508]                                lock((wq_completion)nvmet-wq);
[  402.612300]
lock((work_completion)(&queue->release_work));
[  402.620475]   lock(&id_priv->handler_mutex);
[  402.624755]
                *** DEADLOCK ***

[  402.630675] 2 locks held by kworker/5:0/48:
[  402.634860]  #0: ffff8883070c9d48
((wq_completion)nvmet-wq){+.+.}-{0:0}, at:
process_one_work+0x80c/0x1660
[  402.644524]  #1: ffffc90000687da0
((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
process_one_work+0x83a/0x1660
[  402.655572]
               stack backtrace:
[  402.659933] CPU: 5 PID: 48 Comm: kworker/5:0 Not tainted 6.5.0+ #1
[  402.666110] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS
2.11.4 03/22/2023
[  402.673763] Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma]
[  402.680646] Call Trace:
[  402.683099]  <TASK>
[  402.685205]  dump_stack_lvl+0x60/0xb0
[  402.688879]  check_noncircular+0x2f9/0x3e0
[  402.692982]  ? __pfx_check_noncircular+0x10/0x10
[  402.697609]  ? srso_return_thunk+0x5/0x10
[  402.701635]  ? srso_return_thunk+0x5/0x10
[  402.705649]  ? alloc_chain_hlocks+0x1de/0x520
[  402.710010]  check_prev_add+0x1bd/0x23a0
[  402.713944]  ? mark_lock.part.0+0xca/0xab0
[  402.718054]  validate_chain+0xb02/0xf30
[  402.721909]  ? __pfx_validate_chain+0x10/0x10
[  402.726274]  ? _raw_spin_unlock_irq+0x24/0x50
[  402.730635]  ? lockdep_hardirqs_on+0x79/0x100
[  402.735000]  ? srso_return_thunk+0x5/0x10
[  402.739012]  ? _raw_spin_unlock_irq+0x2f/0x50
[  402.743380]  __lock_acquire+0xbd6/0x1c00
[  402.747321]  lock_acquire+0x1da/0x5e0
[  402.750991]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.755805]  ? __pfx_lock_acquire+0x10/0x10
[  402.759997]  ? do_raw_spin_trylock+0xb5/0x180
[  402.764363]  ? srso_return_thunk+0x5/0x10
[  402.768381]  ? srso_return_thunk+0x5/0x10
[  402.772402]  __mutex_lock+0x154/0x1490
[  402.776159]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.780973]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.785782]  ? __pfx___mutex_lock+0x10/0x10
[  402.789981]  ? srso_return_thunk+0x5/0x10
[  402.793992]  ? rcu_is_watching+0x11/0xb0
[  402.797929]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.802738]  rdma_destroy_id+0x17/0x20 [rdma_cm]
[  402.807377]  nvmet_rdma_free_queue+0x7b/0x380 [nvmet_rdma]
[  402.812875]  nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[  402.818972]  process_one_work+0x955/0x1660
[  402.823078]  ? __lock_acquired+0x207/0x7b0
[  402.827177]  ? __pfx_process_one_work+0x10/0x10
[  402.831711]  ? __pfx___lock_acquired+0x10/0x10
[  402.836164]  ? worker_thread+0x15a/0xef0
[  402.840095]  worker_thread+0x5be/0xef0
[  402.843862]  ? __pfx_worker_thread+0x10/0x10
[  402.848134]  kthread+0x2f4/0x3d0
[  402.851366]  ? __pfx_kthread+0x10/0x10
[  402.855121]  ret_from_fork+0x30/0x70
[  402.858704]  ? __pfx_kthread+0x10/0x10
[  402.862459]  ret_from_fork_asm+0x1b/0x30
[  402.866398]  </TASK>
[  403.865323] rdma_rxe: unloaded
[  405.242319] rdma_rxe: loaded
[  405.272491] loop: module loaded
[  405.352101] run blktests nvme/004 at 2023-09-07 03:31:32
[  405.504367] (null): rxe_set_mtu: Set mtu to 1024
[  405.587145] infiniband eno1_rxe: set active
[  405.595981] infiniband eno1_rxe: added eno1
[  405.714913] loop0: detected capacity change from 0 to 2097152
[  405.746602] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  405.784747] nvmet_rdma: enabling port 0 (10.16.221.26:4420)
[  405.876322] nvmet: creating nvm controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[  405.894750] nvme nvme2: creating 16 I/O queues.
[  406.181403] nvme nvme2: mapped 16/0/0 default/read/poll queues.
[  406.216348] nvme nvme2: new ctrl: NQN "blktests-subsystem-1", addr
10.16.221.26:4420
[  406.552633] nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
[  407.879250] rdma_rxe: unloaded
[  409.146426] rdma_rxe: loaded
[  409.176282] loop: module loaded
[  409.460211] rdma_rxe: unloaded
[  410.938596] rdma_rxe: loaded
[  410.966908] loop: module loaded


On Wed, Sep 6, 2023 at 7:33 PM Shinichiro Kawasaki
<shinichiro.kawasaki@wdc.com> wrote:
>
> On Sep 06, 2023 / 10:54, Daniel Wagner wrote:
> > On Tue, Sep 05, 2023 at 12:39:38AM +0000, Shinichiro Kawasaki wrote:
> > > Two month ago, the failure had been recreated by repeating the blktests test
> > > cases nvme/030 and nvme/031 with SIW tranposrt around 30 times. Now I do not see
> > > it when I repeat them 100 times. I suspected some kernel changes between v6.4
> > > and v6.5 would have fixed the issue, but even when I use both kernel versions,
> > > the failure is no longer seen. I reverted nvme-cli and libnvme to older version,
> > > but still do not see the failure. I just guess some change on my test system
> > > affected the symptom (Fedora 38 on QEMU).
> >
> > Maybe the refactoring in blktests could also be the reason the tests are
> > not working? Just an idea.
>
> That is an idea. I rolled back blktests to the git revision of two months ago,
> but still the failure is not recreated. Hmm.
>
Jason Gunthorpe Sept. 7, 2023, 4:17 p.m. UTC | #14
On Thu, Sep 07, 2023 at 03:47:03PM +0800, Yi Zhang wrote:
> Seems I reproduced this issue[3] when I run [1] on the latest
> linux-block/for-next with kernel config[2].

https://lore.kernel.org/linux-rdma/13441b9b-cc13-f0e0-bd46-f14983dadd49@grimberg.me/

So far nobody has come with a solution

Jason
Hannes Reinecke Sept. 8, 2023, 7:18 a.m. UTC | #15
On 9/7/23 09:47, Yi Zhang wrote:
> Seems I reproduced this issue[3] when I run [1] on the latest
> linux-block/for-next with kernel config[2].
> 
> [1]nvme_trtype=rdma ./check nvme/
> 
> [2] https://pastebin.com/u9fh7dbX
> 
But isn't this the same issue as the circular locking discussed earlier?

Can you check if the patch
'nvmet-tcp: avoid circular locking dependency on install_queue()'
fixes this issue, too?

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
Yi Zhang Sept. 8, 2023, 8:33 a.m. UTC | #16
On Fri, Sep 8, 2023 at 3:18 PM Hannes Reinecke <hare@suse.de> wrote:
>
> On 9/7/23 09:47, Yi Zhang wrote:
> > Seems I reproduced this issue[3] when I run [1] on the latest
> > linux-block/for-next with kernel config[2].
> >
> > [1]nvme_trtype=rdma ./check nvme/
> >
> > [2] https://pastebin.com/u9fh7dbX
> >
> But isn't this the same issue as the circular locking discussed earlier?

Sorry, maybe I mixed up the issues. There are three issues here:

1. "BUG: KASAN: slab-use-after-free" at mutex_lock() in cma_iw_handler(),
which Shin'ichiro mentioned in this patch, triggered by nvme/030 and
nvme/031 over RDMA with the SIW transport

2. nvme/tcp circular locking on install_queue which could be fixed by:
https://lore.kernel.org/linux-nvme/20230810132006.129365-1-hare@suse.de/
https://lore.kernel.org/linux-nvme/20230726142939.10062-1-guoqing.jiang@linux.dev/

3. circular locking at rdma_destroy_id+0x17/0x20 [rdma_cm], which was
already reported last year but still has no solution:
https://lore.kernel.org/linux-rdma/CAHj4cs93BfTRgWF6PbuZcfq6AARHgYC2g=RQ-7Jgcf1-6h+2SQ@mail.gmail.com/


>
> Can you check if the patch
> 'nvmet-tcp: avoid circular locking dependency on install_queue()'
> fixes this issue, too?
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>
--
Best Regards,
  Yi Zhang

Patch

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 93a1c48d0c32..c5267d9bb184 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2477,6 +2477,7 @@  static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
 	struct sockaddr *laddr = (struct sockaddr *)&iw_event->local_addr;
 	struct sockaddr *raddr = (struct sockaddr *)&iw_event->remote_addr;
 
+	cma_id_get(id_priv);
 	mutex_lock(&id_priv->handler_mutex);
 	if (READ_ONCE(id_priv->state) != RDMA_CM_CONNECT)
 		goto out;
@@ -2524,12 +2525,14 @@  static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
 	if (ret) {
 		/* Destroy the CM ID by returning a non-zero value. */
 		id_priv->cm_id.iw = NULL;
+		cma_id_put(id_priv);
 		destroy_id_handler_unlock(id_priv);
 		return ret;
 	}
 
 out:
 	mutex_unlock(&id_priv->handler_mutex);
+	cma_id_put(id_priv);
 	return ret;
 }