diff mbox series

iscsi: Fix deadlock on recovery path during GFP_IO reclaim

Message ID 20200508055954.843165-1-krisman@collabora.com (mailing list archive)
State Superseded
Headers show
Series iscsi: Fix deadlock on recovery path during GFP_IO reclaim | expand

Commit Message

Gabriel Krisman Bertazi May 8, 2020, 5:59 a.m. UTC
iscsi suffers from a deadlock in case a management command submitted via
the netlink socket sleeps on an allocation while holding the
rx_queue_mutex, if that allocation causes a memory reclaim that
writebacks to a failed iscsi device.  Then, the recovery procedure can
never make progress to recover the failed disk or abort outstanding IO
operations to complete the reclaim (since rx_queue_mutex is locked),
thus locking the system.

Nevertheless, just marking all allocations under rx_queue_mutex as
GFP_NOIO (or locking the userspace process with something like
PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
other subsystems that try to grab locked mutexes, whose threads are
GFP_IO, leading to the same deadlock. One instance where this situation
can be observed is in the backtraces below, stitched from multiple bugs
reports, involving the kobj uevent sent when a session is created.

The root of the problem is not the fact that iscsi does GFP_IO
allocations, that is acceptable. The actual problem is that
rx_queue_mutex has a very large granularity, covering every unrelated
netlink command execution at the same time as the error recovery path.

The proposed fix leverages the recently added mechanism to stop failed
connections from the kernel, by enabling it to execute even though a
management command from the netlink socket is being run (rx_queue_mutex
is held), provided that the command is known to be safe.  It splits the
rx_queue_mutex in two mutexes, one protecting from concurrent command
execution from the netlink socket, and one protecting stop_conn from
racing with other connection management operations that might conflict
with it.

It is not very pretty, but it is the simplest way to resolve the
deadlock.  I considered making it a lock per connection, but some
external mutex would still be needed to deal with iscsi_if_destroy_conn.

The patch was tested by forcing a memory shrinker (unrelated, but used
bufio/dm-verity) to reclaim ISCSI pages every time
ISCSI_UEVENT_CREATE_SESSION happens, which is reasonable to simulate
reclaims that might happen with GFP_KERNEL on that path.  Then, a faulty
hung target causes a connection to fail during intensive IO, at the same
time a new session is added by iscsid.

The following stacktraces are stiches from several bug reports, showing
a case where the deadlock can happen.

 iSCSI-write
         holding: rx_queue_mutex
         waiting: uevent_sock_mutex

         kobject_uevent_env+0x1bd/0x419
         kobject_uevent+0xb/0xd
         device_add+0x48a/0x678
         scsi_add_host_with_dma+0xc5/0x22d
         iscsi_host_add+0x53/0x55
         iscsi_sw_tcp_session_create+0xa6/0x129
         iscsi_if_rx+0x100/0x1247
         netlink_unicast+0x213/0x4f0
         netlink_sendmsg+0x230/0x3c0

 iscsi_fail iscsi_conn_failure
         waiting: rx_queue_mutex

         schedule_preempt_disabled+0x325/0x734
         __mutex_lock_slowpath+0x18b/0x230
         mutex_lock+0x22/0x40
         iscsi_conn_failure+0x42/0x149
         worker_thread+0x24a/0xbc0

 EventManager_
         holding: uevent_sock_mutex
         waiting: dm_bufio_client->lock

         dm_bufio_lock+0xe/0x10
         shrink+0x34/0xf7
         shrink_slab+0x177/0x5d0
         do_try_to_free_pages+0x129/0x470
         try_to_free_mem_cgroup_pages+0x14f/0x210
         memcg_kmem_newpage_charge+0xa6d/0x13b0
         __alloc_pages_nodemask+0x4a3/0x1a70
         fallback_alloc+0x1b2/0x36c
         __kmalloc_node_track_caller+0xb9/0x10d0
         __alloc_skb+0x83/0x2f0
         kobject_uevent_env+0x26b/0x419
         dm_kobject_uevent+0x70/0x79
         dev_suspend+0x1a9/0x1e7
         ctl_ioctl+0x3e9/0x411
         dm_ctl_ioctl+0x13/0x17
         do_vfs_ioctl+0xb3/0x460
         SyS_ioctl+0x5e/0x90

 MemcgReclaimerD"
         holding: dm_bufio_client->lock
         waiting: stuck io to finish (needs iscsi_fail thread to progress)

         schedule at ffffffffbd603618
         io_schedule at ffffffffbd603ba4
         do_io_schedule at ffffffffbdaf0d94
         __wait_on_bit at ffffffffbd6008a6
         out_of_line_wait_on_bit at ffffffffbd600960
         wait_on_bit.constprop.10 at ffffffffbdaf0f17
         __make_buffer_clean at ffffffffbdaf18ba
         __cleanup_old_buffer at ffffffffbdaf192f
         shrink at ffffffffbdaf19fd
         do_shrink_slab at ffffffffbd6ec000
         shrink_slab at ffffffffbd6ec24a
         do_try_to_free_pages at ffffffffbd6eda09
         try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
         mem_cgroup_resize_limit at ffffffffbd7024c0
         mem_cgroup_write at ffffffffbd703149
         cgroup_file_write at ffffffffbd6d9c6e
         sys_write at ffffffffbd6662ea
         system_call_fastpath at ffffffffbdbc34a2

Reported-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 18 deletions(-)

Comments

Martin K. Petersen May 15, 2020, 12:11 a.m. UTC | #1
Lee/Chris: Please review!

> iscsi suffers from a deadlock in case a management command submitted via
> the netlink socket sleeps on an allocation while holding the
> rx_queue_mutex, if that allocation causes a memory reclaim that
> writebacks to a failed iscsi device.  Then, the recovery procedure can
> never make progress to recover the failed disk or abort outstanding IO
> operations to complete the reclaim (since rx_queue_mutex is locked),
> thus locking the system.
>
> Nevertheless, just marking all allocations under rx_queue_mutex as
> GFP_NOIO (or locking the userspace process with something like
> PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
> other subsystems that try to grab locked mutexes, whose threads are
> GFP_IO, leading to the same deadlock. One instance where this situation
> can be observed is in the backtraces below, stitched from multiple bugs
> reports, involving the kobj uevent sent when a session is created.
>
> The root of the problem is not the fact that iscsi does GFP_IO
> allocations, that is acceptable. The actual problem is that
> rx_queue_mutex has a very large granularity, covering every unrelated
> netlink command execution at the same time as the error recovery path.
>
> The proposed fix leverages the recently added mechanism to stop failed
> connections from the kernel, by enabling it to execute even though a
> management command from the netlink socket is being run (rx_queue_mutex
> is held), provided that the command is known to be safe.  It splits the
> rx_queue_mutex in two mutexes, one protecting from concurrent command
> execution from the netlink socket, and one protecting stop_conn from
> racing with other connection management operations that might conflict
> with it.
>
> It is not very pretty, but it is the simplest way to resolve the
> deadlock.  I considered making it a lock per connection, but some
> external mutex would still be needed to deal with iscsi_if_destroy_conn.
>
> The patch was tested by forcing a memory shrinker (unrelated, but used
> bufio/dm-verity) to reclaim ISCSI pages every time
> ISCSI_UEVENT_CREATE_SESSION happens, which is reasonable to simulate
> reclaims that might happen with GFP_KERNEL on that path.  Then, a faulty
> hung target causes a connection to fail during intensive IO, at the same
> time a new session is added by iscsid.
>
> The following stacktraces are stiches from several bug reports, showing
> a case where the deadlock can happen.
>
>  iSCSI-write
>          holding: rx_queue_mutex
>          waiting: uevent_sock_mutex
>
>          kobject_uevent_env+0x1bd/0x419
>          kobject_uevent+0xb/0xd
>          device_add+0x48a/0x678
>          scsi_add_host_with_dma+0xc5/0x22d
>          iscsi_host_add+0x53/0x55
>          iscsi_sw_tcp_session_create+0xa6/0x129
>          iscsi_if_rx+0x100/0x1247
>          netlink_unicast+0x213/0x4f0
>          netlink_sendmsg+0x230/0x3c0
>
>  iscsi_fail iscsi_conn_failure
>          waiting: rx_queue_mutex
>
>          schedule_preempt_disabled+0x325/0x734
>          __mutex_lock_slowpath+0x18b/0x230
>          mutex_lock+0x22/0x40
>          iscsi_conn_failure+0x42/0x149
>          worker_thread+0x24a/0xbc0
>
>  EventManager_
>          holding: uevent_sock_mutex
>          waiting: dm_bufio_client->lock
>
>          dm_bufio_lock+0xe/0x10
>          shrink+0x34/0xf7
>          shrink_slab+0x177/0x5d0
>          do_try_to_free_pages+0x129/0x470
>          try_to_free_mem_cgroup_pages+0x14f/0x210
>          memcg_kmem_newpage_charge+0xa6d/0x13b0
>          __alloc_pages_nodemask+0x4a3/0x1a70
>          fallback_alloc+0x1b2/0x36c
>          __kmalloc_node_track_caller+0xb9/0x10d0
>          __alloc_skb+0x83/0x2f0
>          kobject_uevent_env+0x26b/0x419
>          dm_kobject_uevent+0x70/0x79
>          dev_suspend+0x1a9/0x1e7
>          ctl_ioctl+0x3e9/0x411
>          dm_ctl_ioctl+0x13/0x17
>          do_vfs_ioctl+0xb3/0x460
>          SyS_ioctl+0x5e/0x90
>
>  MemcgReclaimerD"
>          holding: dm_bufio_client->lock
>          waiting: stuck io to finish (needs iscsi_fail thread to progress)
>
>          schedule at ffffffffbd603618
>          io_schedule at ffffffffbd603ba4
>          do_io_schedule at ffffffffbdaf0d94
>          __wait_on_bit at ffffffffbd6008a6
>          out_of_line_wait_on_bit at ffffffffbd600960
>          wait_on_bit.constprop.10 at ffffffffbdaf0f17
>          __make_buffer_clean at ffffffffbdaf18ba
>          __cleanup_old_buffer at ffffffffbdaf192f
>          shrink at ffffffffbdaf19fd
>          do_shrink_slab at ffffffffbd6ec000
>          shrink_slab at ffffffffbd6ec24a
>          do_try_to_free_pages at ffffffffbd6eda09
>          try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
>          mem_cgroup_resize_limit at ffffffffbd7024c0
>          mem_cgroup_write at ffffffffbd703149
>          cgroup_file_write at ffffffffbd6d9c6e
>          sys_write at ffffffffbd6662ea
>          system_call_fastpath at ffffffffbdbc34a2
>
> Reported-by: Khazhismel Kumykov <khazhy@google.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
>  1 file changed, 49 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
> index 17a45716a0fe..d99c17306dff 100644
> --- a/drivers/scsi/scsi_transport_iscsi.c
> +++ b/drivers/scsi/scsi_transport_iscsi.c
> @@ -1616,6 +1616,12 @@ static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
>  static struct sock *nls;
>  static DEFINE_MUTEX(rx_queue_mutex);
>  
> +/*
> + * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
> + * against the kernel stop_connection recovery mechanism
> + */
> +static DEFINE_MUTEX(conn_mutex);
> +
>  static LIST_HEAD(sesslist);
>  static LIST_HEAD(sessdestroylist);
>  static DEFINE_SPINLOCK(sesslock);
> @@ -2442,6 +2448,32 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
>  }
>  EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
>  
> +/*
> + * This can be called without the rx_queue_mutex, if invoked by the kernel
> + * stop work. But, in that case, it is guaranteed not to race with
> + * iscsi_destroy by conn_mutex.
> + */
> +static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
> +{
> +	/*
> +	 * It is important that this path doesn't rely on
> +	 * rx_queue_mutex, otherwise, a thread doing allocation on a
> +	 * start_session/start_connection could sleep waiting on a
> +	 * writeback to a failed iscsi device, that cannot be recovered
> +	 * because the lock is held.  If we don't hold it here, the
> +	 * kernel stop_conn_work_fn has a chance to stop the broken
> +	 * session and resolve the allocation.
> +	 *
> +	 * Still, the user invoked .stop_conn() needs to be serialized
> +	 * with stop_conn_work_fn by a private mutex.  Not pretty, but
> +	 * it works.
> +	 */
> +	mutex_lock(&conn_mutex);
> +	conn->transport->stop_conn(conn, flag);
> +	mutex_unlock(&conn_mutex);
> +
> +}
> +
>  static void stop_conn_work_fn(struct work_struct *work)
>  {
>  	struct iscsi_cls_conn *conn, *tmp;
> @@ -2460,30 +2492,17 @@ static void stop_conn_work_fn(struct work_struct *work)
>  		uint32_t sid = iscsi_conn_get_sid(conn);
>  		struct iscsi_cls_session *session;
>  
> -		mutex_lock(&rx_queue_mutex);
> -
>  		session = iscsi_session_lookup(sid);
>  		if (session) {
>  			if (system_state != SYSTEM_RUNNING) {
>  				session->recovery_tmo = 0;
> -				conn->transport->stop_conn(conn,
> -							   STOP_CONN_TERM);
> +				iscsi_if_stop_conn(conn, STOP_CONN_TERM);
>  			} else {
> -				conn->transport->stop_conn(conn,
> -							   STOP_CONN_RECOVER);
> +				iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
>  			}
>  		}
>  
>  		list_del_init(&conn->conn_list_err);
> -
> -		mutex_unlock(&rx_queue_mutex);
> -
> -		/* we don't want to hold rx_queue_mutex for too long,
> -		 * for instance if many conns failed at the same time,
> -		 * since this stall other iscsi maintenance operations.
> -		 * Give other users a chance to proceed.
> -		 */
> -		cond_resched();
>  	}
>  }
>  
> @@ -2843,8 +2862,11 @@ iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
>  	spin_unlock_irqrestore(&connlock, flags);
>  
>  	ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
> +
> +	mutex_lock(&conn_mutex);
>  	if (transport->destroy_conn)
>  		transport->destroy_conn(conn);
> +	mutex_unlock(&conn_mutex);
>  
>  	return 0;
>  }
> @@ -3686,9 +3708,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>  			break;
>  		}
>  
> +		mutex_lock(&conn_mutex);
>  		ev->r.retcode =	transport->bind_conn(session, conn,
>  						ev->u.b_conn.transport_eph,
>  						ev->u.b_conn.is_leading);
> +		mutex_unlock(&conn_mutex);
> +
>  		if (ev->r.retcode || !transport->ep_connect)
>  			break;
>  
> @@ -3709,25 +3734,31 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>  		break;
>  	case ISCSI_UEVENT_START_CONN:
>  		conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
> -		if (conn)
> +		if (conn) {
> +			mutex_lock(&conn_mutex);
>  			ev->r.retcode = transport->start_conn(conn);
> +			mutex_unlock(&conn_mutex);
> +		}
>  		else
>  			err = -EINVAL;
>  		break;
>  	case ISCSI_UEVENT_STOP_CONN:
>  		conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
>  		if (conn)
> -			transport->stop_conn(conn, ev->u.stop_conn.flag);
> +			iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
>  		else
>  			err = -EINVAL;
>  		break;
>  	case ISCSI_UEVENT_SEND_PDU:
>  		conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
> -		if (conn)
> +		if (conn) {
> +			mutex_lock(&conn_mutex);
>  			ev->r.retcode =	transport->send_pdu(conn,
>  				(struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
>  				(char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
>  				ev->u.send_pdu.data_size);
> +			mutex_unlock(&conn_mutex);
> +		}
>  		else
>  			err = -EINVAL;
>  		break;
Lee Duncan May 17, 2020, 5:30 p.m. UTC | #2
On 5/7/20 10:59 PM, Gabriel Krisman Bertazi wrote:
> iscsi suffers from a deadlock in case a management command submitted via
> the netlink socket sleeps on an allocation while holding the
> rx_queue_mutex, if that allocation causes a memory reclaim that
> writebacks to a failed iscsi device.  Then, the recovery procedure can
> never make progress to recover the failed disk or abort outstanding IO
> operations to complete the reclaim (since rx_queue_mutex is locked),
> thus locking the system.
> 
> Nevertheless, just marking all allocations under rx_queue_mutex as
> GFP_NOIO (or locking the userspace process with something like
> PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
> other subsystems that try to grab locked mutexes, whose threads are
> GFP_IO, leading to the same deadlock. One instance where this situation
> can be observed is in the backtraces below, stitched from multiple bugs
> reports, involving the kobj uevent sent when a session is created.
> 
> The root of the problem is not the fact that iscsi does GFP_IO
> allocations, that is acceptable. The actual problem is that
> rx_queue_mutex has a very large granularity, covering every unrelated
> netlink command execution at the same time as the error recovery path.
> 
> The proposed fix leverages the recently added mechanism to stop failed
> connections from the kernel, by enabling it to execute even though a
> management command from the netlink socket is being run (rx_queue_mutex
> is held), provided that the command is known to be safe.  It splits the
> rx_queue_mutex in two mutexes, one protecting from concurrent command
> execution from the netlink socket, and one protecting stop_conn from
> racing with other connection management operations that might conflict
> with it.
> 
> It is not very pretty, but it is the simplest way to resolve the
> deadlock.  I considered making it a lock per connection, but some
> external mutex would still be needed to deal with iscsi_if_destroy_conn.
> 
> The patch was tested by forcing a memory shrinker (unrelated, but used
> bufio/dm-verity) to reclaim ISCSI pages every time
> ISCSI_UEVENT_CREATE_SESSION happens, which is reasonable to simulate
> reclaims that might happen with GFP_KERNEL on that path.  Then, a faulty
> hung target causes a connection to fail during intensive IO, at the same
> time a new session is added by iscsid.
> 
> The following stacktraces are stiches from several bug reports, showing
> a case where the deadlock can happen.
> 
>  iSCSI-write
>          holding: rx_queue_mutex
>          waiting: uevent_sock_mutex
> 
>          kobject_uevent_env+0x1bd/0x419
>          kobject_uevent+0xb/0xd
>          device_add+0x48a/0x678
>          scsi_add_host_with_dma+0xc5/0x22d
>          iscsi_host_add+0x53/0x55
>          iscsi_sw_tcp_session_create+0xa6/0x129
>          iscsi_if_rx+0x100/0x1247
>          netlink_unicast+0x213/0x4f0
>          netlink_sendmsg+0x230/0x3c0
> 
>  iscsi_fail iscsi_conn_failure
>          waiting: rx_queue_mutex
> 
>          schedule_preempt_disabled+0x325/0x734
>          __mutex_lock_slowpath+0x18b/0x230
>          mutex_lock+0x22/0x40
>          iscsi_conn_failure+0x42/0x149
>          worker_thread+0x24a/0xbc0
> 
>  EventManager_
>          holding: uevent_sock_mutex
>          waiting: dm_bufio_client->lock
> 
>          dm_bufio_lock+0xe/0x10
>          shrink+0x34/0xf7
>          shrink_slab+0x177/0x5d0
>          do_try_to_free_pages+0x129/0x470
>          try_to_free_mem_cgroup_pages+0x14f/0x210
>          memcg_kmem_newpage_charge+0xa6d/0x13b0
>          __alloc_pages_nodemask+0x4a3/0x1a70
>          fallback_alloc+0x1b2/0x36c
>          __kmalloc_node_track_caller+0xb9/0x10d0
>          __alloc_skb+0x83/0x2f0
>          kobject_uevent_env+0x26b/0x419
>          dm_kobject_uevent+0x70/0x79
>          dev_suspend+0x1a9/0x1e7
>          ctl_ioctl+0x3e9/0x411
>          dm_ctl_ioctl+0x13/0x17
>          do_vfs_ioctl+0xb3/0x460
>          SyS_ioctl+0x5e/0x90
> 
>  MemcgReclaimerD"
>          holding: dm_bufio_client->lock
>          waiting: stuck io to finish (needs iscsi_fail thread to progress)
> 
>          schedule at ffffffffbd603618
>          io_schedule at ffffffffbd603ba4
>          do_io_schedule at ffffffffbdaf0d94
>          __wait_on_bit at ffffffffbd6008a6
>          out_of_line_wait_on_bit at ffffffffbd600960
>          wait_on_bit.constprop.10 at ffffffffbdaf0f17
>          __make_buffer_clean at ffffffffbdaf18ba
>          __cleanup_old_buffer at ffffffffbdaf192f
>          shrink at ffffffffbdaf19fd
>          do_shrink_slab at ffffffffbd6ec000
>          shrink_slab at ffffffffbd6ec24a
>          do_try_to_free_pages at ffffffffbd6eda09
>          try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
>          mem_cgroup_resize_limit at ffffffffbd7024c0
>          mem_cgroup_write at ffffffffbd703149
>          cgroup_file_write at ffffffffbd6d9c6e
>          sys_write at ffffffffbd6662ea
>          system_call_fastpath at ffffffffbdbc34a2
> 
> Reported-by: Khazhismel Kumykov <khazhy@google.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
>  1 file changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
> index 17a45716a0fe..d99c17306dff 100644
> --- a/drivers/scsi/scsi_transport_iscsi.c
> +++ b/drivers/scsi/scsi_transport_iscsi.c
> @@ -1616,6 +1616,12 @@ static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
>  static struct sock *nls;
>  static DEFINE_MUTEX(rx_queue_mutex);
>  
> +/*
> + * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
> + * against the kernel stop_connection recovery mechanism
> + */
> +static DEFINE_MUTEX(conn_mutex);
> +
>  static LIST_HEAD(sesslist);
>  static LIST_HEAD(sessdestroylist);
>  static DEFINE_SPINLOCK(sesslock);
> @@ -2442,6 +2448,32 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
>  }
>  EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
>  
> +/*
> + * This can be called without the rx_queue_mutex, if invoked by the kernel
> + * stop work. But, in that case, it is guaranteed not to race with
> + * iscsi_destroy by conn_mutex.
> + */
> +static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
> +{
> +	/*
> +	 * It is important that this path doesn't rely on
> +	 * rx_queue_mutex, otherwise, a thread doing allocation on a
> +	 * start_session/start_connection could sleep waiting on a
> +	 * writeback to a failed iscsi device, that cannot be recovered
> +	 * because the lock is held.  If we don't hold it here, the
> +	 * kernel stop_conn_work_fn has a chance to stop the broken
> +	 * session and resolve the allocation.
> +	 *
> +	 * Still, the user invoked .stop_conn() needs to be serialized
> +	 * with stop_conn_work_fn by a private mutex.  Not pretty, but
> +	 * it works.
> +	 */
> +	mutex_lock(&conn_mutex);
> +	conn->transport->stop_conn(conn, flag);
> +	mutex_unlock(&conn_mutex);
> +
> +}
> +
>  static void stop_conn_work_fn(struct work_struct *work)
>  {
>  	struct iscsi_cls_conn *conn, *tmp;
> @@ -2460,30 +2492,17 @@ static void stop_conn_work_fn(struct work_struct *work)
>  		uint32_t sid = iscsi_conn_get_sid(conn);
>  		struct iscsi_cls_session *session;
>  
> -		mutex_lock(&rx_queue_mutex);
> -
>  		session = iscsi_session_lookup(sid);
>  		if (session) {
>  			if (system_state != SYSTEM_RUNNING) {
>  				session->recovery_tmo = 0;
> -				conn->transport->stop_conn(conn,
> -							   STOP_CONN_TERM);
> +				iscsi_if_stop_conn(conn, STOP_CONN_TERM);
>  			} else {
> -				conn->transport->stop_conn(conn,
> -							   STOP_CONN_RECOVER);
> +				iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
>  			}
>  		}
>  
>  		list_del_init(&conn->conn_list_err);
> -
> -		mutex_unlock(&rx_queue_mutex);
> -
> -		/* we don't want to hold rx_queue_mutex for too long,
> -		 * for instance if many conns failed at the same time,
> -		 * since this stall other iscsi maintenance operations.
> -		 * Give other users a chance to proceed.
> -		 */
> -		cond_resched();
>  	}
>  }

I'm curious about why you removed the cond_resched() here. Is it because
it is no longer needed, with shorter (mutex) waiting time?

>  
> @@ -2843,8 +2862,11 @@ iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
>  	spin_unlock_irqrestore(&connlock, flags);
>  
>  	ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
> +
> +	mutex_lock(&conn_mutex);
>  	if (transport->destroy_conn)
>  		transport->destroy_conn(conn);
> +	mutex_unlock(&conn_mutex);
>  
>  	return 0;
>  }
> @@ -3686,9 +3708,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>  			break;
>  		}
>  
> +		mutex_lock(&conn_mutex);
>  		ev->r.retcode =	transport->bind_conn(session, conn,
>  						ev->u.b_conn.transport_eph,
>  						ev->u.b_conn.is_leading);
> +		mutex_unlock(&conn_mutex);
> +
>  		if (ev->r.retcode || !transport->ep_connect)
>  			break;
>  
> @@ -3709,25 +3734,31 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>  		break;
>  	case ISCSI_UEVENT_START_CONN:
>  		conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
> -		if (conn)
> +		if (conn) {
> +			mutex_lock(&conn_mutex);
>  			ev->r.retcode = transport->start_conn(conn);
> +			mutex_unlock(&conn_mutex);
> +		}
>  		else
>  			err = -EINVAL;
>  		break;
>  	case ISCSI_UEVENT_STOP_CONN:
>  		conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
>  		if (conn)
> -			transport->stop_conn(conn, ev->u.stop_conn.flag);
> +			iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
>  		else
>  			err = -EINVAL;
>  		break;
>  	case ISCSI_UEVENT_SEND_PDU:
>  		conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
> -		if (conn)
> +		if (conn) {
> +			mutex_lock(&conn_mutex);
>  			ev->r.retcode =	transport->send_pdu(conn,
>  				(struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
>  				(char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
>  				ev->u.send_pdu.data_size);
> +			mutex_unlock(&conn_mutex);
> +		}
>  		else
>  			err = -EINVAL;
>  		break;
> 

My question above is for my own information, so I'll still say:

Reviewed-by: Lee Duncan <lduncan@suse.com>
Gabriel Krisman Bertazi May 18, 2020, 6:53 p.m. UTC | #3
Lee Duncan <lduncan@suse.com> writes:

> On 5/7/20 10:59 PM, Gabriel Krisman Bertazi wrote:
>> iscsi suffers from a deadlock in case a management command submitted via
>> the netlink socket sleeps on an allocation while holding the
>> rx_queue_mutex, if that allocation causes a memory reclaim that
>> writebacks to a failed iscsi device.  Then, the recovery procedure can
>> never make progress to recover the failed disk or abort outstanding IO
>> operations to complete the reclaim (since rx_queue_mutex is locked),
>> thus locking the system.
>> 
>> Nevertheless, just marking all allocations under rx_queue_mutex as
>> GFP_NOIO (or locking the userspace process with something like
>> PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
>> other subsystems that try to grab locked mutexes, whose threads are
>> GFP_IO, leading to the same deadlock. One instance where this situation
>> can be observed is in the backtraces below, stitched from multiple bugs
>> reports, involving the kobj uevent sent when a session is created.
>> 
>> The root of the problem is not the fact that iscsi does GFP_IO
>> allocations, that is acceptable. The actual problem is that
>> rx_queue_mutex has a very large granularity, covering every unrelated
>> netlink command execution at the same time as the error recovery path.
>> 
>> The proposed fix leverages the recently added mechanism to stop failed
>> connections from the kernel, by enabling it to execute even though a
>> management command from the netlink socket is being run (rx_queue_mutex
>> is held), provided that the command is known to be safe.  It splits the
>> rx_queue_mutex in two mutexes, one protecting from concurrent command
>> execution from the netlink socket, and one protecting stop_conn from
>> racing with other connection management operations that might conflict
>> with it.
>> 
>> It is not very pretty, but it is the simplest way to resolve the
>> deadlock.  I considered making it a lock per connection, but some
>> external mutex would still be needed to deal with iscsi_if_destroy_conn.
>> 
>> The patch was tested by forcing a memory shrinker (unrelated, but used
>> bufio/dm-verity) to reclaim ISCSI pages every time
>> ISCSI_UEVENT_CREATE_SESSION happens, which is reasonable to simulate
>> reclaims that might happen with GFP_KERNEL on that path.  Then, a faulty
>> hung target causes a connection to fail during intensive IO, at the same
>> time a new session is added by iscsid.
>> 
>> The following stacktraces are stiches from several bug reports, showing
>> a case where the deadlock can happen.
>> 
>>  iSCSI-write
>>          holding: rx_queue_mutex
>>          waiting: uevent_sock_mutex
>> 
>>          kobject_uevent_env+0x1bd/0x419
>>          kobject_uevent+0xb/0xd
>>          device_add+0x48a/0x678
>>          scsi_add_host_with_dma+0xc5/0x22d
>>          iscsi_host_add+0x53/0x55
>>          iscsi_sw_tcp_session_create+0xa6/0x129
>>          iscsi_if_rx+0x100/0x1247
>>          netlink_unicast+0x213/0x4f0
>>          netlink_sendmsg+0x230/0x3c0
>> 
>>  iscsi_fail iscsi_conn_failure
>>          waiting: rx_queue_mutex
>> 
>>          schedule_preempt_disabled+0x325/0x734
>>          __mutex_lock_slowpath+0x18b/0x230
>>          mutex_lock+0x22/0x40
>>          iscsi_conn_failure+0x42/0x149
>>          worker_thread+0x24a/0xbc0
>> 
>>  EventManager_
>>          holding: uevent_sock_mutex
>>          waiting: dm_bufio_client->lock
>> 
>>          dm_bufio_lock+0xe/0x10
>>          shrink+0x34/0xf7
>>          shrink_slab+0x177/0x5d0
>>          do_try_to_free_pages+0x129/0x470
>>          try_to_free_mem_cgroup_pages+0x14f/0x210
>>          memcg_kmem_newpage_charge+0xa6d/0x13b0
>>          __alloc_pages_nodemask+0x4a3/0x1a70
>>          fallback_alloc+0x1b2/0x36c
>>          __kmalloc_node_track_caller+0xb9/0x10d0
>>          __alloc_skb+0x83/0x2f0
>>          kobject_uevent_env+0x26b/0x419
>>          dm_kobject_uevent+0x70/0x79
>>          dev_suspend+0x1a9/0x1e7
>>          ctl_ioctl+0x3e9/0x411
>>          dm_ctl_ioctl+0x13/0x17
>>          do_vfs_ioctl+0xb3/0x460
>>          SyS_ioctl+0x5e/0x90
>> 
>>  MemcgReclaimerD"
>>          holding: dm_bufio_client->lock
>>          waiting: stuck io to finish (needs iscsi_fail thread to progress)
>> 
>>          schedule at ffffffffbd603618
>>          io_schedule at ffffffffbd603ba4
>>          do_io_schedule at ffffffffbdaf0d94
>>          __wait_on_bit at ffffffffbd6008a6
>>          out_of_line_wait_on_bit at ffffffffbd600960
>>          wait_on_bit.constprop.10 at ffffffffbdaf0f17
>>          __make_buffer_clean at ffffffffbdaf18ba
>>          __cleanup_old_buffer at ffffffffbdaf192f
>>          shrink at ffffffffbdaf19fd
>>          do_shrink_slab at ffffffffbd6ec000
>>          shrink_slab at ffffffffbd6ec24a
>>          do_try_to_free_pages at ffffffffbd6eda09
>>          try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
>>          mem_cgroup_resize_limit at ffffffffbd7024c0
>>          mem_cgroup_write at ffffffffbd703149
>>          cgroup_file_write at ffffffffbd6d9c6e
>>          sys_write at ffffffffbd6662ea
>>          system_call_fastpath at ffffffffbdbc34a2
>> 
>> Reported-by: Khazhismel Kumykov <khazhy@google.com>
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>> ---
>>  drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
>>  1 file changed, 49 insertions(+), 18 deletions(-)
>> 
>> diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
>> index 17a45716a0fe..d99c17306dff 100644
>> --- a/drivers/scsi/scsi_transport_iscsi.c
>> +++ b/drivers/scsi/scsi_transport_iscsi.c
>> @@ -1616,6 +1616,12 @@ static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
>>  static struct sock *nls;
>>  static DEFINE_MUTEX(rx_queue_mutex);
>>  
>> +/*
>> + * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
>> + * against the kernel stop_connection recovery mechanism
>> + */
>> +static DEFINE_MUTEX(conn_mutex);
>> +
>>  static LIST_HEAD(sesslist);
>>  static LIST_HEAD(sessdestroylist);
>>  static DEFINE_SPINLOCK(sesslock);
>> @@ -2442,6 +2448,32 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
>>  }
>>  EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
>>  
>> +/*
>> + * This can be called without the rx_queue_mutex, if invoked by the kernel
>> + * stop work. But, in that case, it is guaranteed not to race with
>> + * iscsi_destroy by conn_mutex.
>> + */
>> +static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
>> +{
>> +	/*
>> +	 * It is important that this path doesn't rely on
>> +	 * rx_queue_mutex, otherwise, a thread doing allocation on a
>> +	 * start_session/start_connection could sleep waiting on a
>> +	 * writeback to a failed iscsi device, that cannot be recovered
>> +	 * because the lock is held.  If we don't hold it here, the
>> +	 * kernel stop_conn_work_fn has a chance to stop the broken
>> +	 * session and resolve the allocation.
>> +	 *
>> +	 * Still, the user invoked .stop_conn() needs to be serialized
>> +	 * with stop_conn_work_fn by a private mutex.  Not pretty, but
>> +	 * it works.
>> +	 */
>> +	mutex_lock(&conn_mutex);
>> +	conn->transport->stop_conn(conn, flag);
>> +	mutex_unlock(&conn_mutex);
>> +
>> +}
>> +
>>  static void stop_conn_work_fn(struct work_struct *work)
>>  {
>>  	struct iscsi_cls_conn *conn, *tmp;
>> @@ -2460,30 +2492,17 @@ static void stop_conn_work_fn(struct work_struct *work)
>>  		uint32_t sid = iscsi_conn_get_sid(conn);
>>  		struct iscsi_cls_session *session;
>>  
>> -		mutex_lock(&rx_queue_mutex);
>> -
>>  		session = iscsi_session_lookup(sid);
>>  		if (session) {
>>  			if (system_state != SYSTEM_RUNNING) {
>>  				session->recovery_tmo = 0;
>> -				conn->transport->stop_conn(conn,
>> -							   STOP_CONN_TERM);
>> +				iscsi_if_stop_conn(conn, STOP_CONN_TERM);
>>  			} else {
>> -				conn->transport->stop_conn(conn,
>> -							   STOP_CONN_RECOVER);
>> +				iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
>>  			}
>>  		}
>>  
>>  		list_del_init(&conn->conn_list_err);
>> -
>> -		mutex_unlock(&rx_queue_mutex);
>> -
>> -		/* we don't want to hold rx_queue_mutex for too long,
>> -		 * for instance if many conns failed at the same time,
>> -		 * since this stall other iscsi maintenance operations.
>> -		 * Give other users a chance to proceed.
>> -		 */
>> -		cond_resched();
>>  	}
>>  }
>
> I'm curious about why you removed the cond_resched() here. Is it because
> it is no longer needed, with shorter (mutex) waiting time?
>

Hi Lee,

My thought was that the main reason for cond_resched here was to give a
chance for other iscsi maintenance operations to run.  But, since
conn_mutex, differently from rx_queue_mutex, won't stop most of the
operations, it felt unnecessary.  If you disagree, I can submit a v2
that doesn't change it.



>>  
>> @@ -2843,8 +2862,11 @@ iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
>>  	spin_unlock_irqrestore(&connlock, flags);
>>  
>>  	ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
>> +
>> +	mutex_lock(&conn_mutex);
>>  	if (transport->destroy_conn)
>>  		transport->destroy_conn(conn);
>> +	mutex_unlock(&conn_mutex);
>>  
>>  	return 0;
>>  }
>> @@ -3686,9 +3708,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>>  			break;
>>  		}
>>  
>> +		mutex_lock(&conn_mutex);
>>  		ev->r.retcode =	transport->bind_conn(session, conn,
>>  						ev->u.b_conn.transport_eph,
>>  						ev->u.b_conn.is_leading);
>> +		mutex_unlock(&conn_mutex);
>> +
>>  		if (ev->r.retcode || !transport->ep_connect)
>>  			break;
>>  
>> @@ -3709,25 +3734,31 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>>  		break;
>>  	case ISCSI_UEVENT_START_CONN:
>>  		conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
>> -		if (conn)
>> +		if (conn) {
>> +			mutex_lock(&conn_mutex);
>>  			ev->r.retcode = transport->start_conn(conn);
>> +			mutex_unlock(&conn_mutex);
>> +		}
>>  		else
>>  			err = -EINVAL;
>>  		break;
>>  	case ISCSI_UEVENT_STOP_CONN:
>>  		conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
>>  		if (conn)
>> -			transport->stop_conn(conn, ev->u.stop_conn.flag);
>> +			iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
>>  		else
>>  			err = -EINVAL;
>>  		break;
>>  	case ISCSI_UEVENT_SEND_PDU:
>>  		conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
>> -		if (conn)
>> +		if (conn) {
>> +			mutex_lock(&conn_mutex);
>>  			ev->r.retcode =	transport->send_pdu(conn,
>>  				(struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
>>  				(char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
>>  				ev->u.send_pdu.data_size);
>> +			mutex_unlock(&conn_mutex);
>> +		}
>>  		else
>>  			err = -EINVAL;
>>  		break;
>> 
>
> My question above is for my own information, so I'll still say:
>
> Reviewed-by: Lee Duncan <lduncan@suse.com>
Lee Duncan May 18, 2020, 7:05 p.m. UTC | #4
On 5/18/20 11:53 AM, Gabriel Krisman Bertazi wrote:
> Lee Duncan <lduncan@suse.com> writes:
> 
>> On 5/7/20 10:59 PM, Gabriel Krisman Bertazi wrote:
>>> iscsi suffers from a deadlock in case a management command submitted via
>>> the netlink socket sleeps on an allocation while holding the
>>> rx_queue_mutex, if that allocation causes a memory reclaim that
>>> writebacks to a failed iscsi device.  Then, the recovery procedure can
>>> never make progress to recover the failed disk or abort outstanding IO
>>> operations to complete the reclaim (since rx_queue_mutex is locked),
>>> thus locking the system.
>>>
>>> Nevertheless, just marking all allocations under rx_queue_mutex as
>>> GFP_NOIO (or locking the userspace process with something like
>>> PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
>>> other subsystems that try to grab locked mutexes, whose threads are
>>> GFP_IO, leading to the same deadlock. One instance where this situation
>>> can be observed is in the backtraces below, stitched from multiple bugs
>>> reports, involving the kobj uevent sent when a session is created.
>>>
>>> The root of the problem is not the fact that iscsi does GFP_IO
>>> allocations, that is acceptable. The actual problem is that
>>> rx_queue_mutex has a very large granularity, covering every unrelated
>>> netlink command execution at the same time as the error recovery path.
>>>
>>> The proposed fix leverages the recently added mechanism to stop failed
>>> connections from the kernel, by enabling it to execute even though a
>>> management command from the netlink socket is being run (rx_queue_mutex
>>> is held), provided that the command is known to be safe.  It splits the
>>> rx_queue_mutex in two mutexes, one protecting from concurrent command
>>> execution from the netlink socket, and one protecting stop_conn from
>>> racing with other connection management operations that might conflict
>>> with it.
>>>
>>> It is not very pretty, but it is the simplest way to resolve the
>>> deadlock.  I considered making it a lock per connection, but some
>>> external mutex would still be needed to deal with iscsi_if_destroy_conn.
>>>
>>> The patch was tested by forcing a memory shrinker (unrelated, but used
>>> bufio/dm-verity) to reclaim ISCSI pages every time
>>> ISCSI_UEVENT_CREATE_SESSION happens, which is reasonable to simulate
>>> reclaims that might happen with GFP_KERNEL on that path.  Then, a faulty
>>> hung target causes a connection to fail during intensive IO, at the same
>>> time a new session is added by iscsid.
>>>
>>> The following stacktraces are stiches from several bug reports, showing
>>> a case where the deadlock can happen.
>>>
>>>  iSCSI-write
>>>          holding: rx_queue_mutex
>>>          waiting: uevent_sock_mutex
>>>
>>>          kobject_uevent_env+0x1bd/0x419
>>>          kobject_uevent+0xb/0xd
>>>          device_add+0x48a/0x678
>>>          scsi_add_host_with_dma+0xc5/0x22d
>>>          iscsi_host_add+0x53/0x55
>>>          iscsi_sw_tcp_session_create+0xa6/0x129
>>>          iscsi_if_rx+0x100/0x1247
>>>          netlink_unicast+0x213/0x4f0
>>>          netlink_sendmsg+0x230/0x3c0
>>>
>>>  iscsi_fail iscsi_conn_failure
>>>          waiting: rx_queue_mutex
>>>
>>>          schedule_preempt_disabled+0x325/0x734
>>>          __mutex_lock_slowpath+0x18b/0x230
>>>          mutex_lock+0x22/0x40
>>>          iscsi_conn_failure+0x42/0x149
>>>          worker_thread+0x24a/0xbc0
>>>
>>>  EventManager_
>>>          holding: uevent_sock_mutex
>>>          waiting: dm_bufio_client->lock
>>>
>>>          dm_bufio_lock+0xe/0x10
>>>          shrink+0x34/0xf7
>>>          shrink_slab+0x177/0x5d0
>>>          do_try_to_free_pages+0x129/0x470
>>>          try_to_free_mem_cgroup_pages+0x14f/0x210
>>>          memcg_kmem_newpage_charge+0xa6d/0x13b0
>>>          __alloc_pages_nodemask+0x4a3/0x1a70
>>>          fallback_alloc+0x1b2/0x36c
>>>          __kmalloc_node_track_caller+0xb9/0x10d0
>>>          __alloc_skb+0x83/0x2f0
>>>          kobject_uevent_env+0x26b/0x419
>>>          dm_kobject_uevent+0x70/0x79
>>>          dev_suspend+0x1a9/0x1e7
>>>          ctl_ioctl+0x3e9/0x411
>>>          dm_ctl_ioctl+0x13/0x17
>>>          do_vfs_ioctl+0xb3/0x460
>>>          SyS_ioctl+0x5e/0x90
>>>
>>>  MemcgReclaimerD"
>>>          holding: dm_bufio_client->lock
>>>          waiting: stuck io to finish (needs iscsi_fail thread to progress)
>>>
>>>          schedule at ffffffffbd603618
>>>          io_schedule at ffffffffbd603ba4
>>>          do_io_schedule at ffffffffbdaf0d94
>>>          __wait_on_bit at ffffffffbd6008a6
>>>          out_of_line_wait_on_bit at ffffffffbd600960
>>>          wait_on_bit.constprop.10 at ffffffffbdaf0f17
>>>          __make_buffer_clean at ffffffffbdaf18ba
>>>          __cleanup_old_buffer at ffffffffbdaf192f
>>>          shrink at ffffffffbdaf19fd
>>>          do_shrink_slab at ffffffffbd6ec000
>>>          shrink_slab at ffffffffbd6ec24a
>>>          do_try_to_free_pages at ffffffffbd6eda09
>>>          try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
>>>          mem_cgroup_resize_limit at ffffffffbd7024c0
>>>          mem_cgroup_write at ffffffffbd703149
>>>          cgroup_file_write at ffffffffbd6d9c6e
>>>          sys_write at ffffffffbd6662ea
>>>          system_call_fastpath at ffffffffbdbc34a2
>>>
>>> Reported-by: Khazhismel Kumykov <khazhy@google.com>
>>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>>> ---
>>>  drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
>>>  1 file changed, 49 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
>>> index 17a45716a0fe..d99c17306dff 100644
>>> --- a/drivers/scsi/scsi_transport_iscsi.c
>>> +++ b/drivers/scsi/scsi_transport_iscsi.c
>>> @@ -1616,6 +1616,12 @@ static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
>>>  static struct sock *nls;
>>>  static DEFINE_MUTEX(rx_queue_mutex);
>>>  
>>> +/*
>>> + * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
>>> + * against the kernel stop_connection recovery mechanism
>>> + */
>>> +static DEFINE_MUTEX(conn_mutex);
>>> +
>>>  static LIST_HEAD(sesslist);
>>>  static LIST_HEAD(sessdestroylist);
>>>  static DEFINE_SPINLOCK(sesslock);
>>> @@ -2442,6 +2448,32 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
>>>  }
>>>  EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
>>>  
>>> +/*
>>> + * This can be called without the rx_queue_mutex, if invoked by the kernel
>>> + * stop work. But, in that case, it is guaranteed not to race with
>>> + * iscsi_destroy by conn_mutex.
>>> + */
>>> +static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
>>> +{
>>> +	/*
>>> +	 * It is important that this path doesn't rely on
>>> +	 * rx_queue_mutex, otherwise, a thread doing allocation on a
>>> +	 * start_session/start_connection could sleep waiting on a
>>> +	 * writeback to a failed iscsi device, that cannot be recovered
>>> +	 * because the lock is held.  If we don't hold it here, the
>>> +	 * kernel stop_conn_work_fn has a chance to stop the broken
>>> +	 * session and resolve the allocation.
>>> +	 *
>>> +	 * Still, the user invoked .stop_conn() needs to be serialized
>>> +	 * with stop_conn_work_fn by a private mutex.  Not pretty, but
>>> +	 * it works.
>>> +	 */
>>> +	mutex_lock(&conn_mutex);
>>> +	conn->transport->stop_conn(conn, flag);
>>> +	mutex_unlock(&conn_mutex);
>>> +
>>> +}
>>> +
>>>  static void stop_conn_work_fn(struct work_struct *work)
>>>  {
>>>  	struct iscsi_cls_conn *conn, *tmp;
>>> @@ -2460,30 +2492,17 @@ static void stop_conn_work_fn(struct work_struct *work)
>>>  		uint32_t sid = iscsi_conn_get_sid(conn);
>>>  		struct iscsi_cls_session *session;
>>>  
>>> -		mutex_lock(&rx_queue_mutex);
>>> -
>>>  		session = iscsi_session_lookup(sid);
>>>  		if (session) {
>>>  			if (system_state != SYSTEM_RUNNING) {
>>>  				session->recovery_tmo = 0;
>>> -				conn->transport->stop_conn(conn,
>>> -							   STOP_CONN_TERM);
>>> +				iscsi_if_stop_conn(conn, STOP_CONN_TERM);
>>>  			} else {
>>> -				conn->transport->stop_conn(conn,
>>> -							   STOP_CONN_RECOVER);
>>> +				iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
>>>  			}
>>>  		}
>>>  
>>>  		list_del_init(&conn->conn_list_err);
>>> -
>>> -		mutex_unlock(&rx_queue_mutex);
>>> -
>>> -		/* we don't want to hold rx_queue_mutex for too long,
>>> -		 * for instance if many conns failed at the same time,
>>> -		 * since this stall other iscsi maintenance operations.
>>> -		 * Give other users a chance to proceed.
>>> -		 */
>>> -		cond_resched();
>>>  	}
>>>  }
>>
>> I'm curious about why you removed the cond_resched() here. Is it because
>> it is no longer needed, with shorter (mutex) waiting time?
>>
> 
> Hi Lee,
> 
> My thought was that the main reason for cond_resched here was to give a
> chance for other iscsi maintenance operations to run.  But, since
> conn_mutex, differently from rx_queue_mutex, won't stop most of the
> operations, it felt unnecessary.  If you disagree, I can submit a v2
> that doesn't change it.
> 
> 

As far as I know, use of cond_resched() seems semi-random. I suspected
your answer would be what you said. I'm fine with that. If we see iscsi
transmit threads start to hog CPU time we can always revisit this choice.

> 
>>>  
>>> @@ -2843,8 +2862,11 @@ iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
>>>  	spin_unlock_irqrestore(&connlock, flags);
>>>  
>>>  	ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
>>> +
>>> +	mutex_lock(&conn_mutex);
>>>  	if (transport->destroy_conn)
>>>  		transport->destroy_conn(conn);
>>> +	mutex_unlock(&conn_mutex);
>>>  
>>>  	return 0;
>>>  }
>>> @@ -3686,9 +3708,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>>>  			break;
>>>  		}
>>>  
>>> +		mutex_lock(&conn_mutex);
>>>  		ev->r.retcode =	transport->bind_conn(session, conn,
>>>  						ev->u.b_conn.transport_eph,
>>>  						ev->u.b_conn.is_leading);
>>> +		mutex_unlock(&conn_mutex);
>>> +
>>>  		if (ev->r.retcode || !transport->ep_connect)
>>>  			break;
>>>  
>>> @@ -3709,25 +3734,31 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>>>  		break;
>>>  	case ISCSI_UEVENT_START_CONN:
>>>  		conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
>>> -		if (conn)
>>> +		if (conn) {
>>> +			mutex_lock(&conn_mutex);
>>>  			ev->r.retcode = transport->start_conn(conn);
>>> +			mutex_unlock(&conn_mutex);
>>> +		}
>>>  		else
>>>  			err = -EINVAL;
>>>  		break;
>>>  	case ISCSI_UEVENT_STOP_CONN:
>>>  		conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
>>>  		if (conn)
>>> -			transport->stop_conn(conn, ev->u.stop_conn.flag);
>>> +			iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
>>>  		else
>>>  			err = -EINVAL;
>>>  		break;
>>>  	case ISCSI_UEVENT_SEND_PDU:
>>>  		conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
>>> -		if (conn)
>>> +		if (conn) {
>>> +			mutex_lock(&conn_mutex);
>>>  			ev->r.retcode =	transport->send_pdu(conn,
>>>  				(struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
>>>  				(char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
>>>  				ev->u.send_pdu.data_size);
>>> +			mutex_unlock(&conn_mutex);
>>> +		}
>>>  		else
>>>  			err = -EINVAL;
>>>  		break;
>>>
>>
>> My question above is for my own information, so I'll still say:
>>
>> Reviewed-by: Lee Duncan <lduncan@suse.com>
>
Martin K. Petersen May 20, 2020, 1:12 a.m. UTC | #5
Gabriel,

> iscsi suffers from a deadlock in case a management command submitted
> via the netlink socket sleeps on an allocation while holding the
> rx_queue_mutex,

This does not apply to 5.8/scsi-queue. Please resubmit.

Thanks!
Gabriel Krisman Bertazi May 20, 2020, 1:37 a.m. UTC | #6
"Martin K. Petersen" <martin.petersen@oracle.com> writes:

> Gabriel,
>
>> iscsi suffers from a deadlock in case a management command submitted
>> via the netlink socket sleeps on an allocation while holding the
>> rx_queue_mutex,
>
> This does not apply to 5.8/scsi-queue. Please resubmit.

Hi Martin,

Apparently it conflicted with my own code coming from -next.  Sorry.  I
will resubmit as soon as I finish testing it.
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 17a45716a0fe..d99c17306dff 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -1616,6 +1616,12 @@  static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
 static struct sock *nls;
 static DEFINE_MUTEX(rx_queue_mutex);
 
+/*
+ * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
+ * against the kernel stop_connection recovery mechanism
+ */
+static DEFINE_MUTEX(conn_mutex);
+
 static LIST_HEAD(sesslist);
 static LIST_HEAD(sessdestroylist);
 static DEFINE_SPINLOCK(sesslock);
@@ -2442,6 +2448,32 @@  int iscsi_offload_mesg(struct Scsi_Host *shost,
 }
 EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
 
+/*
+ * This can be called without the rx_queue_mutex, if invoked by the kernel
+ * stop work. But, in that case, it is guaranteed not to race with
+ * iscsi_destroy by conn_mutex.
+ */
+static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
+{
+	/*
+	 * It is important that this path doesn't rely on
+	 * rx_queue_mutex, otherwise, a thread doing allocation on a
+	 * start_session/start_connection could sleep waiting on a
+	 * writeback to a failed iscsi device, that cannot be recovered
+	 * because the lock is held.  If we don't hold it here, the
+	 * kernel stop_conn_work_fn has a chance to stop the broken
+	 * session and resolve the allocation.
+	 *
+	 * Still, the user invoked .stop_conn() needs to be serialized
+	 * with stop_conn_work_fn by a private mutex.  Not pretty, but
+	 * it works.
+	 */
+	mutex_lock(&conn_mutex);
+	conn->transport->stop_conn(conn, flag);
+	mutex_unlock(&conn_mutex);
+
+}
+
 static void stop_conn_work_fn(struct work_struct *work)
 {
 	struct iscsi_cls_conn *conn, *tmp;
@@ -2460,30 +2492,17 @@  static void stop_conn_work_fn(struct work_struct *work)
 		uint32_t sid = iscsi_conn_get_sid(conn);
 		struct iscsi_cls_session *session;
 
-		mutex_lock(&rx_queue_mutex);
-
 		session = iscsi_session_lookup(sid);
 		if (session) {
 			if (system_state != SYSTEM_RUNNING) {
 				session->recovery_tmo = 0;
-				conn->transport->stop_conn(conn,
-							   STOP_CONN_TERM);
+				iscsi_if_stop_conn(conn, STOP_CONN_TERM);
 			} else {
-				conn->transport->stop_conn(conn,
-							   STOP_CONN_RECOVER);
+				iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
 			}
 		}
 
 		list_del_init(&conn->conn_list_err);
-
-		mutex_unlock(&rx_queue_mutex);
-
-		/* we don't want to hold rx_queue_mutex for too long,
-		 * for instance if many conns failed at the same time,
-		 * since this stall other iscsi maintenance operations.
-		 * Give other users a chance to proceed.
-		 */
-		cond_resched();
 	}
 }
 
@@ -2843,8 +2862,11 @@  iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
 	spin_unlock_irqrestore(&connlock, flags);
 
 	ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
+
+	mutex_lock(&conn_mutex);
 	if (transport->destroy_conn)
 		transport->destroy_conn(conn);
+	mutex_unlock(&conn_mutex);
 
 	return 0;
 }
@@ -3686,9 +3708,12 @@  iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 			break;
 		}
 
+		mutex_lock(&conn_mutex);
 		ev->r.retcode =	transport->bind_conn(session, conn,
 						ev->u.b_conn.transport_eph,
 						ev->u.b_conn.is_leading);
+		mutex_unlock(&conn_mutex);
+
 		if (ev->r.retcode || !transport->ep_connect)
 			break;
 
@@ -3709,25 +3734,31 @@  iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 		break;
 	case ISCSI_UEVENT_START_CONN:
 		conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
-		if (conn)
+		if (conn) {
+			mutex_lock(&conn_mutex);
 			ev->r.retcode = transport->start_conn(conn);
+			mutex_unlock(&conn_mutex);
+		}
 		else
 			err = -EINVAL;
 		break;
 	case ISCSI_UEVENT_STOP_CONN:
 		conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
 		if (conn)
-			transport->stop_conn(conn, ev->u.stop_conn.flag);
+			iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
 		else
 			err = -EINVAL;
 		break;
 	case ISCSI_UEVENT_SEND_PDU:
 		conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
-		if (conn)
+		if (conn) {
+			mutex_lock(&conn_mutex);
 			ev->r.retcode =	transport->send_pdu(conn,
 				(struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
 				(char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
 				ev->u.send_pdu.data_size);
+			mutex_unlock(&conn_mutex);
+		}
 		else
 			err = -EINVAL;
 		break;