[RFC,v1,1/1] vfio: Fix WARNING "do not call blocking ops when !TASK_RUNNING"
diff mbox series

Message ID 881262243b5a6528f26cf24e383ca4f30f79878b.1554146679.git.alifm@linux.ibm.com
State New
Headers show
Series
  • [RFC,v1,1/1] vfio: Fix WARNING "do not call blocking ops when !TASK_RUNNING"
Related show

Commit Message

Farhan Ali April 1, 2019, 7:30 p.m. UTC
vfio_dev_present() which is the condition to
wait_event_interruptible_timeout(), will call vfio_group_get_device
and try to acquire the mutex group->device_lock.

wait_event_interruptible_timeout() will set the state of the current
task to TASK_INTERRUPTIBLE, before doing the condition check. This
means that we will try to accquire the mutex while already in a
sleeping state. The scheduler warns us by giving the following
warning:

[ 4050.264464] ------------[ cut here ]------------
[ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188
[ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90
....

 4050.264756] Call Trace:
[ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90)
[ 4050.264774]  [<0000000000b97edc>] __mutex_lock+0x44/0x8c0
[ 4050.264782]  [<0000000000b9878a>] mutex_lock_nested+0x32/0x40
[ 4050.264793]  [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio]
[ 4050.264803]  [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio]
[ 4050.264813]  [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev]
[ 4050.264825]  [<00000000008e01b0>] device_release_driver_internal+0x168/0x268
[ 4050.264834]  [<00000000008de692>] bus_remove_device+0x162/0x190
[ 4050.264843]  [<00000000008daf42>] device_del+0x1e2/0x368
[ 4050.264851]  [<00000000008db12c>] device_unregister+0x64/0x88
[ 4050.264862]  [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev]
[ 4050.264872]  [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev]
[ 4050.264881]  [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8
[ 4050.264890]  [<00000000003c1530>] __vfs_write+0x38/0x1a8
[ 4050.264899]  [<00000000003c187c>] vfs_write+0xb4/0x198
[ 4050.264908]  [<00000000003c1af2>] ksys_write+0x5a/0xb0
[ 4050.264916]  [<0000000000b9e270>] system_call+0xdc/0x2d8
[ 4050.264925] 4 locks held by sh/35924:
[ 4050.264933]  #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198
[ 4050.264948]  #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8
[ 4050.264963]  #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150
[ 4050.264979]  #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268
[ 4050.264993] Last Breaking-Event-Address:
[ 4050.265002]  [<000000000017bbaa>] __might_sleep+0x72/0x90
[ 4050.265010] irq event stamp: 7039
[ 4050.265020] hardirqs last  enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740
[ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740
[ 4050.265040] softirqs last  enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100
[ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100
[ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]---

Let's fix this as described in the article https://lwn.net/Articles/628628/.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---

This fixes the warning for me, but would appreciate feedback
and some more testing with this patch.

Thanks
Farhan


 drivers/vfio/vfio.c | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

Comments

Alex Williamson April 2, 2019, 6:50 p.m. UTC | #1
On Mon,  1 Apr 2019 15:30:27 -0400
Farhan Ali <alifm@linux.ibm.com> wrote:

> vfio_dev_present() which is the condition to
> wait_event_interruptible_timeout(), will call vfio_group_get_device
> and try to acquire the mutex group->device_lock.
> 
> wait_event_interruptible_timeout() will set the state of the current
> task to TASK_INTERRUPTIBLE, before doing the condition check. This
> means that we will try to accquire the mutex while already in a
> sleeping state. The scheduler warns us by giving the following
> warning:
> 
> [ 4050.264464] ------------[ cut here ]------------
> [ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188
> [ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90
> ....
> 
>  4050.264756] Call Trace:
> [ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90)
> [ 4050.264774]  [<0000000000b97edc>] __mutex_lock+0x44/0x8c0
> [ 4050.264782]  [<0000000000b9878a>] mutex_lock_nested+0x32/0x40
> [ 4050.264793]  [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio]
> [ 4050.264803]  [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio]
> [ 4050.264813]  [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev]
> [ 4050.264825]  [<00000000008e01b0>] device_release_driver_internal+0x168/0x268
> [ 4050.264834]  [<00000000008de692>] bus_remove_device+0x162/0x190
> [ 4050.264843]  [<00000000008daf42>] device_del+0x1e2/0x368
> [ 4050.264851]  [<00000000008db12c>] device_unregister+0x64/0x88
> [ 4050.264862]  [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev]
> [ 4050.264872]  [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev]
> [ 4050.264881]  [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8
> [ 4050.264890]  [<00000000003c1530>] __vfs_write+0x38/0x1a8
> [ 4050.264899]  [<00000000003c187c>] vfs_write+0xb4/0x198
> [ 4050.264908]  [<00000000003c1af2>] ksys_write+0x5a/0xb0
> [ 4050.264916]  [<0000000000b9e270>] system_call+0xdc/0x2d8
> [ 4050.264925] 4 locks held by sh/35924:
> [ 4050.264933]  #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198
> [ 4050.264948]  #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8
> [ 4050.264963]  #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150
> [ 4050.264979]  #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268
> [ 4050.264993] Last Breaking-Event-Address:
> [ 4050.265002]  [<000000000017bbaa>] __might_sleep+0x72/0x90
> [ 4050.265010] irq event stamp: 7039
> [ 4050.265020] hardirqs last  enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740
> [ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740
> [ 4050.265040] softirqs last  enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100
> [ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100
> [ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]---
> 
> Let's fix this as described in the article https://lwn.net/Articles/628628/.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
> 
> This fixes the warning for me, but would appreciate feedback
> and some more testing with this patch.

Hi Farhan, thanks for the patch.  I was able to reproduce the original
issue with vfio-pci by enabling CONFIG_DEBUG_ATOMIC_SLEEP=y in my
kernel config and attempting to remove the device in the host while
assigned to a guest.  The patch below resolves that warning, but
introduces other poor behavior which you might not notice on an mdev
device since vfio_mdev_dev_ops doesn't implement a request callback...

>  drivers/vfio/vfio.c | 28 ++++++++++------------------
>  1 file changed, 10 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index a3030cd..8a696f0 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -922,13 +922,12 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
>   * removed.  Open file descriptors for the device... */
>  void *vfio_del_group_dev(struct device *dev)
>  {
> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
>  	struct vfio_device *device = dev_get_drvdata(dev);
>  	struct vfio_group *group = device->group;
>  	void *device_data = device->device_data;
>  	struct vfio_unbound_dev *unbound;
>  	unsigned int i = 0;
> -	long ret;
> -	bool interrupted = false;
>  
>  	/*
>  	 * The group exists so long as we have a device reference.  Get
> @@ -964,6 +963,8 @@ void *vfio_del_group_dev(struct device *dev)
>  	 * interval with counter to allow the driver to take escalating
>  	 * measures to release the device if it has the ability to do so.
>  	 */
> +	add_wait_queue(&vfio.release_q, &wait);
> +
>  	do {
>  		device = vfio_group_get_device(group, dev);
>  		if (!device)
> @@ -974,23 +975,14 @@ void *vfio_del_group_dev(struct device *dev)
>  
>  		vfio_device_put(device);
>  
> -		if (interrupted) {
> -			ret = wait_event_timeout(vfio.release_q,
> -					!vfio_dev_present(group, dev), HZ * 10);
> -		} else {
> -			ret = wait_event_interruptible_timeout(vfio.release_q,
> -					!vfio_dev_present(group, dev), HZ * 10);
> -			if (ret == -ERESTARTSYS) {
> -				interrupted = true;
> -				dev_warn(dev,
> -					 "Device is currently in use, task"
> -					 " \"%s\" (%d) "
> -					 "blocked until device is released",
> -					 current->comm, task_pid_nr(current));
> -			}
> -		}
> -	} while (ret <= 0);
> +		if (!vfio_dev_present(group, dev))
> +			break;
> +
> +		wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);

When this gets interrupted, the interrupting signal is not cleared, so
this will return immediately on the next loop.  We go from casually
bumping the driver to release the device and checking every 10s to
burning a full core until the device is released.  IIRC, this is
partially why the old code switched from the interruptible timeout
version of wait_event to simply the timeout version (also I thought it
useful to acknowledge the first attempt to interrupt the blocked
process to indicate we're not deadlocked, we're waiting for the
device).  Ideally we'd keep this same behavior, knowing if we've been
interrupted to leave a breadcrumb in the log, but we at least need to
figure out how to consume or block the signal that generated the
interruption so we can go back to waiting gracefully.  Thanks,

Alex

> +
> +	} while (1);
>  
> +	remove_wait_queue(&vfio.release_q, &wait);
>  	/*
>  	 * In order to support multiple devices per group, devices can be
>  	 * plucked from the group while other devices in the group are still
Farhan Ali April 2, 2019, 9:27 p.m. UTC | #2
On 04/02/2019 02:50 PM, Alex Williamson wrote:
> On Mon,  1 Apr 2019 15:30:27 -0400
> Farhan Ali <alifm@linux.ibm.com> wrote:
> 
>> vfio_dev_present() which is the condition to
>> wait_event_interruptible_timeout(), will call vfio_group_get_device
>> and try to acquire the mutex group->device_lock.
>>
>> wait_event_interruptible_timeout() will set the state of the current
>> task to TASK_INTERRUPTIBLE, before doing the condition check. This
>> means that we will try to accquire the mutex while already in a
>> sleeping state. The scheduler warns us by giving the following
>> warning:
>>
>> [ 4050.264464] ------------[ cut here ]------------
>> [ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188
>> [ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90
>> ....
>>
>>   4050.264756] Call Trace:
>> [ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90)
>> [ 4050.264774]  [<0000000000b97edc>] __mutex_lock+0x44/0x8c0
>> [ 4050.264782]  [<0000000000b9878a>] mutex_lock_nested+0x32/0x40
>> [ 4050.264793]  [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio]
>> [ 4050.264803]  [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio]
>> [ 4050.264813]  [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev]
>> [ 4050.264825]  [<00000000008e01b0>] device_release_driver_internal+0x168/0x268
>> [ 4050.264834]  [<00000000008de692>] bus_remove_device+0x162/0x190
>> [ 4050.264843]  [<00000000008daf42>] device_del+0x1e2/0x368
>> [ 4050.264851]  [<00000000008db12c>] device_unregister+0x64/0x88
>> [ 4050.264862]  [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev]
>> [ 4050.264872]  [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev]
>> [ 4050.264881]  [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8
>> [ 4050.264890]  [<00000000003c1530>] __vfs_write+0x38/0x1a8
>> [ 4050.264899]  [<00000000003c187c>] vfs_write+0xb4/0x198
>> [ 4050.264908]  [<00000000003c1af2>] ksys_write+0x5a/0xb0
>> [ 4050.264916]  [<0000000000b9e270>] system_call+0xdc/0x2d8
>> [ 4050.264925] 4 locks held by sh/35924:
>> [ 4050.264933]  #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198
>> [ 4050.264948]  #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8
>> [ 4050.264963]  #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150
>> [ 4050.264979]  #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268
>> [ 4050.264993] Last Breaking-Event-Address:
>> [ 4050.265002]  [<000000000017bbaa>] __might_sleep+0x72/0x90
>> [ 4050.265010] irq event stamp: 7039
>> [ 4050.265020] hardirqs last  enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740
>> [ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740
>> [ 4050.265040] softirqs last  enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100
>> [ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100
>> [ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]---
>>
>> Let's fix this as described in the article https://lwn.net/Articles/628628/.
>>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>
>> This fixes the warning for me, but would appreciate feedback
>> and some more testing with this patch.
> 
> Hi Farhan, thanks for the patch.  I was able to reproduce the original
> issue with vfio-pci by enabling CONFIG_DEBUG_ATOMIC_SLEEP=y in my
> kernel config and attempting to remove the device in the host while
> assigned to a guest.  The patch below resolves that warning, but
> introduces other poor behavior which you might not notice on an mdev
> device since vfio_mdev_dev_ops doesn't implement a request callback...

Hi Alex,

Thanks for your response and testing the patch.

> 
>>   drivers/vfio/vfio.c | 28 ++++++++++------------------
>>   1 file changed, 10 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index a3030cd..8a696f0 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -922,13 +922,12 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
>>    * removed.  Open file descriptors for the device... */
>>   void *vfio_del_group_dev(struct device *dev)
>>   {
>> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
>>   	struct vfio_device *device = dev_get_drvdata(dev);
>>   	struct vfio_group *group = device->group;
>>   	void *device_data = device->device_data;
>>   	struct vfio_unbound_dev *unbound;
>>   	unsigned int i = 0;
>> -	long ret;
>> -	bool interrupted = false;
>>   
>>   	/*
>>   	 * The group exists so long as we have a device reference.  Get
>> @@ -964,6 +963,8 @@ void *vfio_del_group_dev(struct device *dev)
>>   	 * interval with counter to allow the driver to take escalating
>>   	 * measures to release the device if it has the ability to do so.
>>   	 */
>> +	add_wait_queue(&vfio.release_q, &wait);
>> +
>>   	do {
>>   		device = vfio_group_get_device(group, dev);
>>   		if (!device)
>> @@ -974,23 +975,14 @@ void *vfio_del_group_dev(struct device *dev)
>>   
>>   		vfio_device_put(device);
>>   
>> -		if (interrupted) {
>> -			ret = wait_event_timeout(vfio.release_q,
>> -					!vfio_dev_present(group, dev), HZ * 10);
>> -		} else {
>> -			ret = wait_event_interruptible_timeout(vfio.release_q,
>> -					!vfio_dev_present(group, dev), HZ * 10);
>> -			if (ret == -ERESTARTSYS) {
>> -				interrupted = true;
>> -				dev_warn(dev,
>> -					 "Device is currently in use, task"
>> -					 " \"%s\" (%d) "
>> -					 "blocked until device is released",
>> -					 current->comm, task_pid_nr(current));
>> -			}
>> -		}
>> -	} while (ret <= 0);
>> +		if (!vfio_dev_present(group, dev))
>> +			break;
>> +
>> +		wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
> 
> When this gets interrupted, the interrupting signal is not cleared, so
> this will return immediately on the next loop.  We go from casually
> bumping the driver to release the device and checking every 10s to
> burning a full core until the device is released.  IIRC, this is
> partially why the old code switched from the interruptible timeout
> version of wait_event to simply the timeout version (also I thought it
> useful to acknowledge the first attempt to interrupt the blocked
> process to indicate we're not deadlocked, we're waiting for the
> device).  Ideally we'd keep this same behavior, knowing if we've been
> interrupted to leave a breadcrumb in the log, but we at least need to
> figure out how to consume or block the signal that generated the
> interruption so we can go back to waiting gracefully.  Thanks,
> 
> Alex

Hrmm, we could call the wait_woken to set task state to 
TASK_UNINTERRUPTIBLE, which would prevent us from getting interrupted.

Something like the snippet below:

if (interrupted) {

   wait_woken(&wait, TASK_UNINTERRUPTIBLE, HZ * 10);
} else {
   wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
   if (signal_pending(current)) {
       interrupted = true;
   }
}

what do you think?

Thanks
Farhan


> 
>> +
>> +	} while (1);
>>   
>> +	remove_wait_queue(&vfio.release_q, &wait);
>>   	/*
>>   	 * In order to support multiple devices per group, devices can be
>>   	 * plucked from the group while other devices in the group are still
> 
>
Alex Williamson April 2, 2019, 11:09 p.m. UTC | #3
On Tue, 2 Apr 2019 17:27:49 -0400
Farhan Ali <alifm@linux.ibm.com> wrote:

> On 04/02/2019 02:50 PM, Alex Williamson wrote:
> > On Mon,  1 Apr 2019 15:30:27 -0400
> > Farhan Ali <alifm@linux.ibm.com> wrote:
> >   
> >> vfio_dev_present() which is the condition to
> >> wait_event_interruptible_timeout(), will call vfio_group_get_device
> >> and try to acquire the mutex group->device_lock.
> >>
> >> wait_event_interruptible_timeout() will set the state of the current
> >> task to TASK_INTERRUPTIBLE, before doing the condition check. This
> >> means that we will try to accquire the mutex while already in a
> >> sleeping state. The scheduler warns us by giving the following
> >> warning:
> >>
> >> [ 4050.264464] ------------[ cut here ]------------
> >> [ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188
> >> [ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90
> >> ....
> >>
> >>   4050.264756] Call Trace:
> >> [ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90)
> >> [ 4050.264774]  [<0000000000b97edc>] __mutex_lock+0x44/0x8c0
> >> [ 4050.264782]  [<0000000000b9878a>] mutex_lock_nested+0x32/0x40
> >> [ 4050.264793]  [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio]
> >> [ 4050.264803]  [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio]
> >> [ 4050.264813]  [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev]
> >> [ 4050.264825]  [<00000000008e01b0>] device_release_driver_internal+0x168/0x268
> >> [ 4050.264834]  [<00000000008de692>] bus_remove_device+0x162/0x190
> >> [ 4050.264843]  [<00000000008daf42>] device_del+0x1e2/0x368
> >> [ 4050.264851]  [<00000000008db12c>] device_unregister+0x64/0x88
> >> [ 4050.264862]  [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev]
> >> [ 4050.264872]  [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev]
> >> [ 4050.264881]  [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8
> >> [ 4050.264890]  [<00000000003c1530>] __vfs_write+0x38/0x1a8
> >> [ 4050.264899]  [<00000000003c187c>] vfs_write+0xb4/0x198
> >> [ 4050.264908]  [<00000000003c1af2>] ksys_write+0x5a/0xb0
> >> [ 4050.264916]  [<0000000000b9e270>] system_call+0xdc/0x2d8
> >> [ 4050.264925] 4 locks held by sh/35924:
> >> [ 4050.264933]  #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198
> >> [ 4050.264948]  #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8
> >> [ 4050.264963]  #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150
> >> [ 4050.264979]  #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268
> >> [ 4050.264993] Last Breaking-Event-Address:
> >> [ 4050.265002]  [<000000000017bbaa>] __might_sleep+0x72/0x90
> >> [ 4050.265010] irq event stamp: 7039
> >> [ 4050.265020] hardirqs last  enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740
> >> [ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740
> >> [ 4050.265040] softirqs last  enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100
> >> [ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100
> >> [ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]---
> >>
> >> Let's fix this as described in the article https://lwn.net/Articles/628628/.
> >>
> >> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >> ---
> >>
> >> This fixes the warning for me, but would appreciate feedback
> >> and some more testing with this patch.  
> > 
> > Hi Farhan, thanks for the patch.  I was able to reproduce the original
> > issue with vfio-pci by enabling CONFIG_DEBUG_ATOMIC_SLEEP=y in my
> > kernel config and attempting to remove the device in the host while
> > assigned to a guest.  The patch below resolves that warning, but
> > introduces other poor behavior which you might not notice on an mdev
> > device since vfio_mdev_dev_ops doesn't implement a request callback...  
> 
> Hi Alex,
> 
> Thanks for your response and testing the patch.
> 
> >   
> >>   drivers/vfio/vfio.c | 28 ++++++++++------------------
> >>   1 file changed, 10 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index a3030cd..8a696f0 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -922,13 +922,12 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
> >>    * removed.  Open file descriptors for the device... */
> >>   void *vfio_del_group_dev(struct device *dev)
> >>   {
> >> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
> >>   	struct vfio_device *device = dev_get_drvdata(dev);
> >>   	struct vfio_group *group = device->group;
> >>   	void *device_data = device->device_data;
> >>   	struct vfio_unbound_dev *unbound;
> >>   	unsigned int i = 0;
> >> -	long ret;
> >> -	bool interrupted = false;
> >>   
> >>   	/*
> >>   	 * The group exists so long as we have a device reference.  Get
> >> @@ -964,6 +963,8 @@ void *vfio_del_group_dev(struct device *dev)
> >>   	 * interval with counter to allow the driver to take escalating
> >>   	 * measures to release the device if it has the ability to do so.
> >>   	 */
> >> +	add_wait_queue(&vfio.release_q, &wait);
> >> +
> >>   	do {
> >>   		device = vfio_group_get_device(group, dev);
> >>   		if (!device)
> >> @@ -974,23 +975,14 @@ void *vfio_del_group_dev(struct device *dev)
> >>   
> >>   		vfio_device_put(device);
> >>   
> >> -		if (interrupted) {
> >> -			ret = wait_event_timeout(vfio.release_q,
> >> -					!vfio_dev_present(group, dev), HZ * 10);
> >> -		} else {
> >> -			ret = wait_event_interruptible_timeout(vfio.release_q,
> >> -					!vfio_dev_present(group, dev), HZ * 10);
> >> -			if (ret == -ERESTARTSYS) {
> >> -				interrupted = true;
> >> -				dev_warn(dev,
> >> -					 "Device is currently in use, task"
> >> -					 " \"%s\" (%d) "
> >> -					 "blocked until device is released",
> >> -					 current->comm, task_pid_nr(current));
> >> -			}
> >> -		}
> >> -	} while (ret <= 0);
> >> +		if (!vfio_dev_present(group, dev))
> >> +			break;
> >> +
> >> +		wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);  
> > 
> > When this gets interrupted, the interrupting signal is not cleared, so
> > this will return immediately on the next loop.  We go from casually
> > bumping the driver to release the device and checking every 10s to
> > burning a full core until the device is released.  IIRC, this is
> > partially why the old code switched from the interruptible timeout
> > version of wait_event to simply the timeout version (also I thought it
> > useful to acknowledge the first attempt to interrupt the blocked
> > process to indicate we're not deadlocked, we're waiting for the
> > device).  Ideally we'd keep this same behavior, knowing if we've been
> > interrupted to leave a breadcrumb in the log, but we at least need to
> > figure out how to consume or block the signal that generated the
> > interruption so we can go back to waiting gracefully.  Thanks,
> > 
> > Alex  
> 
> Hrmm, we could call the wait_woken to set task state to 
> TASK_UNINTERRUPTIBLE, which would prevent us from getting interrupted.
> 
> Something like the snippet below:
> 
> if (interrupted) {
> 
>    wait_woken(&wait, TASK_UNINTERRUPTIBLE, HZ * 10);
> } else {
>    wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
>    if (signal_pending(current)) {
>        interrupted = true;
>    }
> }
> 
> what do you think?

Yeah, that could work.  Thanks,

Alex
Farhan Ali April 2, 2019, 11:22 p.m. UTC | #4
On 04/02/2019 07:09 PM, Alex Williamson wrote:
> On Tue, 2 Apr 2019 17:27:49 -0400
> Farhan Ali <alifm@linux.ibm.com> wrote:
> 
>> On 04/02/2019 02:50 PM, Alex Williamson wrote:
>>> On Mon,  1 Apr 2019 15:30:27 -0400
>>> Farhan Ali <alifm@linux.ibm.com> wrote:
>>>    
>>>> vfio_dev_present() which is the condition to
>>>> wait_event_interruptible_timeout(), will call vfio_group_get_device
>>>> and try to acquire the mutex group->device_lock.
>>>>
>>>> wait_event_interruptible_timeout() will set the state of the current
>>>> task to TASK_INTERRUPTIBLE, before doing the condition check. This
>>>> means that we will try to accquire the mutex while already in a
>>>> sleeping state. The scheduler warns us by giving the following
>>>> warning:
>>>>
>>>> [ 4050.264464] ------------[ cut here ]------------
>>>> [ 4050.264508] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000b33c00e2>] prepare_to_wait_event+0x14a/0x188
>>>> [ 4050.264529] WARNING: CPU: 12 PID: 35924 at kernel/sched/core.c:6112 __might_sleep+0x76/0x90
>>>> ....
>>>>
>>>>    4050.264756] Call Trace:
>>>> [ 4050.264765] ([<000000000017bbaa>] __might_sleep+0x72/0x90)
>>>> [ 4050.264774]  [<0000000000b97edc>] __mutex_lock+0x44/0x8c0
>>>> [ 4050.264782]  [<0000000000b9878a>] mutex_lock_nested+0x32/0x40
>>>> [ 4050.264793]  [<000003ff800d7abe>] vfio_group_get_device+0x36/0xa8 [vfio]
>>>> [ 4050.264803]  [<000003ff800d87c0>] vfio_del_group_dev+0x238/0x378 [vfio]
>>>> [ 4050.264813]  [<000003ff8015f67c>] mdev_remove+0x3c/0x68 [mdev]
>>>> [ 4050.264825]  [<00000000008e01b0>] device_release_driver_internal+0x168/0x268
>>>> [ 4050.264834]  [<00000000008de692>] bus_remove_device+0x162/0x190
>>>> [ 4050.264843]  [<00000000008daf42>] device_del+0x1e2/0x368
>>>> [ 4050.264851]  [<00000000008db12c>] device_unregister+0x64/0x88
>>>> [ 4050.264862]  [<000003ff8015ed84>] mdev_device_remove+0xec/0x130 [mdev]
>>>> [ 4050.264872]  [<000003ff8015f074>] remove_store+0x6c/0xa8 [mdev]
>>>> [ 4050.264881]  [<000000000046f494>] kernfs_fop_write+0x14c/0x1f8
>>>> [ 4050.264890]  [<00000000003c1530>] __vfs_write+0x38/0x1a8
>>>> [ 4050.264899]  [<00000000003c187c>] vfs_write+0xb4/0x198
>>>> [ 4050.264908]  [<00000000003c1af2>] ksys_write+0x5a/0xb0
>>>> [ 4050.264916]  [<0000000000b9e270>] system_call+0xdc/0x2d8
>>>> [ 4050.264925] 4 locks held by sh/35924:
>>>> [ 4050.264933]  #0: 000000001ef90325 (sb_writers#4){.+.+}, at: vfs_write+0x9e/0x198
>>>> [ 4050.264948]  #1: 000000005c1ab0b3 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cc/0x1f8
>>>> [ 4050.264963]  #2: 0000000034831ab8 (kn->count#297){++++}, at: kernfs_remove_self+0x12e/0x150
>>>> [ 4050.264979]  #3: 00000000e152484f (&dev->mutex){....}, at: device_release_driver_internal+0x5c/0x268
>>>> [ 4050.264993] Last Breaking-Event-Address:
>>>> [ 4050.265002]  [<000000000017bbaa>] __might_sleep+0x72/0x90
>>>> [ 4050.265010] irq event stamp: 7039
>>>> [ 4050.265020] hardirqs last  enabled at (7047): [<00000000001cee7a>] console_unlock+0x6d2/0x740
>>>> [ 4050.265029] hardirqs last disabled at (7054): [<00000000001ce87e>] console_unlock+0xd6/0x740
>>>> [ 4050.265040] softirqs last  enabled at (6416): [<0000000000b8fe26>] __udelay+0xb6/0x100
>>>> [ 4050.265049] softirqs last disabled at (6415): [<0000000000b8fe06>] __udelay+0x96/0x100
>>>> [ 4050.265057] ---[ end trace d04a07d39d99a9f9 ]---
>>>>
>>>> Let's fix this as described in the article https://lwn.net/Articles/628628/.
>>>>
>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>> ---
>>>>
>>>> This fixes the warning for me, but would appreciate feedback
>>>> and some more testing with this patch.
>>>
>>> Hi Farhan, thanks for the patch.  I was able to reproduce the original
>>> issue with vfio-pci by enabling CONFIG_DEBUG_ATOMIC_SLEEP=y in my
>>> kernel config and attempting to remove the device in the host while
>>> assigned to a guest.  The patch below resolves that warning, but
>>> introduces other poor behavior which you might not notice on an mdev
>>> device since vfio_mdev_dev_ops doesn't implement a request callback...
>>
>> Hi Alex,
>>
>> Thanks for your response and testing the patch.
>>
>>>    
>>>>    drivers/vfio/vfio.c | 28 ++++++++++------------------
>>>>    1 file changed, 10 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>> index a3030cd..8a696f0 100644
>>>> --- a/drivers/vfio/vfio.c
>>>> +++ b/drivers/vfio/vfio.c
>>>> @@ -922,13 +922,12 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
>>>>     * removed.  Open file descriptors for the device... */
>>>>    void *vfio_del_group_dev(struct device *dev)
>>>>    {
>>>> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
>>>>    	struct vfio_device *device = dev_get_drvdata(dev);
>>>>    	struct vfio_group *group = device->group;
>>>>    	void *device_data = device->device_data;
>>>>    	struct vfio_unbound_dev *unbound;
>>>>    	unsigned int i = 0;
>>>> -	long ret;
>>>> -	bool interrupted = false;
>>>>    
>>>>    	/*
>>>>    	 * The group exists so long as we have a device reference.  Get
>>>> @@ -964,6 +963,8 @@ void *vfio_del_group_dev(struct device *dev)
>>>>    	 * interval with counter to allow the driver to take escalating
>>>>    	 * measures to release the device if it has the ability to do so.
>>>>    	 */
>>>> +	add_wait_queue(&vfio.release_q, &wait);
>>>> +
>>>>    	do {
>>>>    		device = vfio_group_get_device(group, dev);
>>>>    		if (!device)
>>>> @@ -974,23 +975,14 @@ void *vfio_del_group_dev(struct device *dev)
>>>>    
>>>>    		vfio_device_put(device);
>>>>    
>>>> -		if (interrupted) {
>>>> -			ret = wait_event_timeout(vfio.release_q,
>>>> -					!vfio_dev_present(group, dev), HZ * 10);
>>>> -		} else {
>>>> -			ret = wait_event_interruptible_timeout(vfio.release_q,
>>>> -					!vfio_dev_present(group, dev), HZ * 10);
>>>> -			if (ret == -ERESTARTSYS) {
>>>> -				interrupted = true;
>>>> -				dev_warn(dev,
>>>> -					 "Device is currently in use, task"
>>>> -					 " \"%s\" (%d) "
>>>> -					 "blocked until device is released",
>>>> -					 current->comm, task_pid_nr(current));
>>>> -			}
>>>> -		}
>>>> -	} while (ret <= 0);
>>>> +		if (!vfio_dev_present(group, dev))
>>>> +			break;
>>>> +
>>>> +		wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
>>>
>>> When this gets interrupted, the interrupting signal is not cleared, so
>>> this will return immediately on the next loop.  We go from casually
>>> bumping the driver to release the device and checking every 10s to
>>> burning a full core until the device is released.  IIRC, this is
>>> partially why the old code switched from the interruptible timeout
>>> version of wait_event to simply the timeout version (also I thought it
>>> useful to acknowledge the first attempt to interrupt the blocked
>>> process to indicate we're not deadlocked, we're waiting for the
>>> device).  Ideally we'd keep this same behavior, knowing if we've been
>>> interrupted to leave a breadcrumb in the log, but we at least need to
>>> figure out how to consume or block the signal that generated the
>>> interruption so we can go back to waiting gracefully.  Thanks,
>>>
>>> Alex
>>
>> Hrmm, we could call the wait_woken to set task state to
>> TASK_UNINTERRUPTIBLE, which would prevent us from getting interrupted.
>>
>> Something like the snippet below:
>>
>> if (interrupted) {
>>
>>     wait_woken(&wait, TASK_UNINTERRUPTIBLE, HZ * 10);
>> } else {
>>     wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
>>     if (signal_pending(current)) {
>>         interrupted = true;
>>     }
>> }
>>
>> what do you think?
> 
> Yeah, that could work.  Thanks,
> 
> Alex
> 
> 

Great, I will send out a v2 soon. Thanks for your feedback.

Thanks
Farhan

Patch
diff mbox series

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index a3030cd..8a696f0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -922,13 +922,12 @@  static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
  * removed.  Open file descriptors for the device... */
 void *vfio_del_group_dev(struct device *dev)
 {
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	struct vfio_device *device = dev_get_drvdata(dev);
 	struct vfio_group *group = device->group;
 	void *device_data = device->device_data;
 	struct vfio_unbound_dev *unbound;
 	unsigned int i = 0;
-	long ret;
-	bool interrupted = false;
 
 	/*
 	 * The group exists so long as we have a device reference.  Get
@@ -964,6 +963,8 @@  void *vfio_del_group_dev(struct device *dev)
 	 * interval with counter to allow the driver to take escalating
 	 * measures to release the device if it has the ability to do so.
 	 */
+	add_wait_queue(&vfio.release_q, &wait);
+
 	do {
 		device = vfio_group_get_device(group, dev);
 		if (!device)
@@ -974,23 +975,14 @@  void *vfio_del_group_dev(struct device *dev)
 
 		vfio_device_put(device);
 
-		if (interrupted) {
-			ret = wait_event_timeout(vfio.release_q,
-					!vfio_dev_present(group, dev), HZ * 10);
-		} else {
-			ret = wait_event_interruptible_timeout(vfio.release_q,
-					!vfio_dev_present(group, dev), HZ * 10);
-			if (ret == -ERESTARTSYS) {
-				interrupted = true;
-				dev_warn(dev,
-					 "Device is currently in use, task"
-					 " \"%s\" (%d) "
-					 "blocked until device is released",
-					 current->comm, task_pid_nr(current));
-			}
-		}
-	} while (ret <= 0);
+		if (!vfio_dev_present(group, dev))
+			break;
+
+		wait_woken(&wait, TASK_INTERRUPTIBLE, HZ * 10);
+
+	} while (1);
 
+	remove_wait_queue(&vfio.release_q, &wait);
 	/*
 	 * In order to support multiple devices per group, devices can be
 	 * plucked from the group while other devices in the group are still