[RFC,v1,2/3] vfio-ccw: Prevent quiesce function going into an infinite loop
diff mbox series

Message ID d321a34443461c9d92e6e86800a0d468b619b6c5.1554222348.git.alifm@linux.ibm.com
State New
Headers show
Series
  • vfio-ccw fixes for kernel stacktraces
Related show

Commit Message

Farhan Ali April 2, 2019, 4:44 p.m. UTC
The quiesce function calls cio_cancel_halt_clear() and if we
get an -EBUSY we go into a loop where we:
	- wait for any interrupts
	- flush all I/O in the workqueue
	- retry cio_cancel_halt_clear

During the period where we are waiting for interrupts or
flushing all I/O, the channel subsystem could have completed
a halt/clear action and turned off the corresponding activity
control bits in the subchannel status word. This means the next
time we call cio_cancel_halt_clear(), we will again start by
calling cancel subchannel and so we can be stuck between calling
cancel and halt forever.

Rather than calling cio_cancel_halt_clear() immediately after
waiting, let's try to disable the subchannel. If we succeed in
disabling the subchannel then we know nothing else can happen
with the device.

Suggested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
---
 drivers/s390/cio/vfio_ccw_drv.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

Comments

Cornelia Huck April 3, 2019, 7:58 a.m. UTC | #1
On Tue,  2 Apr 2019 12:44:34 -0400
Farhan Ali <alifm@linux.ibm.com> wrote:

> The quiesce function calls cio_cancel_halt_clear() and if we
> get an -EBUSY we go into a loop where we:
> 	- wait for any interrupts
> 	- flush all I/O in the workqueue
> 	- retry cio_cancel_halt_clear
> 
> During the period where we are waiting for interrupts or
> flushing all I/O, the channel subsystem could have completed
> a halt/clear action and turned off the corresponding activity
> control bits in the subchannel status word. This means the next
> time we call cio_cancel_halt_clear(), we will again start by
> calling cancel subchannel and so we can be stuck between calling
> cancel and halt forever.
> 
> Rather than calling cio_cancel_halt_clear() immediately after
> waiting, let's try to disable the subchannel. If we succeed in
> disabling the subchannel then we know nothing else can happen
> with the device.

Hmm... I had to spend some time looking at the code and it seems to be
a bit confused (not your patch, the general logic).

Basically, we call the quiesce function for two reasons:
- The device is shutdown/removed. We don't want to do anything with the
  device afterwards, and especially want nobody to start anything new.
  The subchannel will be disabled, and stay like that.
- The mdev reset callback is invoked. When resetting, we basically
  disable the subchannel (any I/O of course needs to be done prior to
  that), and then enable it again.

In the first case, our goal is to disable the subchannel, and nothing
more will be done with it.

In the second case, we simply want to perform a reset operation; that
is, get the subchannel into a clean state. The disable/enable is just a
means to get there.

Looking at what the common I/O layer does with the cancel_halt_clear
operation, it moves the device into a special quiescing state so that
no new I/O will be started. And I think that's how it is intended to be
used: If we want to quiesce the subchannel prior to remove/shutdown,
fencing off any new I/O is what we want. That means nobody will submit
requests we're not aware of.

That leaves the reset case, in which disabling the subchannel is only a
means to reset the subchannel to a defined state. We definitely want to
accept new I/O requests once we're done with the disable/enable dance.
That leaves the question of what to do with requests that come in while
resetting: Reject them? Queue them for later? Is disable/enable even
the best solution here? We basically came up with it because we
couldn't think of anything else...

(And yes, I also find it confusing that the quiesce function not only
clears out pending I/O, but also disables the subchannel...)

> 
> Suggested-by: Eric Farman <farman@linux.ibm.com>
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> ---
>  drivers/s390/cio/vfio_ccw_drv.c | 28 ++++++++++++----------------
>  1 file changed, 12 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
> index 5aca475..f87757b 100644
> --- a/drivers/s390/cio/vfio_ccw_drv.c
> +++ b/drivers/s390/cio/vfio_ccw_drv.c
> @@ -43,26 +43,22 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch)
>  	if (ret != -EBUSY)
>  		goto out_unlock;
>  
> +	iretry = 255;
>  	do {
> -		iretry = 255;
>  
> -		ret = cio_cancel_halt_clear(sch, &iretry);
> -		while (ret == -EBUSY) {
> -			/*
> -			 * Flush all I/O and wait for
> -			 * cancel/halt/clear completion.
> -			 */
> -			private->completion = &completion;
> -			spin_unlock_irq(sch->lock);
> +		cio_cancel_halt_clear(sch, &iretry);

I think you still want to check the return code here...

> +		/*
> +		 * Flush all I/O and wait for
> +		 * cancel/halt/clear completion.
> +		 */
> +		private->completion = &completion;
> +		spin_unlock_irq(sch->lock);
>  
> -			wait_for_completion_timeout(&completion, 3*HZ);
> -
> -			private->completion = NULL;
> -			flush_workqueue(vfio_ccw_work_q);
> -			spin_lock_irq(sch->lock);
> -			ret = cio_cancel_halt_clear(sch, &iretry);
> -		};
> +		wait_for_completion_timeout(&completion, 3*HZ);

...because you don't want to wait for an interrupt that won't arrive
(e.g. when xsch was successful, or if the subchannel has become
non-operational in the meantime.)

>  
> +		private->completion = NULL;
> +		flush_workqueue(vfio_ccw_work_q);
> +		spin_lock_irq(sch->lock);
>  		ret = cio_disable_subchannel(sch);
>  	} while (ret == -EBUSY);
>  out_unlock:

Regardless of my comments above, this looks like an improvement over
what we have now. Have you been able to observe the issue in real life?
Farhan Ali April 3, 2019, 3:06 p.m. UTC | #2
On 04/03/2019 03:58 AM, Cornelia Huck wrote:
> On Tue,  2 Apr 2019 12:44:34 -0400
> Farhan Ali <alifm@linux.ibm.com> wrote:
> 
>> The quiesce function calls cio_cancel_halt_clear() and if we
>> get an -EBUSY we go into a loop where we:
>> 	- wait for any interrupts
>> 	- flush all I/O in the workqueue
>> 	- retry cio_cancel_halt_clear
>>
>> During the period where we are waiting for interrupts or
>> flushing all I/O, the channel subsystem could have completed
>> a halt/clear action and turned off the corresponding activity
>> control bits in the subchannel status word. This means the next
>> time we call cio_cancel_halt_clear(), we will again start by
>> calling cancel subchannel and so we can be stuck between calling
>> cancel and halt forever.
>>
>> Rather than calling cio_cancel_halt_clear() immediately after
>> waiting, let's try to disable the subchannel. If we succeed in
>> disabling the subchannel then we know nothing else can happen
>> with the device.
> 
> Hmm... I had to spend some time looking at the code and it seems to be
> a bit confused (not your patch, the general logic).
> 
> Basically, we call the quiesce function for two reasons:
> - The device is shutdown/removed. We don't want to do anything with the
>    device afterwards, and especially want nobody to start anything new.
>    The subchannel will be disabled, and stay like that.
> - The mdev reset callback is invoked. When resetting, we basically
>    disable the subchannel (any I/O of course needs to be done prior to
>    that), and then enable it again.
> 
> In the first case, our goal is to disable the subchannel, and nothing
> more will be done with it.
> 
> In the second case, we simply want to perform a reset operation; that
> is, get the subchannel into a clean state. The disable/enable is just a
> means to get there.
>  > Looking at what the common I/O layer does with the cancel_halt_clear
> operation, it moves the device into a special quiescing state so that
> no new I/O will be started. And I think that's how it is intended to be
> used: If we want to quiesce the subchannel prior to remove/shutdown,
> fencing off any new I/O is what we want. That means nobody will submit
> requests we're not aware of.

Agreed. Under normal conditions, if someone explicitly requested the 
removal of the device we can safely assume that they are no more I/O 
request sent from the userspace.

> 
> That leaves the reset case, in which disabling the subchannel is only a
> means to reset the subchannel to a defined state. We definitely want to
> accept new I/O requests once we're done with the disable/enable dance.
> That leaves the question of what to do with requests that come in while
> resetting: Reject them? Queue them for later? Is disable/enable even
> the best solution here? We basically came up with it because we
> couldn't think of anything else...
> 

So right now the reset callback is invoked through an ioctl. Would a 
sane userspace try to issue read/write request before the completion of 
reset? I am assuming that if userspace is requesting a device reset, it 
will at least wait for the reset to complete?


> (And yes, I also find it confusing that the quiesce function not only
> clears out pending I/O, but also disables the subchannel...)
> 
>>
>> Suggested-by: Eric Farman <farman@linux.ibm.com>
>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>> ---
>>   drivers/s390/cio/vfio_ccw_drv.c | 28 ++++++++++++----------------
>>   1 file changed, 12 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
>> index 5aca475..f87757b 100644
>> --- a/drivers/s390/cio/vfio_ccw_drv.c
>> +++ b/drivers/s390/cio/vfio_ccw_drv.c
>> @@ -43,26 +43,22 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch)
>>   	if (ret != -EBUSY)
>>   		goto out_unlock;
>>   
>> +	iretry = 255;
>>   	do {
>> -		iretry = 255;
>>   
>> -		ret = cio_cancel_halt_clear(sch, &iretry);
>> -		while (ret == -EBUSY) {
>> -			/*
>> -			 * Flush all I/O and wait for
>> -			 * cancel/halt/clear completion.
>> -			 */
>> -			private->completion = &completion;
>> -			spin_unlock_irq(sch->lock);
>> +		cio_cancel_halt_clear(sch, &iretry);
> 
> I think you still want to check the return code here...
> 
>> +		/*
>> +		 * Flush all I/O and wait for
>> +		 * cancel/halt/clear completion.
>> +		 */
>> +		private->completion = &completion;
>> +		spin_unlock_irq(sch->lock);
>>   
>> -			wait_for_completion_timeout(&completion, 3*HZ);
>> -
>> -			private->completion = NULL;
>> -			flush_workqueue(vfio_ccw_work_q);
>> -			spin_lock_irq(sch->lock);
>> -			ret = cio_cancel_halt_clear(sch, &iretry);
>> -		};
>> +		wait_for_completion_timeout(&completion, 3*HZ);
> 
> ...because you don't want to wait for an interrupt that won't arrive
> (e.g. when xsch was successful, or if the subchannel has become
> non-operational in the meantime.)
> 

Hmm, this makes sense. I think we probably could wrap the waiting within 
an if condition, similar to io_subchannel_quiesce()?

>>   
>> +		private->completion = NULL;
>> +		flush_workqueue(vfio_ccw_work_q);
>> +		spin_lock_irq(sch->lock);
>>   		ret = cio_disable_subchannel(sch);
>>   	} while (ret == -EBUSY);
>>   out_unlock:
> 
> Regardless of my comments above, this looks like an improvement over
> what we have now. Have you been able to observe the issue in real life?
> 
> 
Yes, I have observed this in my testing. I think if you try to remove 
the mediated device (echo 1 > /path/to/mdev/remove) while a guest is 
running I/O, there is a good chance you will enter in this infinite loop.

Thanks for taking a look at the patches.

Thanks
Farhan
Cornelia Huck April 4, 2019, 8:58 a.m. UTC | #3
On Wed, 3 Apr 2019 11:06:14 -0400
Farhan Ali <alifm@linux.ibm.com> wrote:

> On 04/03/2019 03:58 AM, Cornelia Huck wrote:
> > On Tue,  2 Apr 2019 12:44:34 -0400
> > Farhan Ali <alifm@linux.ibm.com> wrote:
> >   
> >> The quiesce function calls cio_cancel_halt_clear() and if we
> >> get an -EBUSY we go into a loop where we:
> >> 	- wait for any interrupts
> >> 	- flush all I/O in the workqueue
> >> 	- retry cio_cancel_halt_clear
> >>
> >> During the period where we are waiting for interrupts or
> >> flushing all I/O, the channel subsystem could have completed
> >> a halt/clear action and turned off the corresponding activity
> >> control bits in the subchannel status word. This means the next
> >> time we call cio_cancel_halt_clear(), we will again start by
> >> calling cancel subchannel and so we can be stuck between calling
> >> cancel and halt forever.
> >>
> >> Rather than calling cio_cancel_halt_clear() immediately after
> >> waiting, let's try to disable the subchannel. If we succeed in
> >> disabling the subchannel then we know nothing else can happen
> >> with the device.  
> > 
> > Hmm... I had to spend some time looking at the code and it seems to be
> > a bit confused (not your patch, the general logic).
> > 
> > Basically, we call the quiesce function for two reasons:
> > - The device is shutdown/removed. We don't want to do anything with the
> >    device afterwards, and especially want nobody to start anything new.
> >    The subchannel will be disabled, and stay like that.
> > - The mdev reset callback is invoked. When resetting, we basically
> >    disable the subchannel (any I/O of course needs to be done prior to
> >    that), and then enable it again.
> > 
> > In the first case, our goal is to disable the subchannel, and nothing
> > more will be done with it.
> > 
> > In the second case, we simply want to perform a reset operation; that
> > is, get the subchannel into a clean state. The disable/enable is just a
> > means to get there.  
> >  > Looking at what the common I/O layer does with the cancel_halt_clear  
> > operation, it moves the device into a special quiescing state so that
> > no new I/O will be started. And I think that's how it is intended to be
> > used: If we want to quiesce the subchannel prior to remove/shutdown,
> > fencing off any new I/O is what we want. That means nobody will submit
> > requests we're not aware of.  
> 
> Agreed. Under normal conditions, if someone explicitly requested the 
> removal of the device we can safely assume that they are no more I/O 
> request sent from the userspace.

Yes, and we would not lose any functionality if we actively fence this.

> 
> > 
> > That leaves the reset case, in which disabling the subchannel is only a
> > means to reset the subchannel to a defined state. We definitely want to
> > accept new I/O requests once we're done with the disable/enable dance.
> > That leaves the question of what to do with requests that come in while
> > resetting: Reject them? Queue them for later? Is disable/enable even
> > the best solution here? We basically came up with it because we
> > couldn't think of anything else...
> >   
> 
> So right now the reset callback is invoked through an ioctl. Would a 
> sane userspace try to issue read/write request before the completion of 
> reset? I am assuming that if userspace is requesting a device reset, it 
> will at least wait for the reset to complete?

It depends on the semantics of reset, of which I am not completely
clear TBH. It's probably ok to fence here as well, maybe with a
different error ("try again later" vs. "the device is going away
anyway" in the first case).

> 
> 
> > (And yes, I also find it confusing that the quiesce function not only
> > clears out pending I/O, but also disables the subchannel...)
> >   
> >>
> >> Suggested-by: Eric Farman <farman@linux.ibm.com>
> >> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >> ---
> >>   drivers/s390/cio/vfio_ccw_drv.c | 28 ++++++++++++----------------
> >>   1 file changed, 12 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
> >> index 5aca475..f87757b 100644
> >> --- a/drivers/s390/cio/vfio_ccw_drv.c
> >> +++ b/drivers/s390/cio/vfio_ccw_drv.c
> >> @@ -43,26 +43,22 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch)
> >>   	if (ret != -EBUSY)
> >>   		goto out_unlock;
> >>   
> >> +	iretry = 255;
> >>   	do {
> >> -		iretry = 255;
> >>   
> >> -		ret = cio_cancel_halt_clear(sch, &iretry);
> >> -		while (ret == -EBUSY) {
> >> -			/*
> >> -			 * Flush all I/O and wait for
> >> -			 * cancel/halt/clear completion.
> >> -			 */
> >> -			private->completion = &completion;
> >> -			spin_unlock_irq(sch->lock);
> >> +		cio_cancel_halt_clear(sch, &iretry);  
> > 
> > I think you still want to check the return code here...
> >   
> >> +		/*
> >> +		 * Flush all I/O and wait for
> >> +		 * cancel/halt/clear completion.
> >> +		 */
> >> +		private->completion = &completion;
> >> +		spin_unlock_irq(sch->lock);
> >>   
> >> -			wait_for_completion_timeout(&completion, 3*HZ);
> >> -
> >> -			private->completion = NULL;
> >> -			flush_workqueue(vfio_ccw_work_q);
> >> -			spin_lock_irq(sch->lock);
> >> -			ret = cio_cancel_halt_clear(sch, &iretry);
> >> -		};
> >> +		wait_for_completion_timeout(&completion, 3*HZ);  
> > 
> > ...because you don't want to wait for an interrupt that won't arrive
> > (e.g. when xsch was successful, or if the subchannel has become
> > non-operational in the meantime.)
> >   
> 
> Hmm, this makes sense. I think we probably could wrap the waiting within 
> an if condition, similar to io_subchannel_quiesce()?

Probably yes.

> 
> >>   
> >> +		private->completion = NULL;
> >> +		flush_workqueue(vfio_ccw_work_q);
> >> +		spin_lock_irq(sch->lock);
> >>   		ret = cio_disable_subchannel(sch);
> >>   	} while (ret == -EBUSY);
> >>   out_unlock:  
> > 
> > Regardless of my comments above, this looks like an improvement over
> > what we have now. Have you been able to observe the issue in real life?
> > 
> >   
> Yes, I have observed this in my testing. I think if you try to remove 
> the mediated device (echo 1 > /path/to/mdev/remove) while a guest is 
> running I/O, there is a good chance you will enter in this infinite loop.

Good to know, a testcase is always helpful :)

> 
> Thanks for taking a look at the patches.
> 
> Thanks
> Farhan
>

Patch
diff mbox series

diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
index 5aca475..f87757b 100644
--- a/drivers/s390/cio/vfio_ccw_drv.c
+++ b/drivers/s390/cio/vfio_ccw_drv.c
@@ -43,26 +43,22 @@  int vfio_ccw_sch_quiesce(struct subchannel *sch)
 	if (ret != -EBUSY)
 		goto out_unlock;
 
+	iretry = 255;
 	do {
-		iretry = 255;
 
-		ret = cio_cancel_halt_clear(sch, &iretry);
-		while (ret == -EBUSY) {
-			/*
-			 * Flush all I/O and wait for
-			 * cancel/halt/clear completion.
-			 */
-			private->completion = &completion;
-			spin_unlock_irq(sch->lock);
+		cio_cancel_halt_clear(sch, &iretry);
+		/*
+		 * Flush all I/O and wait for
+		 * cancel/halt/clear completion.
+		 */
+		private->completion = &completion;
+		spin_unlock_irq(sch->lock);
 
-			wait_for_completion_timeout(&completion, 3*HZ);
-
-			private->completion = NULL;
-			flush_workqueue(vfio_ccw_work_q);
-			spin_lock_irq(sch->lock);
-			ret = cio_cancel_halt_clear(sch, &iretry);
-		};
+		wait_for_completion_timeout(&completion, 3*HZ);
 
+		private->completion = NULL;
+		flush_workqueue(vfio_ccw_work_q);
+		spin_lock_irq(sch->lock);
 		ret = cio_disable_subchannel(sch);
 	} while (ret == -EBUSY);
 out_unlock: