diff mbox series

[v1,1/1] vfio-ccw: Don't free channel programs for unrelated interrupts

Message ID 20200124145455.51181-2-farman@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series vfio-ccw: Fix interrupt handling for HALT/CLEAR | expand

Commit Message

Eric Farman Jan. 24, 2020, 2:54 p.m. UTC
With the addition of asynchronous channel programs (HALT or CLEAR
SUBCHANNEL instructions), the hardware can receive interrupts that
are not related to any channel program currently active.  An attempt
is made to ensure that only associated resources are freed, but the
host can fail in unpleasant ways:

[ 1051.330289] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1051.330360] Failing address: c880003d16572000 TEID: c880003d16572803
[ 1051.330365] Fault in home space mode while using kernel ASCE.
[ 1051.330372] AS:00000000fde9c007 R3:0000000000000024
...snip...
[ 1051.330539]  [<00000000fccbd33e>] __kmalloc+0xd6/0x3d8
[ 1051.330543] ([<00000000fccbd514>] __kmalloc+0x2ac/0x3d8)
[ 1051.330550]  [<000003ff801452b4>] cp_prefetch+0xc4/0x3b8 [vfio_ccw]
[ 1051.330554]  [<000003ff801471e4>] fsm_io_request+0x2d4/0x7b8 [vfio_ccw]
[ 1051.330559]  [<000003ff80145d9c>] vfio_ccw_mdev_write+0x17c/0x300 [vfio_ccw]
[ 1051.330564]  [<00000000fccf0d20>] vfs_write+0xb0/0x1b8
[ 1051.330568]  [<00000000fccf1236>] ksys_pwrite64+0x7e/0xb8
[ 1051.330574]  [<00000000fd4524c0>] system_call+0xdc/0x2d8

The problem is corruption of the dma-kmalloc-8 slab [1], if an interrupt
occurs for a CLEAR or HALT that is not obviously associated with the
current channel program.  If the channel_program struct is freed AND
another interrupt for that I/O occurs, then this may occur:

583.612967 00          cp_prefetch  NEW SSCH
583.613180 03 vfio_ccw_sch_io_todo  orb.cpa=03012690 irb.cpa=03012698
                                    ccw=2704004203015600 *cda=1955d8fb8
                                    irb: fctl=4 actl=0 stctl=7
587.039292 04          cp_prefetch  NEW SSCH
587.039296 01 vfio_ccw_sch_io_todo  orb.cpa=7fe209f0 irb.cpa=03012698
                                    ccw=3424000c030a4068 *cda=1999e9cf0
                                    irb: fctl=2 actl=0 stctl=1
587.039505 01 vfio_ccw_sch_io_todo  orb.cpa=7fe209f0 irb.cpa=7fe209f8
                                    ccw=3424000c030a4068 *cda=0030a4070
                                    irb: fctl=4 actl=0 stctl=7

Note how the last vfio_ccw_sch_io_todo() call has a ccw.cda that is
right next to its supposed IDAW, compared to the previous one?  That
is the result of the previous one freeing the cp (and its IDAL), and
kfree writing the next available address at the beginning of the
newly released memory.  When the channel goes to store data, it
believes the IDAW is valid and overwrites that pointer and causes
kmalloc to fail some time later.

Since the vfio-ccw interrupt handler walks the list of ccwchain structs
to determine if the guest SCSW needs to be updated, it can be changed
to indicate whether the interrupt points within the channel_program.
If yes, then the channel_program is valid and its resources can be freed.
It not, then another interrupt is expected to do that later.

[1] It could be other dma-kmalloc-xxx slabs; this just happens to be the
one driven most frequently in my testing.

Fixes: f4c9939433bd ("vfio-ccw: Don't call cp_free if we are processing a channel program")
Signed-off-by: Eric Farman <farman@linux.ibm.com>
---
 drivers/s390/cio/vfio_ccw_cp.c  | 11 +++++++++--
 drivers/s390/cio/vfio_ccw_cp.h  |  2 +-
 drivers/s390/cio/vfio_ccw_drv.c |  4 ++--
 3 files changed, 12 insertions(+), 5 deletions(-)

Comments

Cornelia Huck Jan. 24, 2020, 3:33 p.m. UTC | #1
On Fri, 24 Jan 2020 15:54:55 +0100
Eric Farman <farman@linux.ibm.com> wrote:

> With the addition of asynchronous channel programs (HALT or CLEAR
> SUBCHANNEL instructions), the hardware can receive interrupts that
> are not related to any channel program currently active.  An attempt
> is made to ensure that only associated resources are freed, but the
> host can fail in unpleasant ways:
> 
> [ 1051.330289] Unable to handle kernel pointer dereference in virtual kernel address space
> [ 1051.330360] Failing address: c880003d16572000 TEID: c880003d16572803
> [ 1051.330365] Fault in home space mode while using kernel ASCE.
> [ 1051.330372] AS:00000000fde9c007 R3:0000000000000024
> ...snip...
> [ 1051.330539]  [<00000000fccbd33e>] __kmalloc+0xd6/0x3d8
> [ 1051.330543] ([<00000000fccbd514>] __kmalloc+0x2ac/0x3d8)
> [ 1051.330550]  [<000003ff801452b4>] cp_prefetch+0xc4/0x3b8 [vfio_ccw]
> [ 1051.330554]  [<000003ff801471e4>] fsm_io_request+0x2d4/0x7b8 [vfio_ccw]
> [ 1051.330559]  [<000003ff80145d9c>] vfio_ccw_mdev_write+0x17c/0x300 [vfio_ccw]
> [ 1051.330564]  [<00000000fccf0d20>] vfs_write+0xb0/0x1b8
> [ 1051.330568]  [<00000000fccf1236>] ksys_pwrite64+0x7e/0xb8
> [ 1051.330574]  [<00000000fd4524c0>] system_call+0xdc/0x2d8
> 
> The problem is corruption of the dma-kmalloc-8 slab [1], if an interrupt
> occurs for a CLEAR or HALT that is not obviously associated with the
> current channel program.  If the channel_program struct is freed AND
> another interrupt for that I/O occurs, then this may occur:
> 
> 583.612967 00          cp_prefetch  NEW SSCH
> 583.613180 03 vfio_ccw_sch_io_todo  orb.cpa=03012690 irb.cpa=03012698
>                                     ccw=2704004203015600 *cda=1955d8fb8
>                                     irb: fctl=4 actl=0 stctl=7
> 587.039292 04          cp_prefetch  NEW SSCH
> 587.039296 01 vfio_ccw_sch_io_todo  orb.cpa=7fe209f0 irb.cpa=03012698
>                                     ccw=3424000c030a4068 *cda=1999e9cf0
>                                     irb: fctl=2 actl=0 stctl=1
> 587.039505 01 vfio_ccw_sch_io_todo  orb.cpa=7fe209f0 irb.cpa=7fe209f8
>                                     ccw=3424000c030a4068 *cda=0030a4070
>                                     irb: fctl=4 actl=0 stctl=7
> 
> Note how the last vfio_ccw_sch_io_todo() call has a ccw.cda that is
> right next to its supposed IDAW, compared to the previous one?  That
> is the result of the previous one freeing the cp (and its IDAL), and
> kfree writing the next available address at the beginning of the
> newly released memory.  When the channel goes to store data, it
> believes the IDAW is valid and overwrites that pointer and causes
> kmalloc to fail some time later.
> 
> Since the vfio-ccw interrupt handler walks the list of ccwchain structs
> to determine if the guest SCSW needs to be updated, it can be changed
> to indicate whether the interrupt points within the channel_program.
> If yes, then the channel_program is valid and its resources can be freed.
> It not, then another interrupt is expected to do that later.
> 
> [1] It could be other dma-kmalloc-xxx slabs; this just happens to be the
> one driven most frequently in my testing.
> 
> Fixes: f4c9939433bd ("vfio-ccw: Don't call cp_free if we are processing a channel program")
> Signed-off-by: Eric Farman <farman@linux.ibm.com>
> ---
>  drivers/s390/cio/vfio_ccw_cp.c  | 11 +++++++++--
>  drivers/s390/cio/vfio_ccw_cp.h  |  2 +-
>  drivers/s390/cio/vfio_ccw_drv.c |  4 ++--
>  3 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> index 3645d1720c4b..2d942433baf9 100644
> --- a/drivers/s390/cio/vfio_ccw_cp.c
> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> @@ -803,15 +803,19 @@ union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
>   *
>   * This function updates @scsw->cpa to its coressponding guest physical
>   * address.
> + *
> + * Returns true if the channel program address in the irb was found
> + * within the chain of CCWs for this channel program.
>   */
> -void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
> +bool cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>  {
>  	struct ccwchain *chain;
>  	u32 cpa = scsw->cmd.cpa;
>  	u32 ccw_head;
> +	bool within_chain = false;
>  
>  	if (!cp->initialized)
> -		return;
> +		return false;
>  
>  	/*
>  	 * LATER:
> @@ -833,11 +837,14 @@ void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>  			 * head gets us the guest cpa.
>  			 */
>  			cpa = chain->ch_iova + (cpa - ccw_head);
> +			within_chain = true;
>  			break;
>  		}
>  	}
>  
>  	scsw->cmd.cpa = cpa;

Looking at this, I'm wondering why we would want to update the cpa if
!within_chain. But I'm probably too tired to review this properly
today...

> +
> +	return within_chain;
>  }
>  
>  /**
> diff --git a/drivers/s390/cio/vfio_ccw_cp.h b/drivers/s390/cio/vfio_ccw_cp.h
> index ba31240ce965..a4cb6527bd4e 100644
> --- a/drivers/s390/cio/vfio_ccw_cp.h
> +++ b/drivers/s390/cio/vfio_ccw_cp.h
> @@ -47,7 +47,7 @@ extern int cp_init(struct channel_program *cp, struct device *mdev,
>  extern void cp_free(struct channel_program *cp);
>  extern int cp_prefetch(struct channel_program *cp);
>  extern union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm);
> -extern void cp_update_scsw(struct channel_program *cp, union scsw *scsw);
> +extern bool cp_update_scsw(struct channel_program *cp, union scsw *scsw);
>  extern bool cp_iova_pinned(struct channel_program *cp, u64 iova);
>  
>  #endif
> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
> index e401a3d0aa57..a8ab256a217b 100644
> --- a/drivers/s390/cio/vfio_ccw_drv.c
> +++ b/drivers/s390/cio/vfio_ccw_drv.c
> @@ -90,8 +90,8 @@ static void vfio_ccw_sch_io_todo(struct work_struct *work)
>  	is_final = !(scsw_actl(&irb->scsw) &
>  		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
>  	if (scsw_is_solicited(&irb->scsw)) {
> -		cp_update_scsw(&private->cp, &irb->scsw);
> -		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
> +		if (cp_update_scsw(&private->cp, &irb->scsw) &&
> +		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)

...but I still wonder why is_final is not catching non-ssch related
interrupts, as I thought it would. We might want to adapt that check,
instead. (Or was the scsw_is_solicited() check supposed to catch that?
As said, too tired right now...)

>  			cp_free(&private->cp);
>  	}
>  	mutex_lock(&private->io_mutex);
Eric Farman Jan. 24, 2020, 4:08 p.m. UTC | #2
On 1/24/20 10:33 AM, Cornelia Huck wrote:
> On Fri, 24 Jan 2020 15:54:55 +0100
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> With the addition of asynchronous channel programs (HALT or CLEAR
>> SUBCHANNEL instructions), the hardware can receive interrupts that
>> are not related to any channel program currently active.  An attempt
>> is made to ensure that only associated resources are freed, but the
>> host can fail in unpleasant ways:
>>
>> [ 1051.330289] Unable to handle kernel pointer dereference in virtual kernel address space
>> [ 1051.330360] Failing address: c880003d16572000 TEID: c880003d16572803
>> [ 1051.330365] Fault in home space mode while using kernel ASCE.
>> [ 1051.330372] AS:00000000fde9c007 R3:0000000000000024
>> ...snip...
>> [ 1051.330539]  [<00000000fccbd33e>] __kmalloc+0xd6/0x3d8
>> [ 1051.330543] ([<00000000fccbd514>] __kmalloc+0x2ac/0x3d8)
>> [ 1051.330550]  [<000003ff801452b4>] cp_prefetch+0xc4/0x3b8 [vfio_ccw]
>> [ 1051.330554]  [<000003ff801471e4>] fsm_io_request+0x2d4/0x7b8 [vfio_ccw]
>> [ 1051.330559]  [<000003ff80145d9c>] vfio_ccw_mdev_write+0x17c/0x300 [vfio_ccw]
>> [ 1051.330564]  [<00000000fccf0d20>] vfs_write+0xb0/0x1b8
>> [ 1051.330568]  [<00000000fccf1236>] ksys_pwrite64+0x7e/0xb8
>> [ 1051.330574]  [<00000000fd4524c0>] system_call+0xdc/0x2d8
>>
>> The problem is corruption of the dma-kmalloc-8 slab [1], if an interrupt
>> occurs for a CLEAR or HALT that is not obviously associated with the
>> current channel program.  If the channel_program struct is freed AND
>> another interrupt for that I/O occurs, then this may occur:
>>
>> 583.612967 00          cp_prefetch  NEW SSCH
>> 583.613180 03 vfio_ccw_sch_io_todo  orb.cpa=03012690 irb.cpa=03012698
>>                                     ccw=2704004203015600 *cda=1955d8fb8
>>                                     irb: fctl=4 actl=0 stctl=7
>> 587.039292 04          cp_prefetch  NEW SSCH
>> 587.039296 01 vfio_ccw_sch_io_todo  orb.cpa=7fe209f0 irb.cpa=03012698
>>                                     ccw=3424000c030a4068 *cda=1999e9cf0
>>                                     irb: fctl=2 actl=0 stctl=1
>> 587.039505 01 vfio_ccw_sch_io_todo  orb.cpa=7fe209f0 irb.cpa=7fe209f8
>>                                     ccw=3424000c030a4068 *cda=0030a4070
>>                                     irb: fctl=4 actl=0 stctl=7
>>
>> Note how the last vfio_ccw_sch_io_todo() call has a ccw.cda that is
>> right next to its supposed IDAW, compared to the previous one?  That
>> is the result of the previous one freeing the cp (and its IDAL), and
>> kfree writing the next available address at the beginning of the
>> newly released memory.  When the channel goes to store data, it
>> believes the IDAW is valid and overwrites that pointer and causes
>> kmalloc to fail some time later.
>>
>> Since the vfio-ccw interrupt handler walks the list of ccwchain structs
>> to determine if the guest SCSW needs to be updated, it can be changed
>> to indicate whether the interrupt points within the channel_program.
>> If yes, then the channel_program is valid and its resources can be freed.
>> It not, then another interrupt is expected to do that later.
>>
>> [1] It could be other dma-kmalloc-xxx slabs; this just happens to be the
>> one driven most frequently in my testing.
>>
>> Fixes: f4c9939433bd ("vfio-ccw: Don't call cp_free if we are processing a channel program")
>> Signed-off-by: Eric Farman <farman@linux.ibm.com>
>> ---
>>  drivers/s390/cio/vfio_ccw_cp.c  | 11 +++++++++--
>>  drivers/s390/cio/vfio_ccw_cp.h  |  2 +-
>>  drivers/s390/cio/vfio_ccw_drv.c |  4 ++--
>>  3 files changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>> index 3645d1720c4b..2d942433baf9 100644
>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>> @@ -803,15 +803,19 @@ union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
>>   *
>>   * This function updates @scsw->cpa to its coressponding guest physical
>>   * address.
>> + *
>> + * Returns true if the channel program address in the irb was found
>> + * within the chain of CCWs for this channel program.
>>   */
>> -void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>> +bool cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>>  {
>>  	struct ccwchain *chain;
>>  	u32 cpa = scsw->cmd.cpa;
>>  	u32 ccw_head;
>> +	bool within_chain = false;
>>  
>>  	if (!cp->initialized)
>> -		return;
>> +		return false;
>>  
>>  	/*
>>  	 * LATER:
>> @@ -833,11 +837,14 @@ void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>>  			 * head gets us the guest cpa.
>>  			 */
>>  			cpa = chain->ch_iova + (cpa - ccw_head);
>> +			within_chain = true;
>>  			break;
>>  		}
>>  	}
>>  
>>  	scsw->cmd.cpa = cpa;
> 
> Looking at this, I'm wondering why we would want to update the cpa if
> !within_chain. But I'm probably too tired to review this properly
> today...

We probably don't, it's just what happens today and I didn't want to be
too disruptive (yet)  :)

> 
>> +
>> +	return within_chain;
>>  }
>>  
>>  /**
>> diff --git a/drivers/s390/cio/vfio_ccw_cp.h b/drivers/s390/cio/vfio_ccw_cp.h
>> index ba31240ce965..a4cb6527bd4e 100644
>> --- a/drivers/s390/cio/vfio_ccw_cp.h
>> +++ b/drivers/s390/cio/vfio_ccw_cp.h
>> @@ -47,7 +47,7 @@ extern int cp_init(struct channel_program *cp, struct device *mdev,
>>  extern void cp_free(struct channel_program *cp);
>>  extern int cp_prefetch(struct channel_program *cp);
>>  extern union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm);
>> -extern void cp_update_scsw(struct channel_program *cp, union scsw *scsw);
>> +extern bool cp_update_scsw(struct channel_program *cp, union scsw *scsw);
>>  extern bool cp_iova_pinned(struct channel_program *cp, u64 iova);
>>  
>>  #endif
>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
>> index e401a3d0aa57..a8ab256a217b 100644
>> --- a/drivers/s390/cio/vfio_ccw_drv.c
>> +++ b/drivers/s390/cio/vfio_ccw_drv.c
>> @@ -90,8 +90,8 @@ static void vfio_ccw_sch_io_todo(struct work_struct *work)
>>  	is_final = !(scsw_actl(&irb->scsw) &
>>  		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
>>  	if (scsw_is_solicited(&irb->scsw)) {
>> -		cp_update_scsw(&private->cp, &irb->scsw);
>> -		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
>> +		if (cp_update_scsw(&private->cp, &irb->scsw) &&
>> +		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
> 
> ...but I still wonder why is_final is not catching non-ssch related
> interrupts, as I thought it would. We might want to adapt that check,
> instead. (Or was the scsw_is_solicited() check supposed to catch that?
> As said, too tired right now...)

I had looked at the (un)solicited bits at one point, and saw very few
unsolicited interrupts.  The ones that did show up didn't appear to
affect things in the way that would cause the problems I'm seeing.

As for is_final...  That POPS table states that for "status pending
[alone] after termination of HALT or CLEAR ... cpa is unpredictable",
which is what happens here.  In the example above, the cpa is the same
as the previous (successful) interrupt, and thus unrelated to the
current chain.  Perhaps is_final needs to check that the function
control in the interrupt is for a start?

I will keep pondering the bit-checking before I hit my end of day.  Have
a fine weekend!

> 
>>  			cp_free(&private->cp);
>>  	}
>>  	mutex_lock(&private->io_mutex);
>
Cornelia Huck Jan. 27, 2020, 12:52 p.m. UTC | #3
On Fri, 24 Jan 2020 11:08:12 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 1/24/20 10:33 AM, Cornelia Huck wrote:
> > On Fri, 24 Jan 2020 15:54:55 +0100
> > Eric Farman <farman@linux.ibm.com> wrote:

> >> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
> >> index e401a3d0aa57..a8ab256a217b 100644
> >> --- a/drivers/s390/cio/vfio_ccw_drv.c
> >> +++ b/drivers/s390/cio/vfio_ccw_drv.c
> >> @@ -90,8 +90,8 @@ static void vfio_ccw_sch_io_todo(struct work_struct *work)
> >>  	is_final = !(scsw_actl(&irb->scsw) &
> >>  		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
> >>  	if (scsw_is_solicited(&irb->scsw)) {
> >> -		cp_update_scsw(&private->cp, &irb->scsw);
> >> -		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
> >> +		if (cp_update_scsw(&private->cp, &irb->scsw) &&
> >> +		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)  
> > 
> > ...but I still wonder why is_final is not catching non-ssch related
> > interrupts, as I thought it would. We might want to adapt that check,
> > instead. (Or was the scsw_is_solicited() check supposed to catch that?
> > As said, too tired right now...)  
> 
> I had looked at the (un)solicited bits at one point, and saw very few
> unsolicited interrupts.  The ones that did show up didn't appear to
> affect things in the way that would cause the problems I'm seeing.

Ok, so that check is hopefully fine.

> 
> As for is_final...  That POPS table states that for "status pending
> [alone] after termination of HALT or CLEAR ... cpa is unpredictable",
> which is what happens here.  In the example above, the cpa is the same
> as the previous (successful) interrupt, and thus unrelated to the
> current chain.  Perhaps is_final needs to check that the function
> control in the interrupt is for a start?

I think our reasoning last time we discussed this function was that we
only are in CP_PENDING if we actually did a ssch previously. Now, if we
do a hsch/csch before we got final status for the program started by
the ssch, we don't move out of the CP_PENDING, but the cpa still might
not be what we're looking for. So, we should probably check that we
have only the start function indicated in the fctl.

But if we do that, we still have a chain allocated for something that
has already been terminated... how do we find the right chain to clean
up, if needed?
Eric Farman Jan. 27, 2020, 9:28 p.m. UTC | #4
On 1/27/20 7:52 AM, Cornelia Huck wrote:
> On Fri, 24 Jan 2020 11:08:12 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> On 1/24/20 10:33 AM, Cornelia Huck wrote:
>>> On Fri, 24 Jan 2020 15:54:55 +0100
>>> Eric Farman <farman@linux.ibm.com> wrote:
> 
>>>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
>>>> index e401a3d0aa57..a8ab256a217b 100644
>>>> --- a/drivers/s390/cio/vfio_ccw_drv.c
>>>> +++ b/drivers/s390/cio/vfio_ccw_drv.c
>>>> @@ -90,8 +90,8 @@ static void vfio_ccw_sch_io_todo(struct work_struct *work)
>>>>  	is_final = !(scsw_actl(&irb->scsw) &
>>>>  		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
>>>>  	if (scsw_is_solicited(&irb->scsw)) {
>>>> -		cp_update_scsw(&private->cp, &irb->scsw);
>>>> -		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
>>>> +		if (cp_update_scsw(&private->cp, &irb->scsw) &&
>>>> +		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)  
>>>
>>> ...but I still wonder why is_final is not catching non-ssch related
>>> interrupts, as I thought it would. We might want to adapt that check,
>>> instead. (Or was the scsw_is_solicited() check supposed to catch that?
>>> As said, too tired right now...)  
>>
>> I had looked at the (un)solicited bits at one point, and saw very few
>> unsolicited interrupts.  The ones that did show up didn't appear to
>> affect things in the way that would cause the problems I'm seeing.
> 
> Ok, so that check is hopefully fine.
> 
>>
>> As for is_final...  That POPS table states that for "status pending
>> [alone] after termination of HALT or CLEAR ... cpa is unpredictable",
>> which is what happens here.  In the example above, the cpa is the same
>> as the previous (successful) interrupt, and thus unrelated to the
>> current chain.  Perhaps is_final needs to check that the function
>> control in the interrupt is for a start?
> 
> I think our reasoning last time we discussed this function was that we
> only are in CP_PENDING if we actually did a ssch previously. Now, if we

I spent a little time looking at the conversations on the patch that
added the CP_PENDING check.  Sadly, those patches hit the list when I
left for holiday so I came late to those discussions and there appears
some loose ends that I should've chased down at the time.  Sorry.

But yes, we should only be in CP_PENDING because of the SSCH, but the
only check of the interrupt here is the "is_final" check, and not that
the interrupt was for a start function.

> do a hsch/csch before we got final status for the program started by
> the ssch, we don't move out of the CP_PENDING, but the cpa still might
> not be what we're looking for. 

As long as we get an interrupt that's "is_final" then don't we come out
of CP_PENDING state at the end of this routine, regardless of whether or
not it does the cp_free() call?  I think your original diagnosis [1] was
that even if the cpa is invalid, calling cp_update_scsw() is okay
because garbage-in-garbage-out.  This patch makes that part of the
criteria for doing the cp_free(), so maybe that's too heavy?  After all,
it does mean that we may leave private->cp "initialized", but reset the
state back to IDLE.  (More on that in a minute.)

> So, we should probably check that we
> have only the start function indicated in the fctl.

For the call to cp_update_scsw() or cp_free()?  Or both?

> 
> But if we do that, we still have a chain allocated for something that
> has already been terminated... how do we find the right chain to clean
> up, if needed?

Don't we free all/none of the chains?  Ideally, since we only have one
set of chains per cp (and thus, per SSCH), they should either all be
freed or ignored.

But regardless, this patch is at least not complete, if not incorrect.
I left a test running for the weekend and while I don't see the storage
damage I saw before, there's a lot of unreleased memory because of stuff
like this:

950.541644 06 ...sch_io_todo sch 09c5: state=3 orb.cpa=7f586f48
                                               irb.w0=00001001
                                               irb.cpa=02e35d58
                                               irb.w2=0000000c
                                               ccw=0
                                               *cda=0
950.541837 06 ...sch_io_todo sch 09c5: state=2 orb.cpa=030ec750
                                               irb.w0=00c04007
                                               irb.cpa=7f586f50
                                               irb.w2=0c000000
                                               ccw=3424000c030ea840
                                               *cda=190757ef0

(I was only tracing instances where vfio-ccw did NOT call cp_free() on
the interrupt path; so I don't have a complete history of what happened.)

The orb.cpa address in the first trace looks like something which came
from the guest, rather than something built by vfio-ccw.  The irb.cpa
address in the second trace is 8 bytes after the first orb.cpa address.
And the storage referenced by both the CP and IDAL referenced in trace 2
are still active when I started poking at the state of things.

There's a lot just to unravel just with this instance.  Like why a guest
CPA is in orb, and thus an irb.  Or why cp_prefetch() checks that
!cp->initialized, but cp_init() does no such thing.  I guess I'll put in
a check to see how that helps this particular situation, while I sort
out the other problems here.

> 

[1] https://lore.kernel.org/kvm/20190702115134.790f8891.cohuck@redhat.com/
Cornelia Huck Jan. 28, 2020, 9:58 a.m. UTC | #5
On Mon, 27 Jan 2020 16:28:18 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 1/27/20 7:52 AM, Cornelia Huck wrote:
> > On Fri, 24 Jan 2020 11:08:12 -0500
> > Eric Farman <farman@linux.ibm.com> wrote:
> >   
> >> On 1/24/20 10:33 AM, Cornelia Huck wrote:  
> >>> On Fri, 24 Jan 2020 15:54:55 +0100
> >>> Eric Farman <farman@linux.ibm.com> wrote:  
> >   
> >>>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
> >>>> index e401a3d0aa57..a8ab256a217b 100644
> >>>> --- a/drivers/s390/cio/vfio_ccw_drv.c
> >>>> +++ b/drivers/s390/cio/vfio_ccw_drv.c
> >>>> @@ -90,8 +90,8 @@ static void vfio_ccw_sch_io_todo(struct work_struct *work)
> >>>>  	is_final = !(scsw_actl(&irb->scsw) &
> >>>>  		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
> >>>>  	if (scsw_is_solicited(&irb->scsw)) {
> >>>> -		cp_update_scsw(&private->cp, &irb->scsw);
> >>>> -		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
> >>>> +		if (cp_update_scsw(&private->cp, &irb->scsw) &&
> >>>> +		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)    
> >>>
> >>> ...but I still wonder why is_final is not catching non-ssch related
> >>> interrupts, as I thought it would. We might want to adapt that check,
> >>> instead. (Or was the scsw_is_solicited() check supposed to catch that?
> >>> As said, too tired right now...)    
> >>
> >> I had looked at the (un)solicited bits at one point, and saw very few
> >> unsolicited interrupts.  The ones that did show up didn't appear to
> >> affect things in the way that would cause the problems I'm seeing.  
> > 
> > Ok, so that check is hopefully fine.
> >   
> >>
> >> As for is_final...  That POPS table states that for "status pending
> >> [alone] after termination of HALT or CLEAR ... cpa is unpredictable",
> >> which is what happens here.  In the example above, the cpa is the same
> >> as the previous (successful) interrupt, and thus unrelated to the
> >> current chain.  Perhaps is_final needs to check that the function
> >> control in the interrupt is for a start?  
> > 
> > I think our reasoning last time we discussed this function was that we
> > only are in CP_PENDING if we actually did a ssch previously. Now, if we  
> 
> I spent a little time looking at the conversations on the patch that
> added the CP_PENDING check.  Sadly, those patches hit the list when I
> left for holiday so I came late to those discussions and there appears
> some loose ends that I should've chased down at the time.  Sorry.
> 
> But yes, we should only be in CP_PENDING because of the SSCH, but the
> only check of the interrupt here is the "is_final" check, and not that
> the interrupt was for a start function.

Hm. If we are in CP_PENDING, we have issued a ssch in the past for
which we did not get a final state yet. So getting here in CP_STANDING
without the start function set in the fctl would indicate a bug
elsewhere (missing state transition, or a thinko?)

> 
> > do a hsch/csch before we got final status for the program started by
> > the ssch, we don't move out of the CP_PENDING, but the cpa still might
> > not be what we're looking for.   
> 
> As long as we get an interrupt that's "is_final" then don't we come out
> of CP_PENDING state at the end of this routine, regardless of whether or
> not it does the cp_free() call?  I think your original diagnosis [1] was
> that even if the cpa is invalid, calling cp_update_scsw() is okay
> because garbage-in-garbage-out.  This patch makes that part of the
> criteria for doing the cp_free(), so maybe that's too heavy?  After all,
> it does mean that we may leave private->cp "initialized", but reset the
> state back to IDLE.  (More on that in a minute.)

Yes, rereading this I still think it's fine to call cp_update_scsw().

If we did a clear, we will certainly not get any more solicited
interrupts for this subchannel before we initiate a new operation. I
think that also holds for hsch? So we probably should free and move
state... but it seems we're still missing something.

> 
> > So, we should probably check that we
> > have only the start function indicated in the fctl.  
> 
> For the call to cp_update_scsw() or cp_free()?  Or both?

As said above, we probably don't need to do that... but I'm not sure.

> 
> > 
> > But if we do that, we still have a chain allocated for something that
> > has already been terminated... how do we find the right chain to clean
> > up, if needed?  
> 
> Don't we free all/none of the chains?  Ideally, since we only have one
> set of chains per cp (and thus, per SSCH), they should either all be
> freed or ignored.

Ah, yes. I reread the code, and I think you're correct.

> 
> But regardless, this patch is at least not complete, if not incorrect.
> I left a test running for the weekend and while I don't see the storage
> damage I saw before, there's a lot of unreleased memory because of stuff
> like this:
> 
> 950.541644 06 ...sch_io_todo sch 09c5: state=3 orb.cpa=7f586f48

We're in CP_PROCESSING? Maybe that's the problem? We got an interrupt
while still setting up the cp...

>                                                irb.w0=00001001

That's an interrupt for a csch, if I'm counting the bits right. I'm
wondering what sequence in the guest gets us here: If we're
CP_PROCESSING, we're still interpreting the ssch (and have not yet
returned control to the guest again). Still, there's an interrupt for a
csch... either the guest does not lock subchannel operations correctly
(so that it did a csch while another cpu did a ssch), or it did a ssch
without waiting for the interrupt for a csch. Or, of course,
something's completely messed up on the vfio-ccw side. As Linux seems
to run fine as a guest elsewhere, I rather suspect vfio-ccw :(

>                                                irb.cpa=02e35d58

That's... weird. Probably the "unpredictable" part of an interrupt for
a clear :)

>                                                irb.w2=0000000c

The count is not meaningful for an interrupt for a clear, either.

>                                                ccw=0
>                                                *cda=0
> 950.541837 06 ...sch_io_todo sch 09c5: state=2 orb.cpa=030ec750
>                                                irb.w0=00c04007
>                                                irb.cpa=7f586f50
>                                                irb.w2=0c000000
>                                                ccw=3424000c030ea840
>                                                *cda=190757ef0

That one looks like a normal interrupt for a channel program; but why
are we in IDLE?

> 
> (I was only tracing instances where vfio-ccw did NOT call cp_free() on
> the interrupt path; so I don't have a complete history of what happened.)
> 
> The orb.cpa address in the first trace looks like something which came
> from the guest, rather than something built by vfio-ccw.  The irb.cpa
> address in the second trace is 8 bytes after the first orb.cpa address.
> And the storage referenced by both the CP and IDAL referenced in trace 2
> are still active when I started poking at the state of things.
> 
> There's a lot just to unravel just with this instance.  Like why a guest
> CPA is in orb, and thus an irb.  Or why cp_prefetch() checks that
> !cp->initialized, but cp_init() does no such thing.  I guess I'll put in
> a check to see how that helps this particular situation, while I sort
> out the other problems here.

cp_init checking cp->initialized would probably be good to catch
errors, in any case. (Maybe put a trace there, just to see if it fires?)

I'm wondering if we're looking at the right part of the code. After
rereading, I think the interrupt handling code is ok... it seems
something goes wrong earlier (looking at that weird first trace).

Have you seen this without your path handling changes? The thing that
probably exposes the problem is that path validation etc. might do some
extra csch, and also some extra I/O restricted to a certain path.

> 
> >   
> 
> [1] https://lore.kernel.org/kvm/20190702115134.790f8891.cohuck@redhat.com/
>
Eric Farman Jan. 28, 2020, 2:42 p.m. UTC | #6
On 1/28/20 4:58 AM, Cornelia Huck wrote:
> On Mon, 27 Jan 2020 16:28:18 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> On 1/27/20 7:52 AM, Cornelia Huck wrote:
>>> On Fri, 24 Jan 2020 11:08:12 -0500
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>   
>>>> On 1/24/20 10:33 AM, Cornelia Huck wrote:  
>>>>> On Fri, 24 Jan 2020 15:54:55 +0100
>>>>> Eric Farman <farman@linux.ibm.com> wrote:  
>>>   
>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
>>>>>> index e401a3d0aa57..a8ab256a217b 100644
>>>>>> --- a/drivers/s390/cio/vfio_ccw_drv.c
>>>>>> +++ b/drivers/s390/cio/vfio_ccw_drv.c
>>>>>> @@ -90,8 +90,8 @@ static void vfio_ccw_sch_io_todo(struct work_struct *work)
>>>>>>  	is_final = !(scsw_actl(&irb->scsw) &
>>>>>>  		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
>>>>>>  	if (scsw_is_solicited(&irb->scsw)) {
>>>>>> -		cp_update_scsw(&private->cp, &irb->scsw);
>>>>>> -		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
>>>>>> +		if (cp_update_scsw(&private->cp, &irb->scsw) &&
>>>>>> +		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)    
>>>>>
>>>>> ...but I still wonder why is_final is not catching non-ssch related
>>>>> interrupts, as I thought it would. We might want to adapt that check,
>>>>> instead. (Or was the scsw_is_solicited() check supposed to catch that?
>>>>> As said, too tired right now...)    
>>>>
>>>> I had looked at the (un)solicited bits at one point, and saw very few
>>>> unsolicited interrupts.  The ones that did show up didn't appear to
>>>> affect things in the way that would cause the problems I'm seeing.  
>>>
>>> Ok, so that check is hopefully fine.
>>>   
>>>>
>>>> As for is_final...  That POPS table states that for "status pending
>>>> [alone] after termination of HALT or CLEAR ... cpa is unpredictable",
>>>> which is what happens here.  In the example above, the cpa is the same
>>>> as the previous (successful) interrupt, and thus unrelated to the
>>>> current chain.  Perhaps is_final needs to check that the function
>>>> control in the interrupt is for a start?  
>>>
>>> I think our reasoning last time we discussed this function was that we
>>> only are in CP_PENDING if we actually did a ssch previously. Now, if we  
>>
>> I spent a little time looking at the conversations on the patch that
>> added the CP_PENDING check.  Sadly, those patches hit the list when I
>> left for holiday so I came late to those discussions and there appears
>> some loose ends that I should've chased down at the time.  Sorry.
>>
>> But yes, we should only be in CP_PENDING because of the SSCH, but the
>> only check of the interrupt here is the "is_final" check, and not that
>> the interrupt was for a start function.
> 
> Hm. If we are in CP_PENDING, we have issued a ssch in the past for
> which we did not get a final state yet. So getting here in CP_STANDING
> without the start function set in the fctl would indicate a bug
> elsewhere (missing state transition, or a thinko?)
> 
>>
>>> do a hsch/csch before we got final status for the program started by
>>> the ssch, we don't move out of the CP_PENDING, but the cpa still might
>>> not be what we're looking for.   
>>
>> As long as we get an interrupt that's "is_final" then don't we come out
>> of CP_PENDING state at the end of this routine, regardless of whether or
>> not it does the cp_free() call?  I think your original diagnosis [1] was
>> that even if the cpa is invalid, calling cp_update_scsw() is okay
>> because garbage-in-garbage-out.  This patch makes that part of the
>> criteria for doing the cp_free(), so maybe that's too heavy?  After all,
>> it does mean that we may leave private->cp "initialized", but reset the
>> state back to IDLE.  (More on that in a minute.)
> 
> Yes, rereading this I still think it's fine to call cp_update_scsw().
> 
> If we did a clear, we will certainly not get any more solicited
> interrupts for this subchannel before we initiate a new operation. I
> think that also holds for hsch? So we probably should free and move
> state... but it seems we're still missing something.
> 
>>
>>> So, we should probably check that we
>>> have only the start function indicated in the fctl.  
>>
>> For the call to cp_update_scsw() or cp_free()?  Or both?
> 
> As said above, we probably don't need to do that... but I'm not sure.
> 
>>
>>>
>>> But if we do that, we still have a chain allocated for something that
>>> has already been terminated... how do we find the right chain to clean
>>> up, if needed?  
>>
>> Don't we free all/none of the chains?  Ideally, since we only have one
>> set of chains per cp (and thus, per SSCH), they should either all be
>> freed or ignored.
> 
> Ah, yes. I reread the code, and I think you're correct.
> 
>>
>> But regardless, this patch is at least not complete, if not incorrect.
>> I left a test running for the weekend and while I don't see the storage
>> damage I saw before, there's a lot of unreleased memory because of stuff
>> like this:
>>
>> 950.541644 06 ...sch_io_todo sch 09c5: state=3 orb.cpa=7f586f48
> 
> We're in CP_PROCESSING? Maybe that's the problem? We got an interrupt
> while still setting up the cp...
> 
>>                                                irb.w0=00001001
> 
> That's an interrupt for a csch, if I'm counting the bits right. I'm

Correct.

> wondering what sequence in the guest gets us here: If we're
> CP_PROCESSING, we're still interpreting the ssch (and have not yet
> returned control to the guest again). Still, there's an interrupt for a
> csch... either the guest does not lock subchannel operations correctly
> (so that it did a csch while another cpu did a ssch), or it did a ssch

Hrm, that's a thought...  As soon as csch is issued, we're free to
process a ssch on another cpu while the interrupt for the csch is
returned.  But since this routine always sets private->state=IDLE as
long as is_final is true, which is the case here, we leak the cp being
processed at this point.  As you stated earlier, we need to consider the
fctl when it comes to cleaning things up here.  That's why the second
interrupt has a state of IDLE when things look (mostly) okay.

> without waiting for the interrupt for a csch. Or, of course,
> something's completely messed up on the vfio-ccw side. As Linux seems
> to run fine as a guest elsewhere, I rather suspect vfio-ccw :(

Me too.  And I would rather expect weirdness in the guest if it were a
base Linux problem, rather than at the host level.

> 
>>                                                irb.cpa=02e35d58
> 
> That's... weird. Probably the "unpredictable" part of an interrupt for
> a clear :)

:)

> 
>>                                                irb.w2=0000000c
> 
> The count is not meaningful for an interrupt for a clear, either.
> 
>>                                                ccw=0
>>                                                *cda=0
>> 950.541837 06 ...sch_io_todo sch 09c5: state=2 orb.cpa=030ec750
>>                                                irb.w0=00c04007
>>                                                irb.cpa=7f586f50
>>                                                irb.w2=0c000000
>>                                                ccw=3424000c030ea840
>>                                                *cda=190757ef0
> 
> That one looks like a normal interrupt for a channel program; but why
> are we in IDLE?

No idea, but it means we would've left the cp struct intact even without
this patch.

> 
>>
>> (I was only tracing instances where vfio-ccw did NOT call cp_free() on
>> the interrupt path; so I don't have a complete history of what happened.)
>>
>> The orb.cpa address in the first trace looks like something which came
>> from the guest, rather than something built by vfio-ccw.  The irb.cpa
>> address in the second trace is 8 bytes after the first orb.cpa address.
>> And the storage referenced by both the CP and IDAL referenced in trace 2
>> are still active when I started poking at the state of things.
>>
>> There's a lot just to unravel just with this instance.  Like why a guest
>> CPA is in orb, and thus an irb.  Or why cp_prefetch() checks that
>> !cp->initialized, but cp_init() does no such thing.  I guess I'll put in
>> a check to see how that helps this particular situation, while I sort
>> out the other problems here.
> 
> cp_init checking cp->initialized would probably be good to catch
> errors, in any case. (Maybe put a trace there, just to see if it fires?)

I did this last night, and got frustrated.  The unfortunate thing was
that once it fires, we end up flooding our trace buffers with errors as
the guest continually retries.  So I need to either make a smarter trace
that is rate limited or just crash my host once this condition occurs.
Will try to do that between meetings today.

> 
> I'm wondering if we're looking at the right part of the code. After
> rereading, I think the interrupt handling code is ok... it seems
> something goes wrong earlier (looking at that weird first trace).

Perhaps it is a race across cpus, as you suggest.

> 
> Have you seen this without your path handling changes? The thing that
> probably exposes the problem is that path validation etc. might do some
> extra csch, and also some extra I/O restricted to a certain path.

I have not, and I've run some pretty unpleasant tests over the past
several months prior to the path handling changes.  Now, I haven't
explicitly looked at the dma slabs to see if there's any funny business
lurking after those tests, but the systems have behaved normally.  I
probably spent too much time thinking the problem lurks in the path
handling code, but rather it's just the extra I/Os that the guest drives
in reaction to the CRWs we give it.  (Hrm, could it be that we give N
CRWs based on the number of subchannels that share a chpid?  I should
re-try this with just a single vfio-ccw device within the guest.)

> 
>>
>>>   
>>
>> [1] https://lore.kernel.org/kvm/20190702115134.790f8891.cohuck@redhat.com/
>>
>
Eric Farman Jan. 29, 2020, 4:13 a.m. UTC | #7
On 1/28/20 9:42 AM, Eric Farman wrote:
> 
> 
> On 1/28/20 4:58 AM, Cornelia Huck wrote:
>> On Mon, 27 Jan 2020 16:28:18 -0500

...snip...

>>
>> cp_init checking cp->initialized would probably be good to catch
>> errors, in any case. (Maybe put a trace there, just to see if it fires?)
> 
> I did this last night, and got frustrated.  The unfortunate thing was
> that once it fires, we end up flooding our trace buffers with errors as
> the guest continually retries.  So I need to either make a smarter trace
> that is rate limited or just crash my host once this condition occurs.
> Will try to do that between meetings today.
> 

I reverted the subject patch, and simply triggered
BUG_ON(cp->initialized) in cp_init().  It sprung VERY quickly (all
traces are for the same device):

366.399682 03 ...sch_io_todo state=4 o.cpa=03017810
                             i.w0=00c04007 i.cpa=03017818 i.w2=0c000000
366.399832 03 ...sch_io_todo state=3 o.cpa=7f53dd30 UNSOLICITED
                             i.w0=00c00011 i.cpa=03017818 i.w2=85000000
366.400086 03 ...sch_io_todo state=2 o.cpa=03017930
                             i.w0=00c04007 i.cpa=03017938 i.w2=0c000000
366.400313 03 ...sch_io_todo state=3 o.cpa=03017930
                             i.w0=00001001 i.cpa=03017938 i.w2=00000000

Ah, of course...  Unsolicited interrupts DO reset private->state back to
idle, but leave cp->initialized and any channel_program struct remains
allocated.  So there's one problem (a memory leak), and an easy one to
rectify.

After more than a few silly rabbit holes, I had this trace:

429.928480 07 ...sch_io_todo state=4 init=1 o.cpa=7fed8e10
                             i.w0=00001001 i.cpa=7fed8e18 i.w2=00000000
429.929132 07 ...sch_io_todo state=4 init=1 o.cpa=0305aed0
                             i.w0=00c04007 i.cpa=0305aed8 i.w2=0c000000
429.929538 07 ...sch_io_todo state=4 init=1 o.cpa=0305af30
                             i.w0=00c04007 i.cpa=0305af38 i.w2=0c000000
467.339389 07   ...chp_event mask=0x80 event=1
467.339865 03 ...sch_io_todo state=3 init=0 o.cpa=01814548
                             i.w0=00c02001 i.cpa=0305af38 i.w2=00000000

So my trace is at the beginning of vfio_ccw_sch_io_todo(), but the
BUG_ON() is at the end of that function where private->state is
(possibly) updated.  Looking at the contents of the vfio_ccw_private
struct in the dump, the failing device is currently state=4 init=1
instead of 3/0 as in the above trace.  So an I/O was being built in
parallel here, and there's no serializing action within the stacked
vfio_ccw_sch_io_todo() call to ensure they don't stomp on one another.
The io_mutex handles the region changes, and the subchannel lock handles
the start/halt/clear subchannel instructions, but nothing on the
interrupt side, nor contention between them.  Sigh.

My brain hurts.  I re-applied this patch (with some validation that the
cpa is valid) to my current franken-code, and will let it run overnight.
 I think it's going to be racing other CPUs and I'll find a dead system
by morning, but who knows.  Maybe not.  :)
Cornelia Huck Jan. 29, 2020, noon UTC | #8
On Tue, 28 Jan 2020 23:13:30 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 1/28/20 9:42 AM, Eric Farman wrote:
> > 
> > 
> > On 1/28/20 4:58 AM, Cornelia Huck wrote:  
> >> On Mon, 27 Jan 2020 16:28:18 -0500  
> 
> ...snip...
> 
> >>
> >> cp_init checking cp->initialized would probably be good to catch
> >> errors, in any case. (Maybe put a trace there, just to see if it fires?)  
> > 
> > I did this last night, and got frustrated.  The unfortunate thing was
> > that once it fires, we end up flooding our trace buffers with errors as
> > the guest continually retries.  So I need to either make a smarter trace
> > that is rate limited or just crash my host once this condition occurs.
> > Will try to do that between meetings today.
> >   
> 
> I reverted the subject patch, and simply triggered
> BUG_ON(cp->initialized) in cp_init().  It sprung VERY quickly (all
> traces are for the same device):
> 
> 366.399682 03 ...sch_io_todo state=4 o.cpa=03017810
>                              i.w0=00c04007 i.cpa=03017818 i.w2=0c000000
> 366.399832 03 ...sch_io_todo state=3 o.cpa=7f53dd30 UNSOLICITED
>                              i.w0=00c00011 i.cpa=03017818 i.w2=85000000
> 366.400086 03 ...sch_io_todo state=2 o.cpa=03017930
>                              i.w0=00c04007 i.cpa=03017938 i.w2=0c000000
> 366.400313 03 ...sch_io_todo state=3 o.cpa=03017930
>                              i.w0=00001001 i.cpa=03017938 i.w2=00000000
> 
> Ah, of course...  Unsolicited interrupts DO reset private->state back to
> idle, but leave cp->initialized and any channel_program struct remains
> allocated.  So there's one problem (a memory leak), and an easy one to
> rectify.

For a moment, I suspected a deferred condition code here, but it seems
to be a pure unsolicited interrupt.

But that got me thinking: If we get an unsolicited interrupt while
building the cp, it means that the guest is currently executing ssch.
We need to get the unsolicited interrupt to the guest, while not
executing the ssch. So maybe we need to do the following:

- deliver the unsolicited interrupt to the guest
- make sure we don't execute the ssch, but relay a cc 1 for it back to
  the guest
- clean up the cp

Maybe not avoiding issuing the ssch is what gets us in that pickle? We
either leak memory or free too much, it seems.

> 
> After more than a few silly rabbit holes, I had this trace:
> 
> 429.928480 07 ...sch_io_todo state=4 init=1 o.cpa=7fed8e10
>                              i.w0=00001001 i.cpa=7fed8e18 i.w2=00000000
> 429.929132 07 ...sch_io_todo state=4 init=1 o.cpa=0305aed0
>                              i.w0=00c04007 i.cpa=0305aed8 i.w2=0c000000
> 429.929538 07 ...sch_io_todo state=4 init=1 o.cpa=0305af30
>                              i.w0=00c04007 i.cpa=0305af38 i.w2=0c000000
> 467.339389 07   ...chp_event mask=0x80 event=1
> 467.339865 03 ...sch_io_todo state=3 init=0 o.cpa=01814548
>                              i.w0=00c02001 i.cpa=0305af38 i.w2=00000000
> 
> So my trace is at the beginning of vfio_ccw_sch_io_todo(), but the
> BUG_ON() is at the end of that function where private->state is
> (possibly) updated.  Looking at the contents of the vfio_ccw_private
> struct in the dump, the failing device is currently state=4 init=1
> instead of 3/0 as in the above trace.  So an I/O was being built in
> parallel here, and there's no serializing action within the stacked
> vfio_ccw_sch_io_todo() call to ensure they don't stomp on one another.
> The io_mutex handles the region changes, and the subchannel lock handles
> the start/halt/clear subchannel instructions, but nothing on the
> interrupt side, nor contention between them.  Sigh.

I feel we've been here a few times already, and never seem to come up
with a complete solution :(

There had been some changes by Pierre regarding locking the fsm; maybe
that's what's needed here?

> 
> My brain hurts.  I re-applied this patch (with some validation that the
> cpa is valid) to my current franken-code, and will let it run overnight.
>  I think it's going to be racing other CPUs and I'll find a dead system
> by morning, but who knows.  Maybe not.  :)
> 

I can relate to the brain hurting part :)
Eric Farman Jan. 29, 2020, 4:48 p.m. UTC | #9
On 1/29/20 7:00 AM, Cornelia Huck wrote:
> On Tue, 28 Jan 2020 23:13:30 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> On 1/28/20 9:42 AM, Eric Farman wrote:
>>>
>>>
>>> On 1/28/20 4:58 AM, Cornelia Huck wrote:  
>>>> On Mon, 27 Jan 2020 16:28:18 -0500  
>>
>> ...snip...
>>
>>>>
>>>> cp_init checking cp->initialized would probably be good to catch
>>>> errors, in any case. (Maybe put a trace there, just to see if it fires?)  
>>>
>>> I did this last night, and got frustrated.  The unfortunate thing was
>>> that once it fires, we end up flooding our trace buffers with errors as
>>> the guest continually retries.  So I need to either make a smarter trace
>>> that is rate limited or just crash my host once this condition occurs.
>>> Will try to do that between meetings today.
>>>   
>>
>> I reverted the subject patch, and simply triggered
>> BUG_ON(cp->initialized) in cp_init().  It sprung VERY quickly (all
>> traces are for the same device):
>>
>> 366.399682 03 ...sch_io_todo state=4 o.cpa=03017810
>>                              i.w0=00c04007 i.cpa=03017818 i.w2=0c000000
>> 366.399832 03 ...sch_io_todo state=3 o.cpa=7f53dd30 UNSOLICITED
>>                              i.w0=00c00011 i.cpa=03017818 i.w2=85000000
>> 366.400086 03 ...sch_io_todo state=2 o.cpa=03017930
>>                              i.w0=00c04007 i.cpa=03017938 i.w2=0c000000
>> 366.400313 03 ...sch_io_todo state=3 o.cpa=03017930
>>                              i.w0=00001001 i.cpa=03017938 i.w2=00000000
>>
>> Ah, of course...  Unsolicited interrupts DO reset private->state back to
>> idle, but leave cp->initialized and any channel_program struct remains
>> allocated.  So there's one problem (a memory leak), and an easy one to
>> rectify.
> 
> For a moment, I suspected a deferred condition code here, but it seems
> to be a pure unsolicited interrupt.
> 
> But that got me thinking: If we get an unsolicited interrupt while
> building the cp, it means that the guest is currently executing ssch.
> We need to get the unsolicited interrupt to the guest, while not
> executing the ssch. So maybe we need to do the following:
> 
> - deliver the unsolicited interrupt to the guest
> - make sure we don't execute the ssch, but relay a cc 1 for it back to
>   the guest
> - clean up the cp
> 
> Maybe not avoiding issuing the ssch is what gets us in that pickle? We
> either leak memory or free too much, it seems.

It's possible...  I'll try hacking at that for a bit.

> 
>>
>> After more than a few silly rabbit holes, I had this trace:
>>
>> 429.928480 07 ...sch_io_todo state=4 init=1 o.cpa=7fed8e10
>>                              i.w0=00001001 i.cpa=7fed8e18 i.w2=00000000
>> 429.929132 07 ...sch_io_todo state=4 init=1 o.cpa=0305aed0
>>                              i.w0=00c04007 i.cpa=0305aed8 i.w2=0c000000
>> 429.929538 07 ...sch_io_todo state=4 init=1 o.cpa=0305af30
>>                              i.w0=00c04007 i.cpa=0305af38 i.w2=0c000000
>> 467.339389 07   ...chp_event mask=0x80 event=1
>> 467.339865 03 ...sch_io_todo state=3 init=0 o.cpa=01814548
>>                              i.w0=00c02001 i.cpa=0305af38 i.w2=00000000
>>
>> So my trace is at the beginning of vfio_ccw_sch_io_todo(), but the
>> BUG_ON() is at the end of that function where private->state is
>> (possibly) updated.  Looking at the contents of the vfio_ccw_private
>> struct in the dump, the failing device is currently state=4 init=1
>> instead of 3/0 as in the above trace.  So an I/O was being built in
>> parallel here, and there's no serializing action within the stacked
>> vfio_ccw_sch_io_todo() call to ensure they don't stomp on one another.
>> The io_mutex handles the region changes, and the subchannel lock handles
>> the start/halt/clear subchannel instructions, but nothing on the
>> interrupt side, nor contention between them.  Sigh.
> 
> I feel we've been here a few times already, and never seem to come up
> with a complete solution :(
> 
> There had been some changes by Pierre regarding locking the fsm; maybe
> that's what's needed here?

Hrm...  I'd forgotten all about those.  I found them on
patchwork.kernel.org; will see what they encompass.

> 
>>
>> My brain hurts.  I re-applied this patch (with some validation that the
>> cpa is valid) to my current franken-code, and will let it run overnight.
>>  I think it's going to be racing other CPUs and I'll find a dead system
>> by morning, but who knows.  Maybe not.  :)
>>
> 
> I can relate to the brain hurting part :)
> 

:)

My system crashed after about six hours, but not because of the BUG_ON()
traps I placed.  Rather, dma-kmalloc-8 got clobbered again with what
looks like x100 bytes of data from one of the other CCWs.  Of course, I
didn't trace the CCW/IDA data this time, so I don't know when the memory
in question was allocated/released/used.  But, there are 35 deferred
cc=1 interrupts in the trace though, so I'll give some some thought to
the ideas above while re-running with the full traces in place.

Thanks!
diff mbox series

Patch

diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index 3645d1720c4b..2d942433baf9 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -803,15 +803,19 @@  union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
  *
  * This function updates @scsw->cpa to its coressponding guest physical
  * address.
+ *
+ * Returns true if the channel program address in the irb was found
+ * within the chain of CCWs for this channel program.
  */
-void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
+bool cp_update_scsw(struct channel_program *cp, union scsw *scsw)
 {
 	struct ccwchain *chain;
 	u32 cpa = scsw->cmd.cpa;
 	u32 ccw_head;
+	bool within_chain = false;
 
 	if (!cp->initialized)
-		return;
+		return false;
 
 	/*
 	 * LATER:
@@ -833,11 +837,14 @@  void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
 			 * head gets us the guest cpa.
 			 */
 			cpa = chain->ch_iova + (cpa - ccw_head);
+			within_chain = true;
 			break;
 		}
 	}
 
 	scsw->cmd.cpa = cpa;
+
+	return within_chain;
 }
 
 /**
diff --git a/drivers/s390/cio/vfio_ccw_cp.h b/drivers/s390/cio/vfio_ccw_cp.h
index ba31240ce965..a4cb6527bd4e 100644
--- a/drivers/s390/cio/vfio_ccw_cp.h
+++ b/drivers/s390/cio/vfio_ccw_cp.h
@@ -47,7 +47,7 @@  extern int cp_init(struct channel_program *cp, struct device *mdev,
 extern void cp_free(struct channel_program *cp);
 extern int cp_prefetch(struct channel_program *cp);
 extern union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm);
-extern void cp_update_scsw(struct channel_program *cp, union scsw *scsw);
+extern bool cp_update_scsw(struct channel_program *cp, union scsw *scsw);
 extern bool cp_iova_pinned(struct channel_program *cp, u64 iova);
 
 #endif
diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c
index e401a3d0aa57..a8ab256a217b 100644
--- a/drivers/s390/cio/vfio_ccw_drv.c
+++ b/drivers/s390/cio/vfio_ccw_drv.c
@@ -90,8 +90,8 @@  static void vfio_ccw_sch_io_todo(struct work_struct *work)
 	is_final = !(scsw_actl(&irb->scsw) &
 		     (SCSW_ACTL_DEVACT | SCSW_ACTL_SCHACT));
 	if (scsw_is_solicited(&irb->scsw)) {
-		cp_update_scsw(&private->cp, &irb->scsw);
-		if (is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
+		if (cp_update_scsw(&private->cp, &irb->scsw) &&
+		    is_final && private->state == VFIO_CCW_STATE_CP_PENDING)
 			cp_free(&private->cp);
 	}
 	mutex_lock(&private->io_mutex);