diff mbox series

[PULL,1/1] vfio-ccw: Don't assume there are more ccws after a TIC

Message ID 20190204170643.7521-2-cohuck@redhat.com (mailing list archive)
State New, archived
Headers show
Series [PULL,1/1] vfio-ccw: Don't assume there are more ccws after a TIC | expand

Commit Message

Cornelia Huck Feb. 4, 2019, 5:06 p.m. UTC
From: Farhan Ali <alifm@linux.ibm.com>

When trying to calculate the length of a ccw chain, we assume
there are ccws after a TIC. This can lead to overcounting and
copying garbage data from guest memory.

Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
---
 drivers/s390/cio/vfio_ccw_cp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Farman Feb. 20, 2019, 2:49 a.m. UTC | #1
Hi Connie, Farhan,

On 02/04/2019 12:06 PM, Cornelia Huck wrote:
> From: Farhan Ali <alifm@linux.ibm.com>
> 
> When trying to calculate the length of a ccw chain, we assume
> there are ccws after a TIC. This can lead to overcounting and
> copying garbage data from guest memory.
> 
> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> ---
>   drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> index 70a006ba4d05..ba08fe137c2e 100644
> --- a/drivers/s390/cio/vfio_ccw_cp.c
> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
>   			return -EOPNOTSUPP;
>   		}
>   
> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> +		if (!ccw_is_chain(ccw))
>   			break;
>   
>   		ccw++;
> 

Sigh.  Let me state up front that I'm running vanilla 5.0-rc7 host 
kernel with this patch applied (or not), and QEMU from a week or two ago.

I stated in another thread that this patch makes things better when 
running fio and other random exercisers within my guest.  But I was 
always booting off a virtio-blk disk when I did that, and using some 
additional disks connected by vfio-ccw.  Today I discovered that this 
patch prevents guest boot from vfio-ccw (which was working previously):

/usr/local/bin/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -smp 
4 -m 1024 -nographic -s -net none -device 
vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0 
-bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img
LOADPARM=[        ]
vfio-ccw device I/O error - Interrupt Response Block Data:
     Function Ctrl : [Start]
     Activity Ctrl :
     Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
     Device Status : [Channel-End] [Device-End]
     Channel Status : [Incorrect-Length]
     cpa=: 0x0000000000000018
     prev_ccw=: 0x3100003840000008
     this_ccw=: 0x0800001000000000
Eckd Dasd Sense Data (fmt 32-bytes):
     Sense Condition Flags :
     Residual Count     =: 0x0000000000000000
     Phys Drive ID      =: 0x0000000000000000
     low cyl address    =: 0x0000000000000000
     head addr & hi cyl =: 0x0000000000000000
     format/message     =: 0x0000000000000000
     fmt-dependent[0-7] =: 0x0000000000000000
     fmt-dependent[8-15]=: 0x0000000005160f00
     prog action code   =: 0x0000000000000000
     Configuration info =: 0x00000000000040e0
     mcode / hi-cyl     =: 0x0000000000000000
     cyl & head addr [0]=: 0x0000000000000000
     cyl & head addr [1]=: 0x0000000000000000
     cyl & head addr [2]=: 0x0000000000000000
...snip...

Reverting this patch allows the boot to proceed normally.  So, ugh.

I'm not going to claim to know exactly how this is failing, because it's 
too late for coffee and reading IPL records is difficult enough.  But in 
the working case (without this patch), cp_init() calls 
ccwchain_calc_length() and gets a count of four CCWs (SEEK + SIDE + TIC 
+ READ), which is broken up into two chains.  One for the SEEK/SIDE/TIC, 
and another for the READ.  In the failing case (with this patch), 
cp_init() only counts three CCWs, we never read the IPL2 address (the 
second chain in the working case), and our boot fails.

Unsure what the next best move is here.  It is possible that the 
(ill-used) check that Farhan noticed and removed was actually added to 
"make boot work" when handling recursive TIC CCWs such as this.  But I 
still agree that this patch is correct, even though it exposes more 
problems with our TIC handling throughout this code at large.  Any thoughts?

  - Eric
Cornelia Huck Feb. 20, 2019, 9:48 a.m. UTC | #2
On Tue, 19 Feb 2019 21:49:07 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> Hi Connie, Farhan,
> 
> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
> > From: Farhan Ali <alifm@linux.ibm.com>
> > 
> > When trying to calculate the length of a ccw chain, we assume
> > there are ccws after a TIC. This can lead to overcounting and
> > copying garbage data from guest memory.
> > 
> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> > Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > ---
> >   drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> > index 70a006ba4d05..ba08fe137c2e 100644
> > --- a/drivers/s390/cio/vfio_ccw_cp.c
> > +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
> >   			return -EOPNOTSUPP;
> >   		}
> >   
> > -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> > +		if (!ccw_is_chain(ccw))
> >   			break;
> >   
> >   		ccw++;
> >   
> 
> Sigh.  Let me state up front that I'm running vanilla 5.0-rc7 host 
> kernel with this patch applied (or not), and QEMU from a week or two ago.
> 
> I stated in another thread that this patch makes things better when 
> running fio and other random exercisers within my guest.  But I was 
> always booting off a virtio-blk disk when I did that, and using some 
> additional disks connected by vfio-ccw.  Today I discovered that this 
> patch prevents guest boot from vfio-ccw (which was working previously):
> 
> /usr/local/bin/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -smp 
> 4 -m 1024 -nographic -s -net none -device 
> vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0 
> -bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img
> LOADPARM=[        ]
> vfio-ccw device I/O error - Interrupt Response Block Data:
>      Function Ctrl : [Start]
>      Activity Ctrl :
>      Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
>      Device Status : [Channel-End] [Device-End]
>      Channel Status : [Incorrect-Length]
>      cpa=: 0x0000000000000018
>      prev_ccw=: 0x3100003840000008
>      this_ccw=: 0x0800001000000000
> Eckd Dasd Sense Data (fmt 32-bytes):
>      Sense Condition Flags :
>      Residual Count     =: 0x0000000000000000
>      Phys Drive ID      =: 0x0000000000000000
>      low cyl address    =: 0x0000000000000000
>      head addr & hi cyl =: 0x0000000000000000
>      format/message     =: 0x0000000000000000
>      fmt-dependent[0-7] =: 0x0000000000000000
>      fmt-dependent[8-15]=: 0x0000000005160f00
>      prog action code   =: 0x0000000000000000
>      Configuration info =: 0x00000000000040e0
>      mcode / hi-cyl     =: 0x0000000000000000
>      cyl & head addr [0]=: 0x0000000000000000
>      cyl & head addr [1]=: 0x0000000000000000
>      cyl & head addr [2]=: 0x0000000000000000
> ...snip...
> 
> Reverting this patch allows the boot to proceed normally.  So, ugh.

Agreed on the 'ugh' :(

> 
> I'm not going to claim to know exactly how this is failing, because it's 
> too late for coffee and reading IPL records is difficult enough.  But in 
> the working case (without this patch), cp_init() calls 
> ccwchain_calc_length() and gets a count of four CCWs (SEEK + SIDE + TIC 
> + READ), which is broken up into two chains.  One for the SEEK/SIDE/TIC, 
> and another for the READ.  In the failing case (with this patch), 
> cp_init() only counts three CCWs, we never read the IPL2 address (the 
> second chain in the working case), and our boot fails.
> 
> Unsure what the next best move is here.  It is possible that the 
> (ill-used) check that Farhan noticed and removed was actually added to 
> "make boot work" when handling recursive TIC CCWs such as this.  But I 
> still agree that this patch is correct, even though it exposes more 
> problems with our TIC handling throughout this code at large.  Any thoughts?

Yes, there's probably something rotten in our TIC handling...

Just to make things clear: This patch makes doing I/O on a booted
system more stable, but makes the not-yet-merged QEMU bios boot code
fail, right?

I'll guess I'll stare at the ccw translation code for a bit and see if
something jumps out at me...
Eric Farman Feb. 20, 2019, 11:29 a.m. UTC | #3
On 02/20/2019 04:48 AM, Cornelia Huck wrote:
> On Tue, 19 Feb 2019 21:49:07 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> Hi Connie, Farhan,
>>
>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>
>>> When trying to calculate the length of a ccw chain, we assume
>>> there are ccws after a TIC. This can lead to overcounting and
>>> copying garbage data from guest memory.
>>>
>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>> ---
>>>    drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>>> index 70a006ba4d05..ba08fe137c2e 100644
>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
>>>    			return -EOPNOTSUPP;
>>>    		}
>>>    
>>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>> +		if (!ccw_is_chain(ccw))
>>>    			break;
>>>    
>>>    		ccw++;
>>>    
>>
>> Sigh.  Let me state up front that I'm running vanilla 5.0-rc7 host
>> kernel with this patch applied (or not), and QEMU from a week or two ago.
>>
>> I stated in another thread that this patch makes things better when
>> running fio and other random exercisers within my guest.  But I was
>> always booting off a virtio-blk disk when I did that, and using some
>> additional disks connected by vfio-ccw.  Today I discovered that this
>> patch prevents guest boot from vfio-ccw (which was working previously):
>>
>> /usr/local/bin/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -smp
>> 4 -m 1024 -nographic -s -net none -device
>> vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0
>> -bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img
>> LOADPARM=[        ]
>> vfio-ccw device I/O error - Interrupt Response Block Data:
>>       Function Ctrl : [Start]
>>       Activity Ctrl :
>>       Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
>>       Device Status : [Channel-End] [Device-End]
>>       Channel Status : [Incorrect-Length]
>>       cpa=: 0x0000000000000018
>>       prev_ccw=: 0x3100003840000008
>>       this_ccw=: 0x0800001000000000
>> Eckd Dasd Sense Data (fmt 32-bytes):
>>       Sense Condition Flags :
>>       Residual Count     =: 0x0000000000000000
>>       Phys Drive ID      =: 0x0000000000000000
>>       low cyl address    =: 0x0000000000000000
>>       head addr & hi cyl =: 0x0000000000000000
>>       format/message     =: 0x0000000000000000
>>       fmt-dependent[0-7] =: 0x0000000000000000
>>       fmt-dependent[8-15]=: 0x0000000005160f00
>>       prog action code   =: 0x0000000000000000
>>       Configuration info =: 0x00000000000040e0
>>       mcode / hi-cyl     =: 0x0000000000000000
>>       cyl & head addr [0]=: 0x0000000000000000
>>       cyl & head addr [1]=: 0x0000000000000000
>>       cyl & head addr [2]=: 0x0000000000000000
>> ...snip...
>>
>> Reverting this patch allows the boot to proceed normally.  So, ugh.
> 
> Agreed on the 'ugh' :(
> 
>>
>> I'm not going to claim to know exactly how this is failing, because it's
>> too late for coffee and reading IPL records is difficult enough.  But in
>> the working case (without this patch), cp_init() calls
>> ccwchain_calc_length() and gets a count of four CCWs (SEEK + SIDE + TIC
>> + READ), which is broken up into two chains.  One for the SEEK/SIDE/TIC,
>> and another for the READ.  In the failing case (with this patch),
>> cp_init() only counts three CCWs, we never read the IPL2 address (the
>> second chain in the working case), and our boot fails.
>>
>> Unsure what the next best move is here.  It is possible that the
>> (ill-used) check that Farhan noticed and removed was actually added to
>> "make boot work" when handling recursive TIC CCWs such as this.  But I
>> still agree that this patch is correct, even though it exposes more
>> problems with our TIC handling throughout this code at large.  Any thoughts?
> 
> Yes, there's probably something rotten in our TIC handling...
> 
> Just to make things clear: This patch makes doing I/O on a booted
> system more stable, but makes the not-yet-merged QEMU bios boot code
> fail, right?

Correct.

(I think I deleted that while editing.  Sorry for the confusion.)

> 
> I'll guess I'll stare at the ccw translation code for a bit and see if
> something jumps out at me...
> 

I'll ping Jason a bit later today, and see if anything jumps out at us 
before Farhan returns.
Cornelia Huck Feb. 20, 2019, 12:44 p.m. UTC | #4
On Wed, 20 Feb 2019 06:29:38 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
> > On Tue, 19 Feb 2019 21:49:07 -0500
> > Eric Farman <farman@linux.ibm.com> wrote:
> >   
> >> Hi Connie, Farhan,
> >>
> >> On 02/04/2019 12:06 PM, Cornelia Huck wrote:  
> >>> From: Farhan Ali <alifm@linux.ibm.com>
> >>>
> >>> When trying to calculate the length of a ccw chain, we assume
> >>> there are ccws after a TIC. This can lead to overcounting and
> >>> copying garbage data from guest memory.
> >>>
> >>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> >>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> >>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> >>> ---
> >>>    drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> >>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> >>> index 70a006ba4d05..ba08fe137c2e 100644
> >>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> >>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> >>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
> >>>    			return -EOPNOTSUPP;
> >>>    		}
> >>>    
> >>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> >>> +		if (!ccw_is_chain(ccw))

OK, this function now returns the length of the chain excluding the
last tic.

> >>>    			break;
> >>>    
> >>>    		ccw++;
> >>>    

Now, cp_init will not copy the last tic to the chain. When it then
looks for tics in that new chain, it won't find any, and stop copying. 
> >>
> >> Sigh.  Let me state up front that I'm running vanilla 5.0-rc7 host
> >> kernel with this patch applied (or not), and QEMU from a week or
> >> two ago.
> >>
> >> I stated in another thread that this patch makes things better when
> >> running fio and other random exercisers within my guest.  But I was
> >> always booting off a virtio-blk disk when I did that, and using
> >> some additional disks connected by vfio-ccw.  Today I discovered
> >> that this patch prevents guest boot from vfio-ccw (which was
> >> working previously):
> >>
> >> /usr/local/bin/qemu-system-s390x -machine
> >> s390-ccw-virtio,accel=kvm -smp 4 -m 1024 -nographic -s -net none
> >> -device
> >> vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0
> >> -bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img
> >> LOADPARM=[        ] vfio-ccw device I/O error - Interrupt Response
> >> Block Data: Function Ctrl : [Start]
> >>       Activity Ctrl :
> >>       Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
> >>       Device Status : [Channel-End] [Device-End]
> >>       Channel Status : [Incorrect-Length]
> >>       cpa=: 0x0000000000000018
> >>       prev_ccw=: 0x3100003840000008
> >>       this_ccw=: 0x0800001000000000
> >> Eckd Dasd Sense Data (fmt 32-bytes):
> >>       Sense Condition Flags :
> >>       Residual Count     =: 0x0000000000000000
> >>       Phys Drive ID      =: 0x0000000000000000
> >>       low cyl address    =: 0x0000000000000000
> >>       head addr & hi cyl =: 0x0000000000000000
> >>       format/message     =: 0x0000000000000000
> >>       fmt-dependent[0-7] =: 0x0000000000000000
> >>       fmt-dependent[8-15]=: 0x0000000005160f00
> >>       prog action code   =: 0x0000000000000000
> >>       Configuration info =: 0x00000000000040e0
> >>       mcode / hi-cyl     =: 0x0000000000000000
> >>       cyl & head addr [0]=: 0x0000000000000000
> >>       cyl & head addr [1]=: 0x0000000000000000
> >>       cyl & head addr [2]=: 0x0000000000000000
> >> ...snip...
> >>
> >> Reverting this patch allows the boot to proceed normally.  So,
> >> ugh.  
> > 
> > Agreed on the 'ugh' :(
> >   
> >>
> >> I'm not going to claim to know exactly how this is failing,
> >> because it's too late for coffee and reading IPL records is
> >> difficult enough.  But in the working case (without this patch),
> >> cp_init() calls ccwchain_calc_length() and gets a count of four
> >> CCWs (SEEK + SIDE + TIC
> >> + READ), which is broken up into two chains.  One for the
> >> SEEK/SIDE/TIC, and another for the READ.  In the failing case
> >> (with this patch), cp_init() only counts three CCWs, we never read
> >> the IPL2 address (the second chain in the working case), and our
> >> boot fails.

I think "we don't copy the tic to the chain where we search for tics"
fits those symptoms.

> >>
> >> Unsure what the next best move is here.  It is possible that the
> >> (ill-used) check that Farhan noticed and removed was actually
> >> added to "make boot work" when handling recursive TIC CCWs such as
> >> this.  But I still agree that this patch is correct, even though
> >> it exposes more problems with our TIC handling throughout this
> >> code at large.  Any thoughts?  
> > 
> > Yes, there's probably something rotten in our TIC handling...
> > 
> > Just to make things clear: This patch makes doing I/O on a booted
> > system more stable, but makes the not-yet-merged QEMU bios boot code
> > fail, right?  
> 
> Correct.
> 
> (I think I deleted that while editing.  Sorry for the confusion.)

np, I thought as much :)

> 
> > 
> > I'll guess I'll stare at the ccw translation code for a bit and see
> > if something jumps out at me...

See above. We may need two chains: One without the trailing tic, and
one to process to see where that tic points to... or rework the way to
follow the tic? Not sure.

> >   
> 
> I'll ping Jason a bit later today, and see if anything jumps out at
> us before Farhan returns.

I'm wondering whether we should keep this patch and fix on top of it,
or revert it for now... I'm not sure tic processing is working at all
right now.

Maybe we need a tool for testing that throws random channel programs at
the device :)
Halil Pasic Feb. 20, 2019, 1:22 p.m. UTC | #5
On Wed, 20 Feb 2019 13:44:46 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 20 Feb 2019 06:29:38 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
> > On 02/20/2019 04:48 AM, Cornelia Huck wrote:  
> > > On Tue, 19 Feb 2019 21:49:07 -0500
> > > Eric Farman <farman@linux.ibm.com> wrote:
> > >     
> > >> Hi Connie, Farhan,
> > >>
> > >> On 02/04/2019 12:06 PM, Cornelia Huck wrote:    
> > >>> From: Farhan Ali <alifm@linux.ibm.com>
> > >>>
> > >>> When trying to calculate the length of a ccw chain, we assume
> > >>> there are ccws after a TIC. This can lead to overcounting and
> > >>> copying garbage data from guest memory.
> > >>>
> > >>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > >>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> > >>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> > >>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > >>> ---
> > >>>    drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> > >>>    1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> > >>> index 70a006ba4d05..ba08fe137c2e 100644
> > >>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> > >>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > >>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
> > >>>    			return -EOPNOTSUPP;
> > >>>    		}
> > >>>    
> > >>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> > >>> +		if (!ccw_is_chain(ccw))  
> 
> OK, this function now returns the length of the chain excluding the
> last tic.
> 

I'm confused. I read this like the length includes the tic, but not the
ccw? after the tic. Or am I wrong?


> > >>>    			break;
> > >>>    
> > >>>    		ccw++;
> > >>>      
> 
> Now, cp_init will not copy the last tic to the chain. When it then
> looks for tics in that new chain, it won't find any, and stop copying. 

Eric also said the TIC is included but the subsequent READ gets 'dropped'
from (SEEK + SIDE + TIC + READ).

Regards,
Halil
> >> + READ
Halil Pasic Feb. 20, 2019, 1:37 p.m. UTC | #6
On Tue, 19 Feb 2019 21:49:07 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> Hi Connie, Farhan,
> 
> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
> > From: Farhan Ali <alifm@linux.ibm.com>
> > 
> > When trying to calculate the length of a ccw chain, we assume
> > there are ccws after a TIC. This can lead to overcounting and
> > copying garbage data from guest memory.
> > 
> > Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> > Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > ---
> >   drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> > index 70a006ba4d05..ba08fe137c2e 100644
> > --- a/drivers/s390/cio/vfio_ccw_cp.c
> > +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
> >   			return -EOPNOTSUPP;
> >   		}
> >   
> > -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> > +		if (!ccw_is_chain(ccw))
> >   			break;
> >   
> >   		ccw++;
> > 
> 
> Sigh.  Let me state up front that I'm running vanilla 5.0-rc7 host 
> kernel with this patch applied (or not), and QEMU from a week or two ago.
> 
> I stated in another thread that this patch makes things better when 
> running fio and other random exercisers within my guest.  But I was 
> always booting off a virtio-blk disk when I did that, and using some 
> additional disks connected by vfio-ccw.  Today I discovered that this 
> patch prevents guest boot from vfio-ccw (which was working previously):
> 
> /usr/local/bin/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -smp 
> 4 -m 1024 -nographic -s -net none -device 
> vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0 
> -bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img
> LOADPARM=[        ]
> vfio-ccw device I/O error - Interrupt Response Block Data:
>      Function Ctrl : [Start]
>      Activity Ctrl :
>      Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
>      Device Status : [Channel-End] [Device-End]
>      Channel Status : [Incorrect-Length]
>      cpa=: 0x0000000000000018
>      prev_ccw=: 0x3100003840000008
>      this_ccw=: 0x0800001000000000
> Eckd Dasd Sense Data (fmt 32-bytes):
>      Sense Condition Flags :
>      Residual Count     =: 0x0000000000000000
>      Phys Drive ID      =: 0x0000000000000000
>      low cyl address    =: 0x0000000000000000
>      head addr & hi cyl =: 0x0000000000000000
>      format/message     =: 0x0000000000000000
>      fmt-dependent[0-7] =: 0x0000000000000000
>      fmt-dependent[8-15]=: 0x0000000005160f00
>      prog action code   =: 0x0000000000000000
>      Configuration info =: 0x00000000000040e0
>      mcode / hi-cyl     =: 0x0000000000000000
>      cyl & head addr [0]=: 0x0000000000000000
>      cyl & head addr [1]=: 0x0000000000000000
>      cyl & head addr [2]=: 0x0000000000000000
> ...snip...
> 
> Reverting this patch allows the boot to proceed normally.  So, ugh.
> 
> I'm not going to claim to know exactly how this is failing, because it's 
> too late for coffee and reading IPL records is difficult enough.  But in 
> the working case (without this patch), cp_init() calls 
> ccwchain_calc_length() and gets a count of four CCWs (SEEK + SIDE + TIC 
> + READ), which is broken up into two chains.  One for the SEEK/SIDE/TIC, 
> and another for the READ.  In the failing case (with this patch), 
> cp_init() only counts three CCWs, we never read the IPL2 address (the 
> second chain in the working case), and our boot fails.
> 
> Unsure what the next best move is here.  It is possible that the 
> (ill-used) check that Farhan noticed and removed was actually added to 
> "make boot work" when handling recursive TIC CCWs such as this.  But I 
> still agree that this patch is correct, even though it exposes more 
> problems with our TIC handling throughout this code at large.  Any thoughts?
> 

With the annotation cp[] := {SEEK, SIDE, TIC, READ};

Do we jump over cp[2] (TIC) because we get presented with a status
modifier when executing cp[1] (SIDE)? Where does the cp[2] (TIC) point
to? Does it point to cp[3] (READ)? There should be a new chain
originating from the TIC in both cases.

As I stated previously, there is usally no way to tell where is the end
of an channel program: we can get status modifier and then we have to
jump over the next ccw (which may look like the last one if this jump
on status modifier is ignored). That's why I proposed poisoning what we
suspect is the after last ccw of the channel program.

Does that make sense?

Regards,
Halil
Cornelia Huck Feb. 20, 2019, 3:28 p.m. UTC | #7
On Wed, 20 Feb 2019 14:22:24 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 20 Feb 2019 13:44:46 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Wed, 20 Feb 2019 06:29:38 -0500
> > Eric Farman <farman@linux.ibm.com> wrote:
> >   
> > > On 02/20/2019 04:48 AM, Cornelia Huck wrote:    
> > > > On Tue, 19 Feb 2019 21:49:07 -0500
> > > > Eric Farman <farman@linux.ibm.com> wrote:
> > > >       
> > > >> Hi Connie, Farhan,
> > > >>
> > > >> On 02/04/2019 12:06 PM, Cornelia Huck wrote:      
> > > >>> From: Farhan Ali <alifm@linux.ibm.com>
> > > >>>
> > > >>> When trying to calculate the length of a ccw chain, we assume
> > > >>> there are ccws after a TIC. This can lead to overcounting and
> > > >>> copying garbage data from guest memory.
> > > >>>
> > > >>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > > >>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> > > >>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> > > >>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > > >>> ---
> > > >>>    drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> > > >>>    1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>>
> > > >>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> > > >>> index 70a006ba4d05..ba08fe137c2e 100644
> > > >>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> > > >>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > > >>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
> > > >>>    			return -EOPNOTSUPP;
> > > >>>    		}
> > > >>>    
> > > >>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> > > >>> +		if (!ccw_is_chain(ccw))    
> > 
> > OK, this function now returns the length of the chain excluding the
> > last tic.
> >   
> 
> I'm confused. I read this like the length includes the tic, but not the
> ccw? after the tic. Or am I wrong?
> 
> 
> > > >>>    			break;
> > > >>>    
> > > >>>    		ccw++;
> > > >>>        
> > 
> > Now, cp_init will not copy the last tic to the chain. When it then
> > looks for tics in that new chain, it won't find any, and stop copying.   
> 
> Eric also said the TIC is included but the subsequent READ gets 'dropped'
> from (SEEK + SIDE + TIC + READ).

Then I'm out of ideas. Are we sure the channel program is correct?
Eric Farman Feb. 20, 2019, 4:07 p.m. UTC | #8
On 02/20/2019 10:28 AM, Cornelia Huck wrote:
> On Wed, 20 Feb 2019 14:22:24 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>> On Wed, 20 Feb 2019 13:44:46 +0100
>> Cornelia Huck <cohuck@redhat.com> wrote:
>>
>>> On Wed, 20 Feb 2019 06:29:38 -0500
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>    
>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>        
>>>>>> Hi Connie, Farhan,
>>>>>>
>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>
>>>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>>>> copying garbage data from guest memory.
>>>>>>>
>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>>>> ---
>>>>>>>     drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
>>>>>>>     			return -EOPNOTSUPP;
>>>>>>>     		}
>>>>>>>     
>>>>>>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>>>> +		if (!ccw_is_chain(ccw))
>>>
>>> OK, this function now returns the length of the chain excluding the
>>> last tic.
>>>    
>>
>> I'm confused. I read this like the length includes the tic, but not the
>> ccw? after the tic. Or am I wrong?
>>
>>
>>>>>>>     			break;
>>>>>>>     
>>>>>>>     		ccw++;
>>>>>>>         
>>>
>>> Now, cp_init will not copy the last tic to the chain. When it then
>>> looks for tics in that new chain, it won't find any, and stop copying.
>>
>> Eric also said the TIC is included but the subsequent READ gets 'dropped'
>> from (SEEK + SIDE + TIC + READ).
> 
> Then I'm out of ideas. Are we sure the channel program is correct?
> 

We shouldn't assume that I read things correctly.  :)  I have not looked 
at Jason's patches, and was going by what I saw being referenced in 
ccwchain_fetch_{one|tic} with or without this patch.

Now that I've got the morning meetings out of the way, I'll spend some 
time sorting out whether we're dropping a read or a tic.
Eric Farman Feb. 20, 2019, 4:25 p.m. UTC | #9
On 02/20/2019 07:44 AM, Cornelia Huck wrote:
> On Wed, 20 Feb 2019 06:29:38 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>    
>>>> Hi Connie, Farhan,
>>>>
>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>
>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>> copying garbage data from guest memory.
>>>>>
>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>> ---
>>>>>     drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
>>>>>     			return -EOPNOTSUPP;
>>>>>     		}
>>>>>     
>>>>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>> +		if (!ccw_is_chain(ccw))
> 
> OK, this function now returns the length of the chain excluding the
> last tic.
> 
>>>>>     			break;
>>>>>     
>>>>>     		ccw++;
>>>>>     
> 
> Now, cp_init will not copy the last tic to the chain. When it then
> looks for tics in that new chain, it won't find any, and stop copying.
>>>>
>>>> Sigh.  Let me state up front that I'm running vanilla 5.0-rc7 host
>>>> kernel with this patch applied (or not), and QEMU from a week or
>>>> two ago.
>>>>
>>>> I stated in another thread that this patch makes things better when
>>>> running fio and other random exercisers within my guest.  But I was
>>>> always booting off a virtio-blk disk when I did that, and using
>>>> some additional disks connected by vfio-ccw.  Today I discovered
>>>> that this patch prevents guest boot from vfio-ccw (which was
>>>> working previously):
>>>>
>>>> /usr/local/bin/qemu-system-s390x -machine
>>>> s390-ccw-virtio,accel=kvm -smp 4 -m 1024 -nographic -s -net none
>>>> -device
>>>> vfio-ccw,sysfsdev=/sys/class/mdev_bus/0.0.0cbb/9503d38c-b83a-45b1-a2f0-393739817786,devno=fe.0.0e1e,force-orb-pfch=true,bootindex=0
>>>> -bios /usr/src/qemu/build/pc-bios/s390-ccw/s390-ccw.img
>>>> LOADPARM=[        ] vfio-ccw device I/O error - Interrupt Response
>>>> Block Data: Function Ctrl : [Start]
>>>>        Activity Ctrl :
>>>>        Status Ctrl : [Alert] [Primary] [Secondary] [Status-Pending]
>>>>        Device Status : [Channel-End] [Device-End]
>>>>        Channel Status : [Incorrect-Length]
>>>>        cpa=: 0x0000000000000018
>>>>        prev_ccw=: 0x3100003840000008
>>>>        this_ccw=: 0x0800001000000000
>>>> Eckd Dasd Sense Data (fmt 32-bytes):
>>>>        Sense Condition Flags :
>>>>        Residual Count     =: 0x0000000000000000
>>>>        Phys Drive ID      =: 0x0000000000000000
>>>>        low cyl address    =: 0x0000000000000000
>>>>        head addr & hi cyl =: 0x0000000000000000
>>>>        format/message     =: 0x0000000000000000
>>>>        fmt-dependent[0-7] =: 0x0000000000000000
>>>>        fmt-dependent[8-15]=: 0x0000000005160f00
>>>>        prog action code   =: 0x0000000000000000
>>>>        Configuration info =: 0x00000000000040e0
>>>>        mcode / hi-cyl     =: 0x0000000000000000
>>>>        cyl & head addr [0]=: 0x0000000000000000
>>>>        cyl & head addr [1]=: 0x0000000000000000
>>>>        cyl & head addr [2]=: 0x0000000000000000
>>>> ...snip...
>>>>
>>>> Reverting this patch allows the boot to proceed normally.  So,
>>>> ugh.
>>>
>>> Agreed on the 'ugh' :(
>>>    
>>>>
>>>> I'm not going to claim to know exactly how this is failing,
>>>> because it's too late for coffee and reading IPL records is
>>>> difficult enough.  But in the working case (without this patch),
>>>> cp_init() calls ccwchain_calc_length() and gets a count of four
>>>> CCWs (SEEK + SIDE + TIC
>>>> + READ), which is broken up into two chains.  One for the
>>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
>>>> (with this patch), cp_init() only counts three CCWs, we never read
>>>> the IPL2 address (the second chain in the working case), and our
>>>> boot fails.
> 
> I think "we don't copy the tic to the chain where we search for tics"
> fits those symptoms.

Indeed.  Hrm...

> 
>>>>
>>>> Unsure what the next best move is here.  It is possible that the
>>>> (ill-used) check that Farhan noticed and removed was actually
>>>> added to "make boot work" when handling recursive TIC CCWs such as
>>>> this.  But I still agree that this patch is correct, even though
>>>> it exposes more problems with our TIC handling throughout this
>>>> code at large.  Any thoughts?
>>>
>>> Yes, there's probably something rotten in our TIC handling...
>>>
>>> Just to make things clear: This patch makes doing I/O on a booted
>>> system more stable, but makes the not-yet-merged QEMU bios boot code
>>> fail, right?
>>
>> Correct.
>>
>> (I think I deleted that while editing.  Sorry for the confusion.)
> 
> np, I thought as much :)
> 
>>
>>>
>>> I'll guess I'll stare at the ccw translation code for a bit and see
>>> if something jumps out at me...
> 
> See above. We may need two chains: One without the trailing tic, and
> one to process to see where that tic points to... or rework the way to
> follow the tic? Not sure.

I hate the idea of two chains, because how much storage do we end up 
needing to consume?  But maybe it's inevitable, or maybe I can think up 
a way to rework it without.

> 
>>>    
>>
>> I'll ping Jason a bit later today, and see if anything jumps out at
>> us before Farhan returns.
> 
> I'm wondering whether we should keep this patch and fix on top of it,
> or revert it for now... I'm not sure tic processing is working at all
> right now.

I've been wondering this too.  Yeah, the bios code isn't merged yet, but 
this patch means there's no point in merging it until we get TIC fixed 
properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of 
spaghetti is.

> 
> Maybe we need a tool for testing that throws random channel programs at
> the device :)
> 

Are you peeking at my todo/wish list?  :)
Cornelia Huck Feb. 20, 2019, 4:44 p.m. UTC | #10
On Wed, 20 Feb 2019 11:25:46 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
> > On Wed, 20 Feb 2019 06:29:38 -0500
> > Eric Farman <farman@linux.ibm.com> wrote:
> >   
> >> On 02/20/2019 04:48 AM, Cornelia Huck wrote:  
> >>> On Tue, 19 Feb 2019 21:49:07 -0500
> >>> Eric Farman <farman@linux.ibm.com> wrote:
> >>>      
> >>>> Hi Connie, Farhan,
> >>>>
> >>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:  
> >>>>> From: Farhan Ali <alifm@linux.ibm.com>
> >>>>>
> >>>>> When trying to calculate the length of a ccw chain, we assume
> >>>>> there are ccws after a TIC. This can lead to overcounting and
> >>>>> copying garbage data from guest memory.
> >>>>>
> >>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >>>>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
> >>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> >>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> >>>>> ---
> >>>>>     drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> >>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> >>>>> index 70a006ba4d05..ba08fe137c2e 100644
> >>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> >>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> >>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
> >>>>>     			return -EOPNOTSUPP;
> >>>>>     		}
> >>>>>     
> >>>>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> >>>>> +		if (!ccw_is_chain(ccw))  
> > 
> > OK, this function now returns the length of the chain excluding the
> > last tic.

Hm, not so sure about that anymore. I'm a bit tired, please apply salt
where needed to what I'm seeing.

> >   
> >>>>>     			break;
> >>>>>     
> >>>>>     		ccw++;
> >>>>>       
> > 
> > Now, cp_init will not copy the last tic to the chain. When it then
> > looks for tics in that new chain, it won't find any, and stop copying.  
(...)
> >>>> I'm not going to claim to know exactly how this is failing,
> >>>> because it's too late for coffee and reading IPL records is
> >>>> difficult enough.  But in the working case (without this patch),
> >>>> cp_init() calls ccwchain_calc_length() and gets a count of four
> >>>> CCWs (SEEK + SIDE + TIC
> >>>> + READ), which is broken up into two chains.  One for the
> >>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
> >>>> (with this patch), cp_init() only counts three CCWs, we never read
> >>>> the IPL2 address (the second chain in the working case), and our
> >>>> boot fails.  
> > 
> > I think "we don't copy the tic to the chain where we search for tics"
> > fits those symptoms.  
> 
> Indeed.  Hrm...
(...)
> > See above. We may need two chains: One without the trailing tic, and
> > one to process to see where that tic points to... or rework the way to
> > follow the tic? Not sure.  
> 
> I hate the idea of two chains, because how much storage do we end up 
> needing to consume?  But maybe it's inevitable, or maybe I can think up 
> a way to rework it without.

Yeah, that sucks. If we really need to track tics separately (and I'm
not so sure, see above), there should be a better way for that...

(...)

> > I'm wondering whether we should keep this patch and fix on top of it,
> > or revert it for now... I'm not sure tic processing is working at all
> > right now.  
> 
> I've been wondering this too.  Yeah, the bios code isn't merged yet, but 
> this patch means there's no point in merging it until we get TIC fixed 
> properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of 
> spaghetti is.

We also need to make sure that this is not a bug in the bios instead...

> 
> > 
> > Maybe we need a tool for testing that throws random channel programs at
> > the device :)
> >   
> 
> Are you peeking at my todo/wish list?  :)

:)

Might be something for kvm unit tests? Or do I misunderstand what they
can do?
Eric Farman Feb. 20, 2019, 10:35 p.m. UTC | #11
On 02/20/2019 11:44 AM, Cornelia Huck wrote:
> On Wed, 20 Feb 2019 11:25:46 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
>>> On Wed, 20 Feb 2019 06:29:38 -0500
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>    
>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>       
>>>>>> Hi Connie, Farhan,
>>>>>>
>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>
>>>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>>>> copying garbage data from guest memory.
>>>>>>>
>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>> Message-Id: <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>>>> ---
>>>>>>>      drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
>>>>>>>      			return -EOPNOTSUPP;
>>>>>>>      		}
>>>>>>>      
>>>>>>> -		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>>>> +		if (!ccw_is_chain(ccw))
>>>
>>> OK, this function now returns the length of the chain excluding the
>>> last tic.
> 
> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
> where needed to what I'm seeing.

I didn't get as far as I had hoped today, so I'm going to need to reset 
to tomorrow, have coffee, and try again.  But it does seem that with 
this patch, we calculate the length of the chain up to and including the 
TIC, and nothing beyond it.

That is, during the boot process we calculate a chain length of "3" for 
a SEEK + SIDE + TIC that QEMU built.

> 
>>>    
>>>>>>>      			break;
>>>>>>>      
>>>>>>>      		ccw++;
>>>>>>>        
>>>
>>> Now, cp_init will not copy the last tic to the chain. When it then
>>> looks for tics in that new chain, it won't find any, and stop copying.
> (...)
>>>>>> I'm not going to claim to know exactly how this is failing,
>>>>>> because it's too late for coffee and reading IPL records is
>>>>>> difficult enough.  But in the working case (without this patch),
>>>>>> cp_init() calls ccwchain_calc_length() and gets a count of four
>>>>>> CCWs (SEEK + SIDE + TIC
>>>>>> + READ), which is broken up into two chains.  One for the
>>>>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
>>>>>> (with this patch), cp_init() only counts three CCWs, we never read
>>>>>> the IPL2 address (the second chain in the working case), and our
>>>>>> boot fails.
>>>
>>> I think "we don't copy the tic to the chain where we search for tics"
>>> fits those symptoms.
>>
>> Indeed.  Hrm...
> (...)
>>> See above. We may need two chains: One without the trailing tic, and
>>> one to process to see where that tic points to... or rework the way to
>>> follow the tic? Not sure.
>>
>> I hate the idea of two chains, because how much storage do we end up
>> needing to consume?  But maybe it's inevitable, or maybe I can think up
>> a way to rework it without.
> 
> Yeah, that sucks. If we really need to track tics separately (and I'm
> not so sure, see above), there should be a better way for that...
> 
> (...)
> 
>>> I'm wondering whether we should keep this patch and fix on top of it,
>>> or revert it for now... I'm not sure tic processing is working at all
>>> right now.
>>
>> I've been wondering this too.  Yeah, the bios code isn't merged yet, but
>> this patch means there's no point in merging it until we get TIC fixed
>> properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of
>> spaghetti is.
> 
> We also need to make sure that this is not a bug in the bios instead...

Of course.  I don't see it yet, but I haven't looked at the code very 
closely.  From what he wrote in $qemusrc/docs/devel, it seems okay and 
the CCWs I see showing up in vfio-ccw seem to match with it.  We just 
have trouble distinguishing between a "forward" versus "backward" TIC.

> 
>>
>>>
>>> Maybe we need a tool for testing that throws random channel programs at
>>> the device :)
>>>    
>>
>> Are you peeking at my todo/wish list?  :)
> 
> :)
> 
> Might be something for kvm unit tests? Or do I misunderstand what they
> can do?
> 

I haven't considered that too deeply, but I think that could be 
possible.  Maybe not quickly, but possible.
Eric Farman Feb. 21, 2019, 3:02 a.m. UTC | #12
On 02/20/2019 05:35 PM, Eric Farman wrote:
> 
> 
> On 02/20/2019 11:44 AM, Cornelia Huck wrote:
>> On Wed, 20 Feb 2019 11:25:46 -0500
>> Eric Farman <farman@linux.ibm.com> wrote:
>>
>>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
>>>> On Wed, 20 Feb 2019 06:29:38 -0500
>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>> Hi Connie, Farhan,
>>>>>>>
>>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>
>>>>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>>>>> copying garbage data from guest memory.
>>>>>>>>
>>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>> Message-Id: 
>>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com> 
>>>>>>>>
>>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c 
>>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, 
>>>>>>>> struct channel_program *cp)
>>>>>>>>                  return -EOPNOTSUPP;
>>>>>>>>              }
>>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>>>>> +        if (!ccw_is_chain(ccw))
>>>>
>>>> OK, this function now returns the length of the chain excluding the
>>>> last tic.
>>
>> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
>> where needed to what I'm seeing.
> 
> I didn't get as far as I had hoped today, so I'm going to need to reset 
> to tomorrow, have coffee, and try again.  But it does seem that with 
> this patch, we calculate the length of the chain up to and including the 
> TIC, and nothing beyond it.
> 
> That is, during the boot process we calculate a chain length of "3" for 
> a SEEK + SIDE + TIC that QEMU built.
> 
>>
>>>>>>>>                  break;
>>>>>>>>              ccw++;
>>>>
>>>> Now, cp_init will not copy the last tic to the chain. When it then
>>>> looks for tics in that new chain, it won't find any, and stop copying.
>> (...)
>>>>>>> I'm not going to claim to know exactly how this is failing,
>>>>>>> because it's too late for coffee and reading IPL records is
>>>>>>> difficult enough.  But in the working case (without this patch),
>>>>>>> cp_init() calls ccwchain_calc_length() and gets a count of four
>>>>>>> CCWs (SEEK + SIDE + TIC
>>>>>>> + READ), which is broken up into two chains.  One for the
>>>>>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
>>>>>>> (with this patch), cp_init() only counts three CCWs, we never read
>>>>>>> the IPL2 address (the second chain in the working case), and our
>>>>>>> boot fails.
>>>>
>>>> I think "we don't copy the tic to the chain where we search for tics"
>>>> fits those symptoms.
>>>
>>> Indeed.  Hrm...
>> (...)
>>>> See above. We may need two chains: One without the trailing tic, and
>>>> one to process to see where that tic points to... or rework the way to
>>>> follow the tic? Not sure.
>>>
>>> I hate the idea of two chains, because how much storage do we end up
>>> needing to consume?  But maybe it's inevitable, or maybe I can think up
>>> a way to rework it without.
>>
>> Yeah, that sucks. If we really need to track tics separately (and I'm
>> not so sure, see above), there should be a better way for that...
>>
>> (...)
>>
>>>> I'm wondering whether we should keep this patch and fix on top of it,
>>>> or revert it for now... I'm not sure tic processing is working at all
>>>> right now.
>>>
>>> I've been wondering this too.  Yeah, the bios code isn't merged yet, but
>>> this patch means there's no point in merging it until we get TIC fixed
>>> properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of
>>> spaghetti is.
>>
>> We also need to make sure that this is not a bug in the bios instead...
> 
> Of course.  I don't see it yet, but I haven't looked at the code very 
> closely.  From what he wrote in $qemusrc/docs/devel, it seems okay and 
> the CCWs I see showing up in vfio-ccw seem to match with it.  We just 
> have trouble distinguishing between a "forward" versus "backward" TIC.

I just replied to patch 15 of Jason's bios series, with a small fix.  I 
didn't pay particularly close attention at the time, but the original 
error I sent here was an Incorrect Length, and both the SEEK and SIDE 
had a count of 8 bytes.  Fixing that converts the error to a program 
check, and the cpa in the irb points to where the read would be.  So 
this gets things to a more accurate "do we count/copy the TIC when 
handling our inputs" scenario as discussed above.

(Why didn't it complain when SEEK set the SLI bit, but SIDE didn't?  Odd.)

But at this point, both my laptop and I are out of energy, and I should 
plug both in.  Will pick things up tomorrow.

> 
>>
>>>
>>>>
>>>> Maybe we need a tool for testing that throws random channel programs at
>>>> the device :)
>>>
>>> Are you peeking at my todo/wish list?  :)
>>
>> :)
>>
>> Might be something for kvm unit tests? Or do I misunderstand what they
>> can do?
>>
> 
> I haven't considered that too deeply, but I think that could be 
> possible.  Maybe not quickly, but possible.
Cornelia Huck Feb. 21, 2019, 9:30 a.m. UTC | #13
On Wed, 20 Feb 2019 22:02:10 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 02/20/2019 05:35 PM, Eric Farman wrote:
> > 
> > 
> > On 02/20/2019 11:44 AM, Cornelia Huck wrote:  
> >> On Wed, 20 Feb 2019 11:25:46 -0500
> >> Eric Farman <farman@linux.ibm.com> wrote:
> >>  
> >>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:  
> >>>> On Wed, 20 Feb 2019 06:29:38 -0500
> >>>> Eric Farman <farman@linux.ibm.com> wrote:  
> >>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:  
> >>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
> >>>>>> Eric Farman <farman@linux.ibm.com> wrote:  
> >>>>>>> Hi Connie, Farhan,
> >>>>>>>
> >>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:  
> >>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
> >>>>>>>>
> >>>>>>>> When trying to calculate the length of a ccw chain, we assume
> >>>>>>>> there are ccws after a TIC. This can lead to overcounting and
> >>>>>>>> copying garbage data from guest memory.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >>>>>>>> Message-Id: 
> >>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com> 
> >>>>>>>>
> >>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> >>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> >>>>>>>> ---
> >>>>>>>>      drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> >>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c 
> >>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
> >>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, 
> >>>>>>>> struct channel_program *cp)
> >>>>>>>>                  return -EOPNOTSUPP;
> >>>>>>>>              }
> >>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> >>>>>>>> +        if (!ccw_is_chain(ccw))  
> >>>>
> >>>> OK, this function now returns the length of the chain excluding the
> >>>> last tic.  
> >>
> >> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
> >> where needed to what I'm seeing.  
> > 
> > I didn't get as far as I had hoped today, so I'm going to need to reset 
> > to tomorrow, have coffee, and try again.  But it does seem that with 
> > this patch, we calculate the length of the chain up to and including the 
> > TIC, and nothing beyond it.
> > 
> > That is, during the boot process we calculate a chain length of "3" for 
> > a SEEK + SIDE + TIC that QEMU built.

Hm... so it looks like that code does what it says on the tin. But why
are we missing the second round, that should give us a chain with the
forth ccw? Are we missing a loop somewhere?

> >   
> >>  
> >>>>>>>>                  break;
> >>>>>>>>              ccw++;  
> >>>>
> >>>> Now, cp_init will not copy the last tic to the chain. When it then
> >>>> looks for tics in that new chain, it won't find any, and stop copying.  
> >> (...)  
> >>>>>>> I'm not going to claim to know exactly how this is failing,
> >>>>>>> because it's too late for coffee and reading IPL records is
> >>>>>>> difficult enough.  But in the working case (without this patch),
> >>>>>>> cp_init() calls ccwchain_calc_length() and gets a count of four
> >>>>>>> CCWs (SEEK + SIDE + TIC
> >>>>>>> + READ), which is broken up into two chains.  One for the
> >>>>>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
> >>>>>>> (with this patch), cp_init() only counts three CCWs, we never read
> >>>>>>> the IPL2 address (the second chain in the working case), and our
> >>>>>>> boot fails.  
> >>>>
> >>>> I think "we don't copy the tic to the chain where we search for tics"
> >>>> fits those symptoms.  
> >>>
> >>> Indeed.  Hrm...  
> >> (...)  
> >>>> See above. We may need two chains: One without the trailing tic, and
> >>>> one to process to see where that tic points to... or rework the way to
> >>>> follow the tic? Not sure.  
> >>>
> >>> I hate the idea of two chains, because how much storage do we end up
> >>> needing to consume?  But maybe it's inevitable, or maybe I can think up
> >>> a way to rework it without.  
> >>
> >> Yeah, that sucks. If we really need to track tics separately (and I'm
> >> not so sure, see above), there should be a better way for that...
> >>
> >> (...)
> >>  
> >>>> I'm wondering whether we should keep this patch and fix on top of it,
> >>>> or revert it for now... I'm not sure tic processing is working at all
> >>>> right now.  
> >>>
> >>> I've been wondering this too.  Yeah, the bios code isn't merged yet, but
> >>> this patch means there's no point in merging it until we get TIC fixed
> >>> properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of
> >>> spaghetti is.  
> >>
> >> We also need to make sure that this is not a bug in the bios instead...  
> > 
> > Of course.  I don't see it yet, but I haven't looked at the code very 
> > closely.  From what he wrote in $qemusrc/docs/devel, it seems okay and 
> > the CCWs I see showing up in vfio-ccw seem to match with it.  We just 
> > have trouble distinguishing between a "forward" versus "backward" TIC.  
> 
> I just replied to patch 15 of Jason's bios series, with a small fix.  I 
> didn't pay particularly close attention at the time, but the original 
> error I sent here was an Incorrect Length, and both the SEEK and SIDE 
> had a count of 8 bytes.  Fixing that converts the error to a program 
> check, and the cpa in the irb points to where the read would be.  So 
> this gets things to a more accurate "do we count/copy the TIC when 
> handling our inputs" scenario as discussed above.

Your fix in the bios code looks correct.

By "program check" you mean "channel program check", right? Do you know
which ccw triggers that?

> 
> (Why didn't it complain when SEEK set the SLI bit, but SIDE didn't?  Odd.)
> 
> But at this point, both my laptop and I are out of energy, and I should 
> plug both in.  Will pick things up tomorrow.

Sure, please do recharge :) I'll see if I can figure out something
with that information above.
Cornelia Huck Feb. 21, 2019, 10:03 a.m. UTC | #14
On Thu, 21 Feb 2019 10:30:01 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 20 Feb 2019 22:02:10 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
> > On 02/20/2019 05:35 PM, Eric Farman wrote:  
> > > 
> > > 
> > > On 02/20/2019 11:44 AM, Cornelia Huck wrote:    
> > >> On Wed, 20 Feb 2019 11:25:46 -0500
> > >> Eric Farman <farman@linux.ibm.com> wrote:
> > >>    
> > >>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:    
> > >>>> On Wed, 20 Feb 2019 06:29:38 -0500
> > >>>> Eric Farman <farman@linux.ibm.com> wrote:    
> > >>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:    
> > >>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
> > >>>>>> Eric Farman <farman@linux.ibm.com> wrote:    
> > >>>>>>> Hi Connie, Farhan,
> > >>>>>>>
> > >>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:    
> > >>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
> > >>>>>>>>
> > >>>>>>>> When trying to calculate the length of a ccw chain, we assume
> > >>>>>>>> there are ccws after a TIC. This can lead to overcounting and
> > >>>>>>>> copying garbage data from guest memory.
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > >>>>>>>> Message-Id: 
> > >>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com> 
> > >>>>>>>>
> > >>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> > >>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > >>>>>>>> ---
> > >>>>>>>>      drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> > >>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>>>>>
> > >>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c 
> > >>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
> > >>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
> > >>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> > >>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > >>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, 
> > >>>>>>>> struct channel_program *cp)
> > >>>>>>>>                  return -EOPNOTSUPP;
> > >>>>>>>>              }
> > >>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> > >>>>>>>> +        if (!ccw_is_chain(ccw))    
> > >>>>
> > >>>> OK, this function now returns the length of the chain excluding the
> > >>>> last tic.    
> > >>
> > >> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
> > >> where needed to what I'm seeing.    
> > > 
> > > I didn't get as far as I had hoped today, so I'm going to need to reset 
> > > to tomorrow, have coffee, and try again.  But it does seem that with 
> > > this patch, we calculate the length of the chain up to and including the 
> > > TIC, and nothing beyond it.
> > > 
> > > That is, during the boot process we calculate a chain length of "3" for 
> > > a SEEK + SIDE + TIC that QEMU built.  
> 
> Hm... so it looks like that code does what it says on the tin. But why
> are we missing the second round, that should give us a chain with the
> forth ccw? Are we missing a loop somewhere?

One thing where this is different now that we don't count the last READ
is that tic_target_chain_exists() won't return 1 if we tic to the READ,
as it has not been copied. So we execute code that presumably had not
been executed before...

> 
> > >     
> > >>    
> > >>>>>>>>                  break;
> > >>>>>>>>              ccw++;    
> > >>>>
> > >>>> Now, cp_init will not copy the last tic to the chain. When it then
> > >>>> looks for tics in that new chain, it won't find any, and stop copying.    
> > >> (...)    
> > >>>>>>> I'm not going to claim to know exactly how this is failing,
> > >>>>>>> because it's too late for coffee and reading IPL records is
> > >>>>>>> difficult enough.  But in the working case (without this patch),
> > >>>>>>> cp_init() calls ccwchain_calc_length() and gets a count of four
> > >>>>>>> CCWs (SEEK + SIDE + TIC
> > >>>>>>> + READ), which is broken up into two chains.  One for the
> > >>>>>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
> > >>>>>>> (with this patch), cp_init() only counts three CCWs, we never read
> > >>>>>>> the IPL2 address (the second chain in the working case), and our
> > >>>>>>> boot fails.    
> > >>>>
> > >>>> I think "we don't copy the tic to the chain where we search for tics"
> > >>>> fits those symptoms.    
> > >>>
> > >>> Indeed.  Hrm...    
> > >> (...)    
> > >>>> See above. We may need two chains: One without the trailing tic, and
> > >>>> one to process to see where that tic points to... or rework the way to
> > >>>> follow the tic? Not sure.    
> > >>>
> > >>> I hate the idea of two chains, because how much storage do we end up
> > >>> needing to consume?  But maybe it's inevitable, or maybe I can think up
> > >>> a way to rework it without.    
> > >>
> > >> Yeah, that sucks. If we really need to track tics separately (and I'm
> > >> not so sure, see above), there should be a better way for that...
> > >>
> > >> (...)
> > >>    
> > >>>> I'm wondering whether we should keep this patch and fix on top of it,
> > >>>> or revert it for now... I'm not sure tic processing is working at all
> > >>>> right now.    
> > >>>
> > >>> I've been wondering this too.  Yeah, the bios code isn't merged yet, but
> > >>> this patch means there's no point in merging it until we get TIC fixed
> > >>> properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of
> > >>> spaghetti is.    
> > >>
> > >> We also need to make sure that this is not a bug in the bios instead...    
> > > 
> > > Of course.  I don't see it yet, but I haven't looked at the code very 
> > > closely.  From what he wrote in $qemusrc/docs/devel, it seems okay and 
> > > the CCWs I see showing up in vfio-ccw seem to match with it.  We just 
> > > have trouble distinguishing between a "forward" versus "backward" TIC.    
> > 
> > I just replied to patch 15 of Jason's bios series, with a small fix.  I 
> > didn't pay particularly close attention at the time, but the original 
> > error I sent here was an Incorrect Length, and both the SEEK and SIDE 
> > had a count of 8 bytes.  Fixing that converts the error to a program 
> > check, and the cpa in the irb points to where the read would be.  So 
> > this gets things to a more accurate "do we count/copy the TIC when 
> > handling our inputs" scenario as discussed above.  
> 
> Your fix in the bios code looks correct.
> 
> By "program check" you mean "channel program check", right? Do you know
> which ccw triggers that?

Also would be interesting when it breaks. Is it when we want to execute
the READ, or does it break on a SEEK?
Eric Farman Feb. 21, 2019, 12:46 p.m. UTC | #15
On 02/21/2019 05:03 AM, Cornelia Huck wrote:
> On Thu, 21 Feb 2019 10:30:01 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Wed, 20 Feb 2019 22:02:10 -0500
>> Eric Farman <farman@linux.ibm.com> wrote:
>>
>>> On 02/20/2019 05:35 PM, Eric Farman wrote:
>>>>
>>>>
>>>> On 02/20/2019 11:44 AM, Cornelia Huck wrote:
>>>>> On Wed, 20 Feb 2019 11:25:46 -0500
>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>     
>>>>>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
>>>>>>> On Wed, 20 Feb 2019 06:29:38 -0500
>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>>>>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>>>>> Hi Connie, Farhan,
>>>>>>>>>>
>>>>>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>>>>
>>>>>>>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>>>>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>>>>>>>> copying garbage data from guest memory.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>>>> Message-Id:
>>>>>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova,
>>>>>>>>>>> struct channel_program *cp)
>>>>>>>>>>>                   return -EOPNOTSUPP;
>>>>>>>>>>>               }
>>>>>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>>>>>>>> +        if (!ccw_is_chain(ccw))
>>>>>>>
>>>>>>> OK, this function now returns the length of the chain excluding the
>>>>>>> last tic.
>>>>>
>>>>> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
>>>>> where needed to what I'm seeing.
>>>>
>>>> I didn't get as far as I had hoped today, so I'm going to need to reset
>>>> to tomorrow, have coffee, and try again.  But it does seem that with
>>>> this patch, we calculate the length of the chain up to and including the
>>>> TIC, and nothing beyond it.
>>>>
>>>> That is, during the boot process we calculate a chain length of "3" for
>>>> a SEEK + SIDE + TIC that QEMU built.
>>
>> Hm... so it looks like that code does what it says on the tin. But why
>> are we missing the second round, that should give us a chain with the
>> forth ccw? Are we missing a loop somewhere?
> 
> One thing where this is different now that we don't count the last READ
> is that tic_target_chain_exists() won't return 1 if we tic to the READ,
> as it has not been copied. So we execute code that presumably had not
> been executed before...
> 

Coming back to these later, when coffee has kicked in.

>>
>>>>      
>>>>>     
>>>>>>>>>>>                   break;
>>>>>>>>>>>               ccw++;
>>>>>>>
>>>>>>> Now, cp_init will not copy the last tic to the chain. When it then
>>>>>>> looks for tics in that new chain, it won't find any, and stop copying.
>>>>> (...)
>>>>>>>>>> I'm not going to claim to know exactly how this is failing,
>>>>>>>>>> because it's too late for coffee and reading IPL records is
>>>>>>>>>> difficult enough.  But in the working case (without this patch),
>>>>>>>>>> cp_init() calls ccwchain_calc_length() and gets a count of four
>>>>>>>>>> CCWs (SEEK + SIDE + TIC
>>>>>>>>>> + READ), which is broken up into two chains.  One for the
>>>>>>>>>> SEEK/SIDE/TIC, and another for the READ.  In the failing case
>>>>>>>>>> (with this patch), cp_init() only counts three CCWs, we never read
>>>>>>>>>> the IPL2 address (the second chain in the working case), and our
>>>>>>>>>> boot fails.
>>>>>>>
>>>>>>> I think "we don't copy the tic to the chain where we search for tics"
>>>>>>> fits those symptoms.
>>>>>>
>>>>>> Indeed.  Hrm...
>>>>> (...)
>>>>>>> See above. We may need two chains: One without the trailing tic, and
>>>>>>> one to process to see where that tic points to... or rework the way to
>>>>>>> follow the tic? Not sure.
>>>>>>
>>>>>> I hate the idea of two chains, because how much storage do we end up
>>>>>> needing to consume?  But maybe it's inevitable, or maybe I can think up
>>>>>> a way to rework it without.
>>>>>
>>>>> Yeah, that sucks. If we really need to track tics separately (and I'm
>>>>> not so sure, see above), there should be a better way for that...
>>>>>
>>>>> (...)
>>>>>     
>>>>>>> I'm wondering whether we should keep this patch and fix on top of it,
>>>>>>> or revert it for now... I'm not sure tic processing is working at all
>>>>>>> right now.
>>>>>>
>>>>>> I've been wondering this too.  Yeah, the bios code isn't merged yet, but
>>>>>> this patch means there's no point in merging it until we get TIC fixed
>>>>>> properly.  Which I'd hope is 5.1, but I'm not sure how deep this bowl of
>>>>>> spaghetti is.
>>>>>
>>>>> We also need to make sure that this is not a bug in the bios instead...
>>>>
>>>> Of course.  I don't see it yet, but I haven't looked at the code very
>>>> closely.  From what he wrote in $qemusrc/docs/devel, it seems okay and
>>>> the CCWs I see showing up in vfio-ccw seem to match with it.  We just
>>>> have trouble distinguishing between a "forward" versus "backward" TIC.
>>>
>>> I just replied to patch 15 of Jason's bios series, with a small fix.  I
>>> didn't pay particularly close attention at the time, but the original
>>> error I sent here was an Incorrect Length, and both the SEEK and SIDE
>>> had a count of 8 bytes.  Fixing that converts the error to a program
>>> check, and the cpa in the irb points to where the read would be.  So
>>> this gets things to a more accurate "do we count/copy the TIC when
>>> handling our inputs" scenario as discussed above.
>>
>> Your fix in the bios code looks correct.
>>
>> By "program check" you mean "channel program check", right? 

Yes

Do you know
>> which ccw triggers that?
> 
> Also would be interesting when it breaks. Is it when we want to execute
> the READ, or does it break on a SEEK?
> 

I think we failed trying to issue the READ.  But because we didn't count 
it in cp_init(), we didn't build it as part of the channel program 
vfio-ccw issued.  So it's actually pointing to nothing/garbage data.

Certainly the data in the irb suggests it's not unhappy with the SEEK/SIDE.
Halil Pasic Feb. 21, 2019, 12:55 p.m. UTC | #16
On Thu, 21 Feb 2019 10:30:01 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 20 Feb 2019 22:02:10 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
> > On 02/20/2019 05:35 PM, Eric Farman wrote:
> > > 
> > > 
> > > On 02/20/2019 11:44 AM, Cornelia Huck wrote:  
> > >> On Wed, 20 Feb 2019 11:25:46 -0500
> > >> Eric Farman <farman@linux.ibm.com> wrote:
> > >>  
> > >>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:  
> > >>>> On Wed, 20 Feb 2019 06:29:38 -0500
> > >>>> Eric Farman <farman@linux.ibm.com> wrote:  
> > >>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:  
> > >>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
> > >>>>>> Eric Farman <farman@linux.ibm.com> wrote:  
> > >>>>>>> Hi Connie, Farhan,
> > >>>>>>>
> > >>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:  
> > >>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
> > >>>>>>>>
> > >>>>>>>> When trying to calculate the length of a ccw chain, we assume
> > >>>>>>>> there are ccws after a TIC. This can lead to overcounting and
> > >>>>>>>> copying garbage data from guest memory.
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> > >>>>>>>> Message-Id: 
> > >>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com> 
> > >>>>>>>>
> > >>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> > >>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > >>>>>>>> ---
> > >>>>>>>>      drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> > >>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>>>>>
> > >>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c 
> > >>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
> > >>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
> > >>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> > >>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > >>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova, 
> > >>>>>>>> struct channel_program *cp)
> > >>>>>>>>                  return -EOPNOTSUPP;
> > >>>>>>>>              }
> > >>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> > >>>>>>>> +        if (!ccw_is_chain(ccw))  
> > >>>>
> > >>>> OK, this function now returns the length of the chain excluding the
> > >>>> last tic.  
> > >>
> > >> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
> > >> where needed to what I'm seeing.  
> > > 
> > > I didn't get as far as I had hoped today, so I'm going to need to reset 
> > > to tomorrow, have coffee, and try again.  But it does seem that with 
> > > this patch, we calculate the length of the chain up to and including the 
> > > TIC, and nothing beyond it.
> > > 
> > > That is, during the boot process we calculate a chain length of "3" for 
> > > a SEEK + SIDE + TIC that QEMU built.
> 
> Hm... so it looks like that code does what it says on the tin. But why
> are we missing the second round, that should give us a chain with the
> forth ccw? Are we missing a loop somewhere?


Have you seen this email of mine:
https://www.spinics.net/lists/kvm/msg182535.html 
(MID: Message-Id: <20190220143729.6dda5665@oc2783563651>)?

Eric told me the TIC points to the SIDE. I.e the channel program says
loop until SIDE presents a status modifier the jump over the TIC and
continue with the READ.

The point is the fourth ccw is only reachable if we jump over the TIC
because of status modifier (from device). Currently we do not seem to
care about this 'jump over a ccw' possibility. IMHO that is where this
problem comes from.

Regards,
Halil
Eric Farman Feb. 21, 2019, 3:38 p.m. UTC | #17
On 02/21/2019 07:55 AM, Halil Pasic wrote:
> On Thu, 21 Feb 2019 10:30:01 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Wed, 20 Feb 2019 22:02:10 -0500
>> Eric Farman <farman@linux.ibm.com> wrote:
>>
>>> On 02/20/2019 05:35 PM, Eric Farman wrote:
>>>>
>>>>
>>>> On 02/20/2019 11:44 AM, Cornelia Huck wrote:
>>>>> On Wed, 20 Feb 2019 11:25:46 -0500
>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>   
>>>>>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
>>>>>>> On Wed, 20 Feb 2019 06:29:38 -0500
>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>>>>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>>>>> Hi Connie, Farhan,
>>>>>>>>>>
>>>>>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>>>>
>>>>>>>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>>>>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>>>>>>>> copying garbage data from guest memory.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>>>> Message-Id:
>>>>>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com>
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova,
>>>>>>>>>>> struct channel_program *cp)
>>>>>>>>>>>                   return -EOPNOTSUPP;
>>>>>>>>>>>               }
>>>>>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>>>>>>>> +        if (!ccw_is_chain(ccw))
>>>>>>>
>>>>>>> OK, this function now returns the length of the chain excluding the
>>>>>>> last tic.
>>>>>
>>>>> Hm, not so sure about that anymore. I'm a bit tired, please apply salt
>>>>> where needed to what I'm seeing.
>>>>
>>>> I didn't get as far as I had hoped today, so I'm going to need to reset
>>>> to tomorrow, have coffee, and try again.  But it does seem that with
>>>> this patch, we calculate the length of the chain up to and including the
>>>> TIC, and nothing beyond it.
>>>>
>>>> That is, during the boot process we calculate a chain length of "3" for
>>>> a SEEK + SIDE + TIC that QEMU built.
>>
>> Hm... so it looks like that code does what it says on the tin. But why
>> are we missing the second round, that should give us a chain with the
>> forth ccw? Are we missing a loop somewhere?
> 

I think this is close...  More below.

> 
> Have you seen this email of mine:
> https://www.spinics.net/lists/kvm/msg182535.html
> (MID: Message-Id: <20190220143729.6dda5665@oc2783563651>)?
> 
> Eric told me the TIC points to the SIDE.

I thought I put that in one of my responses, but regardless it's what 
the bios code builds so it's not new news.

  I.e the channel program says
> loop until SIDE presents a status modifier the jump over the TIC and
> continue with the READ.

I needed to get past the incorrect length error before I could give 
Halil's email any consideration.  Now that that is cleared up, we can 
look into his idea.

> 
> The point is the fourth ccw is only reachable if we jump over the TIC
> because of status modifier (from device). Currently we do not seem to
> care about this 'jump over a ccw' possibility. IMHO that is where this
> problem comes from.

Well, sort of...  We probably care about the "jump over a CCW" 
possibility if the TIC introduces any recursion.  If it goes "somewhere 
new", then the likelihood of the Status Modifier coming into play is low.

And besides, in this scenario we don't know about a potential Status 
Modifier until the SIDE runs, which means while we're building the 
channel program we don't know whether the memory after the TIC is 
interesting or not.  Reverting this patch makes the "recursion" scenario 
(which the bios introduces, and thus brings about the importance of the 
Status Modifier) work again, but then we have difficulties when the TIC 
branches somewhere else.  (See the use of NOP+TIC chains in error recovery.)

If you're suggesting adding code to handle a Status Modifier, I would 
ask why the interrupt handler should redrive things after the TIC, when 
it would be simpler to detect the TIC recursion and planning for the 
possibility there.  That would seem simpler than leaving a window open 
between (for example) orientation and data transfer.

The difficulty, as Connie pointed out, is that we're still building the 
current chain here, which implies we can't rely on 
tic_target_chain_exists() to determine whether we are dealing with a TIC 
loop or not.  I had some ideas this morning during the commute that 
might apply here, and warrant some experimentation.  Let me cobble this 
together and see what comes from it.

  - Eric

> 
> Regards,
> Halil
>
Eric Farman Feb. 21, 2019, 5:10 p.m. UTC | #18
On 02/21/2019 10:38 AM, Eric Farman wrote:
> 
> 
> On 02/21/2019 07:55 AM, Halil Pasic wrote:
>> On Thu, 21 Feb 2019 10:30:01 +0100
>> Cornelia Huck <cohuck@redhat.com> wrote:
>>
>>> On Wed, 20 Feb 2019 22:02:10 -0500
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>
>>>> On 02/20/2019 05:35 PM, Eric Farman wrote:
>>>>>
>>>>>
>>>>> On 02/20/2019 11:44 AM, Cornelia Huck wrote:
>>>>>> On Wed, 20 Feb 2019 11:25:46 -0500
>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
>>>>>>>> On Wed, 20 Feb 2019 06:29:38 -0500
>>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
>>>>>>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
>>>>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>>>>>>> Hi Connie, Farhan,
>>>>>>>>>>>
>>>>>>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
>>>>>>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>>>>>
>>>>>>>>>>>> When trying to calculate the length of a ccw chain, we assume
>>>>>>>>>>>> there are ccws after a TIC. This can lead to overcounting and
>>>>>>>>>>>> copying garbage data from guest memory.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
>>>>>>>>>>>> Message-Id:
>>>>>>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com> 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
>>>>>>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       drivers/s390/cio/vfio_ccw_cp.c | 2 +-
>>>>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
>>>>>>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>>>>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova,
>>>>>>>>>>>> struct channel_program *cp)
>>>>>>>>>>>>                   return -EOPNOTSUPP;
>>>>>>>>>>>>               }
>>>>>>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
>>>>>>>>>>>> +        if (!ccw_is_chain(ccw))
>>>>>>>>
>>>>>>>> OK, this function now returns the length of the chain excluding the
>>>>>>>> last tic.
>>>>>>
>>>>>> Hm, not so sure about that anymore. I'm a bit tired, please apply 
>>>>>> salt
>>>>>> where needed to what I'm seeing.
>>>>>
>>>>> I didn't get as far as I had hoped today, so I'm going to need to 
>>>>> reset
>>>>> to tomorrow, have coffee, and try again.  But it does seem that with
>>>>> this patch, we calculate the length of the chain up to and 
>>>>> including the
>>>>> TIC, and nothing beyond it.
>>>>>
>>>>> That is, during the boot process we calculate a chain length of "3" 
>>>>> for
>>>>> a SEEK + SIDE + TIC that QEMU built.
>>>
>>> Hm... so it looks like that code does what it says on the tin. But why
>>> are we missing the second round, that should give us a chain with the
>>> forth ccw? Are we missing a loop somewhere?
>>
> 
> I think this is close...  More below.
> 
>>
>> Have you seen this email of mine:
>> https://www.spinics.net/lists/kvm/msg182535.html
>> (MID: Message-Id: <20190220143729.6dda5665@oc2783563651>)?
>>
>> Eric told me the TIC points to the SIDE.
> 
> I thought I put that in one of my responses, but regardless it's what 
> the bios code builds so it's not new news.
> 
>   I.e the channel program says
>> loop until SIDE presents a status modifier the jump over the TIC and
>> continue with the READ.
> 
> I needed to get past the incorrect length error before I could give 
> Halil's email any consideration.  Now that that is cleared up, we can 
> look into his idea.
> 
>>
>> The point is the fourth ccw is only reachable if we jump over the TIC
>> because of status modifier (from device). Currently we do not seem to
>> care about this 'jump over a ccw' possibility. IMHO that is where this
>> problem comes from.
> 
> Well, sort of...  We probably care about the "jump over a CCW" 
> possibility if the TIC introduces any recursion.  If it goes "somewhere 
> new", then the likelihood of the Status Modifier coming into play is low.
> 
> And besides, in this scenario we don't know about a potential Status 
> Modifier until the SIDE runs, which means while we're building the 
> channel program we don't know whether the memory after the TIC is 
> interesting or not.  Reverting this patch makes the "recursion" scenario 
> (which the bios introduces, and thus brings about the importance of the 
> Status Modifier) work again, but then we have difficulties when the TIC 
> branches somewhere else.  (See the use of NOP+TIC chains in error 
> recovery.)
> 
> If you're suggesting adding code to handle a Status Modifier, I would 
> ask why the interrupt handler should redrive things after the TIC, when 
> it would be simpler to detect the TIC recursion and planning for the 
> possibility there.  That would seem simpler than leaving a window open 
> between (for example) orientation and data transfer.
> 
> The difficulty, as Connie pointed out, is that we're still building the 
> current chain here, which implies we can't rely on 
> tic_target_chain_exists() to determine whether we are dealing with a TIC 
> loop or not.  I had some ideas this morning during the commute that 
> might apply here, and warrant some experimentation.  Let me cobble this 
> together and see what comes from it.

Well, taking the TIC-boundary logic from tic_target_chain_exists() and 
stuffing it into ccwchain_calc_length() seems to do the trick for seeing 
if we're sending a TIC back into our current chain, and makes the dasd 
boot work again (the following is built on top of this patch, and is 
surely whitespace damaged):

diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index ba08fe137c2e..a4abe7519b7e 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -362,6 +362,7 @@ static int ccwchain_calc_length(u64 iova, struct 
channel_program *cp)
  {
         struct ccw1 *ccw, *p;
         int cnt;
+       u64 tail;

         /*
          * Copy current chain from guest to host kernel.
@@ -392,7 +393,9 @@ static int ccwchain_calc_length(u64 iova, struct 
channel_program *cp)
                         return -EOPNOTSUPP;
                 }

-               if (!ccw_is_chain(ccw))
+               tail = iova + (cnt - 1) * sizeof(struct ccw1);
+
+               if (!(ccw_is_chain(ccw) || (ccw_is_tic(ccw) && iova <= 
ccw->cda && ccw->cda <= tail)))
                         break;

                 ccw++;

I need to spend some more time looking over whether we should look at 
other chains by also calling tic_target_chain_exists(), and of course 
see what happens to this when we run the original I/O exercisers that 
led Farhan to the original patch.  But it looks promising.  Thoughts?

  - Eric
Halil Pasic Feb. 21, 2019, 7:09 p.m. UTC | #19
On Thu, 21 Feb 2019 12:10:32 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> 
> 
> On 02/21/2019 10:38 AM, Eric Farman wrote:
> > 
> > 
> > On 02/21/2019 07:55 AM, Halil Pasic wrote:
> >> On Thu, 21 Feb 2019 10:30:01 +0100
> >> Cornelia Huck <cohuck@redhat.com> wrote:
> >>
> >>> On Wed, 20 Feb 2019 22:02:10 -0500
> >>> Eric Farman <farman@linux.ibm.com> wrote:
> >>>
> >>>> On 02/20/2019 05:35 PM, Eric Farman wrote:
> >>>>>
> >>>>>
> >>>>> On 02/20/2019 11:44 AM, Cornelia Huck wrote:
> >>>>>> On Wed, 20 Feb 2019 11:25:46 -0500
> >>>>>> Eric Farman <farman@linux.ibm.com> wrote:
> >>>>>>> On 02/20/2019 07:44 AM, Cornelia Huck wrote:
> >>>>>>>> On Wed, 20 Feb 2019 06:29:38 -0500
> >>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
> >>>>>>>>> On 02/20/2019 04:48 AM, Cornelia Huck wrote:
> >>>>>>>>>> On Tue, 19 Feb 2019 21:49:07 -0500
> >>>>>>>>>> Eric Farman <farman@linux.ibm.com> wrote:
> >>>>>>>>>>> Hi Connie, Farhan,
> >>>>>>>>>>>
> >>>>>>>>>>> On 02/04/2019 12:06 PM, Cornelia Huck wrote:
> >>>>>>>>>>>> From: Farhan Ali <alifm@linux.ibm.com>
> >>>>>>>>>>>>
> >>>>>>>>>>>> When trying to calculate the length of a ccw chain, we assume
> >>>>>>>>>>>> there are ccws after a TIC. This can lead to overcounting and
> >>>>>>>>>>>> copying garbage data from guest memory.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
> >>>>>>>>>>>> Message-Id:
> >>>>>>>>>>>> <d63748c1f1b03147bcbf401596638627a5e35ef7.1548082107.git.alifm@linux.ibm.com> 
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
> >>>>>>>>>>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>       drivers/s390/cio/vfio_ccw_cp.c | 2 +-
> >>>>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>>>>>> b/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>>>>>> index 70a006ba4d05..ba08fe137c2e 100644
> >>>>>>>>>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>>>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> >>>>>>>>>>>> @@ -392,7 +392,7 @@ static int ccwchain_calc_length(u64 iova,
> >>>>>>>>>>>> struct channel_program *cp)
> >>>>>>>>>>>>                   return -EOPNOTSUPP;
> >>>>>>>>>>>>               }
> >>>>>>>>>>>> -        if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
> >>>>>>>>>>>> +        if (!ccw_is_chain(ccw))
> >>>>>>>>
> >>>>>>>> OK, this function now returns the length of the chain excluding the
> >>>>>>>> last tic.
> >>>>>>
> >>>>>> Hm, not so sure about that anymore. I'm a bit tired, please apply 
> >>>>>> salt
> >>>>>> where needed to what I'm seeing.
> >>>>>
> >>>>> I didn't get as far as I had hoped today, so I'm going to need to 
> >>>>> reset
> >>>>> to tomorrow, have coffee, and try again.  But it does seem that with
> >>>>> this patch, we calculate the length of the chain up to and 
> >>>>> including the
> >>>>> TIC, and nothing beyond it.
> >>>>>
> >>>>> That is, during the boot process we calculate a chain length of "3" 
> >>>>> for
> >>>>> a SEEK + SIDE + TIC that QEMU built.
> >>>
> >>> Hm... so it looks like that code does what it says on the tin. But why
> >>> are we missing the second round, that should give us a chain with the
> >>> forth ccw? Are we missing a loop somewhere?
> >>
> > 
> > I think this is close...  More below.
> > 
> >>
> >> Have you seen this email of mine:
> >> https://www.spinics.net/lists/kvm/msg182535.html
> >> (MID: Message-Id: <20190220143729.6dda5665@oc2783563651>)?
> >>
> >> Eric told me the TIC points to the SIDE.
> > 
> > I thought I put that in one of my responses, but regardless it's what 
> > the bios code builds so it's not new news.
> > 
> >   I.e the channel program says
> >> loop until SIDE presents a status modifier the jump over the TIC and
> >> continue with the READ.
> > 
> > I needed to get past the incorrect length error before I could give 
> > Halil's email any consideration.  Now that that is cleared up, we can 
> > look into his idea.
> > 
> >>
> >> The point is the fourth ccw is only reachable if we jump over the TIC
> >> because of status modifier (from device). Currently we do not seem to
> >> care about this 'jump over a ccw' possibility. IMHO that is where this
> >> problem comes from.
> > 
> > Well, sort of...  We probably care about the "jump over a CCW" 
> > possibility if the TIC introduces any recursion.  If it goes "somewhere 
> > new", then the likelihood of the Status Modifier coming into play is low.
> > 
> > And besides, in this scenario we don't know about a potential Status 
> > Modifier until the SIDE runs, which means while we're building the 
> > channel program we don't know whether the memory after the TIC is 
> > interesting or not.  Reverting this patch makes the "recursion" scenario 
> > (which the bios introduces, and thus brings about the importance of the 
> > Status Modifier) work again, but then we have difficulties when the TIC 
> > branches somewhere else.  (See the use of NOP+TIC chains in error 
> > recovery.)
> > 
> > If you're suggesting adding code to handle a Status Modifier, I would 
> > ask why the interrupt handler should redrive things after the TIC, when 
> > it would be simpler to detect the TIC recursion and planning for the 
> > possibility there.  That would seem simpler than leaving a window open 
> > between (for example) orientation and data transfer.
> > 
> > The difficulty, as Connie pointed out, is that we're still building the 
> > current chain here, which implies we can't rely on 
> > tic_target_chain_exists() to determine whether we are dealing with a TIC 
> > loop or not.  I had some ideas this morning during the commute that 
> > might apply here, and warrant some experimentation.  Let me cobble this 
> > together and see what comes from it.
> 
> Well, taking the TIC-boundary logic from tic_target_chain_exists() and 
> stuffing it into ccwchain_calc_length() seems to do the trick for seeing 
> if we're sending a TIC back into our current chain, and makes the dasd 
> boot work again (the following is built on top of this patch, and is 
> surely whitespace damaged):
> 
> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> index ba08fe137c2e..a4abe7519b7e 100644
> --- a/drivers/s390/cio/vfio_ccw_cp.c
> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> @@ -362,6 +362,7 @@ static int ccwchain_calc_length(u64 iova, struct 
> channel_program *cp)
>   {
>          struct ccw1 *ccw, *p;
>          int cnt;
> +       u64 tail;
> 
>          /*
>           * Copy current chain from guest to host kernel.
> @@ -392,7 +393,9 @@ static int ccwchain_calc_length(u64 iova, struct 
> channel_program *cp)
>                          return -EOPNOTSUPP;
>                  }
> 
> -               if (!ccw_is_chain(ccw))
> +               tail = iova + (cnt - 1) * sizeof(struct ccw1);
> +
> +               if (!(ccw_is_chain(ccw) || (ccw_is_tic(ccw) && iova <= 
> ccw->cda && ccw->cda <= tail)))

After De Morgan this looks like


!ccw_is_chain(ccw)) && !(ccw_is_tic(ccw) && tic_points_in_this_chain())

which only differs form the state without the Farhan fix by not carrying
on for tic's that point outside of this chain.

Would certainly catch the problem you discovered.

Of course there is no guarantee that this covers all the theoretically
possible cases where we could skip over a ccw that would terminate a chain
(tic or not chaining) if there was no status modifier. If you like I can
construct an example where this heuristic fails (at least I think so).

If this is the direction we are going to take I think "vfio-ccw: Don't
assume there are more ccws after a TIC" should be superseded by this. And
we need a comment that explains the second part of the expression.

Regards,
Halil

>                          break;
> 
>                  ccw++;
> 
> I need to spend some more time looking over whether we should look at 
> other chains by also calling tic_target_chain_exists(), and of course 
> see what happens to this when we run the original I/O exercisers that 
> led Farhan to the original patch.  But it looks promising.  Thoughts?
> 
>   - Eric
>
Cornelia Huck Feb. 22, 2019, 8:27 a.m. UTC | #20
On Thu, 21 Feb 2019 20:09:16 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Thu, 21 Feb 2019 12:10:32 -0500
> Eric Farman <farman@linux.ibm.com> wrote:

> > Well, taking the TIC-boundary logic from tic_target_chain_exists() and 
> > stuffing it into ccwchain_calc_length() seems to do the trick for seeing 
> > if we're sending a TIC back into our current chain, and makes the dasd 
> > boot work again (the following is built on top of this patch, and is 
> > surely whitespace damaged):
> > 
> > diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> > index ba08fe137c2e..a4abe7519b7e 100644
> > --- a/drivers/s390/cio/vfio_ccw_cp.c
> > +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > @@ -362,6 +362,7 @@ static int ccwchain_calc_length(u64 iova, struct 
> > channel_program *cp)
> >   {
> >          struct ccw1 *ccw, *p;
> >          int cnt;
> > +       u64 tail;
> > 
> >          /*
> >           * Copy current chain from guest to host kernel.
> > @@ -392,7 +393,9 @@ static int ccwchain_calc_length(u64 iova, struct 
> > channel_program *cp)
> >                          return -EOPNOTSUPP;
> >                  }
> > 
> > -               if (!ccw_is_chain(ccw))
> > +               tail = iova + (cnt - 1) * sizeof(struct ccw1);
> > +
> > +               if (!(ccw_is_chain(ccw) || (ccw_is_tic(ccw) && iova <= 
> > ccw->cda && ccw->cda <= tail)))  
> 
> After De Morgan this looks like
> 
> 
> !ccw_is_chain(ccw)) && !(ccw_is_tic(ccw) && tic_points_in_this_chain())
> 
> which only differs form the state without the Farhan fix by not carrying
> on for tic's that point outside of this chain.
> 
> Would certainly catch the problem you discovered.
> 
> Of course there is no guarantee that this covers all the theoretically
> possible cases where we could skip over a ccw that would terminate a chain
> (tic or not chaining) if there was no status modifier. If you like I can
> construct an example where this heuristic fails (at least I think so).
> 
> If this is the direction we are going to take I think "vfio-ccw: Don't
> assume there are more ccws after a TIC" should be superseded by this. And
> we need a comment that explains the second part of the expression.

As that patch is already queued, I think we should do that change with
'fixes' on top.

There are probably more corner cases that we're missing, but this fixes
a problem that can be hit today (well, with some posted patches...), so
I'd go ahead and queue it and go through the code and maybe fix/rework
on top.

Did any of you do any stress testing of that change yet?
Eric Farman Feb. 22, 2019, 2:56 p.m. UTC | #21
On 02/22/2019 03:27 AM, Cornelia Huck wrote:
> On Thu, 21 Feb 2019 20:09:16 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>> On Thu, 21 Feb 2019 12:10:32 -0500
>> Eric Farman <farman@linux.ibm.com> wrote:
> 
>>> Well, taking the TIC-boundary logic from tic_target_chain_exists() and
>>> stuffing it into ccwchain_calc_length() seems to do the trick for seeing
>>> if we're sending a TIC back into our current chain, and makes the dasd
>>> boot work again (the following is built on top of this patch, and is
>>> surely whitespace damaged):
>>>
>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>>> index ba08fe137c2e..a4abe7519b7e 100644
>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>> @@ -362,6 +362,7 @@ static int ccwchain_calc_length(u64 iova, struct
>>> channel_program *cp)
>>>    {
>>>           struct ccw1 *ccw, *p;
>>>           int cnt;
>>> +       u64 tail;
>>>
>>>           /*
>>>            * Copy current chain from guest to host kernel.
>>> @@ -392,7 +393,9 @@ static int ccwchain_calc_length(u64 iova, struct
>>> channel_program *cp)
>>>                           return -EOPNOTSUPP;
>>>                   }
>>>
>>> -               if (!ccw_is_chain(ccw))
>>> +               tail = iova + (cnt - 1) * sizeof(struct ccw1);
>>> +
>>> +               if (!(ccw_is_chain(ccw) || (ccw_is_tic(ccw) && iova <=
>>> ccw->cda && ccw->cda <= tail)))
>>
>> After De Morgan this looks like
>>
>>
>> !ccw_is_chain(ccw)) && !(ccw_is_tic(ccw) && tic_points_in_this_chain())
>>
>> which only differs form the state without the Farhan fix by not carrying
>> on for tic's that point outside of this chain.
>>
>> Would certainly catch the problem you discovered.
>>
>> Of course there is no guarantee that this covers all the theoretically
>> possible cases where we could skip over a ccw that would terminate a chain
>> (tic or not chaining) if there was no status modifier. If you like I can
>> construct an example where this heuristic fails (at least I think so).
>>
>> If this is the direction we are going to take I think "vfio-ccw: Don't
>> assume there are more ccws after a TIC" should be superseded by this. And
>> we need a comment that explains the second part of the expression.
> 
> As that patch is already queued, I think we should do that change with
> 'fixes' on top.
> 
> There are probably more corner cases that we're missing, but this fixes
> a problem that can be hit today (well, with some posted patches...), so
> I'd go ahead and queue it and go through the code and maybe fix/rework
> on top.

Sounds good.  I will get that posted so we can look it over with fresh eyes.

> 
> Did any of you do any stress testing of that change yet?
> 

I've been running fio since yesterday after my last email (and 
subsequent lunch), and haven't seen any further errors.  It's still 
running today, and knock on wood it'll keep going nicely.

It just occurred to me that I was trying to run a different set of tests 
when I bumped into this.  I think they are a different set of disks, so 
I should go back and try that while fio runs.

  - Eric
Farhan Ali Feb. 22, 2019, 5:02 p.m. UTC | #22
On 02/22/2019 09:56 AM, Eric Farman wrote:
> 
> 
> On 02/22/2019 03:27 AM, Cornelia Huck wrote:
>> On Thu, 21 Feb 2019 20:09:16 +0100
>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> On Thu, 21 Feb 2019 12:10:32 -0500
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>
>>>> Well, taking the TIC-boundary logic from tic_target_chain_exists() and
>>>> stuffing it into ccwchain_calc_length() seems to do the trick for 
>>>> seeing
>>>> if we're sending a TIC back into our current chain, and makes the dasd
>>>> boot work again (the following is built on top of this patch, and is
>>>> surely whitespace damaged):
>>>>
>>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c 
>>>> b/drivers/s390/cio/vfio_ccw_cp.c
>>>> index ba08fe137c2e..a4abe7519b7e 100644
>>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>>> @@ -362,6 +362,7 @@ static int ccwchain_calc_length(u64 iova, struct
>>>> channel_program *cp)
>>>>    {
>>>>           struct ccw1 *ccw, *p;
>>>>           int cnt;
>>>> +       u64 tail;
>>>>
>>>>           /*
>>>>            * Copy current chain from guest to host kernel.
>>>> @@ -392,7 +393,9 @@ static int ccwchain_calc_length(u64 iova, struct
>>>> channel_program *cp)
>>>>                           return -EOPNOTSUPP;
>>>>                   }
>>>>
>>>> -               if (!ccw_is_chain(ccw))
>>>> +               tail = iova + (cnt - 1) * sizeof(struct ccw1);
>>>> +
>>>> +               if (!(ccw_is_chain(ccw) || (ccw_is_tic(ccw) && iova <=
>>>> ccw->cda && ccw->cda <= tail)))
>>>
>>> After De Morgan this looks like
>>>
>>>
>>> !ccw_is_chain(ccw)) && !(ccw_is_tic(ccw) && tic_points_in_this_chain())
>>>
>>> which only differs form the state without the Farhan fix by not carrying
>>> on for tic's that point outside of this chain.
>>>
>>> Would certainly catch the problem you discovered.
>>>
>>> Of course there is no guarantee that this covers all the theoretically
>>> possible cases where we could skip over a ccw that would terminate a 
>>> chain
>>> (tic or not chaining) if there was no status modifier. If you like I can
>>> construct an example where this heuristic fails (at least I think so).
>>>
>>> If this is the direction we are going to take I think "vfio-ccw: Don't
>>> assume there are more ccws after a TIC" should be superseded by this. 
>>> And
>>> we need a comment that explains the second part of the expression.
>>
>> As that patch is already queued, I think we should do that change with
>> 'fixes' on top.
>>
>> There are probably more corner cases that we're missing, but this fixes
>> a problem that can be hit today (well, with some posted patches...), so
>> I'd go ahead and queue it and go through the code and maybe fix/rework
>> on top.
> 
> Sounds good.  I will get that posted so we can look it over with fresh 
> eyes.
> 

Just catching up with my email backlog. This fix will solve the DASD ipl 
issue and should supersede my patch.


>>
>> Did any of you do any stress testing of that change yet?
>>
> 
> I've been running fio since yesterday after my last email (and 
> subsequent lunch), and haven't seen any further errors.  It's still 
> running today, and knock on wood it'll keep going nicely.
> 
> It just occurred to me that I was trying to run a different set of tests 
> when I bumped into this.  I think they are a different set of disks, so 
> I should go back and try that while fio runs.
> 
>   - Eric
> 

I will try to test this patch as well.

Thanks
Farhan
diff mbox series

Patch

diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index 70a006ba4d05..ba08fe137c2e 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -392,7 +392,7 @@  static int ccwchain_calc_length(u64 iova, struct channel_program *cp)
 			return -EOPNOTSUPP;
 		}
 
-		if ((!ccw_is_chain(ccw)) && (!ccw_is_tic(ccw)))
+		if (!ccw_is_chain(ccw))
 			break;
 
 		ccw++;