mbox series

[RFC,v2,0/4] vfio-ccw: Fix interrupt handling for HALT/CLEAR

Message ID 20200513142934.28788-1-farman@linux.ibm.com (mailing list archive)
Headers show
Series vfio-ccw: Fix interrupt handling for HALT/CLEAR | expand

Message

Eric Farman May 13, 2020, 2:29 p.m. UTC
Hi Conny,

Back in January, I suggested a small patch [1] to try to clean up
the handling of HSCH/CSCH interrupts, especially as it relates to
concurrent SSCH interrupts. Here is a new attempt to address this.

There was some suggestion earlier about locking the FSM, but I'm not
seeing any problems with that. Rather, what I'm noticing is that the
flow between a synchronous START and asynchronous HALT/CLEAR have
different impacts on the FSM state. Consider:

    CPU 1                           CPU 2

    SSCH (set state=CP_PENDING)
    INTERRUPT (set state=IDLE)
    CSCH (no change in state)
                                    SSCH (set state=CP_PENDING)
    INTERRUPT (set state=IDLE)
                                    INTERRUPT (set state=IDLE)

The second interrupt on CPU 1 will call cp_free() for the START
created by CPU 2, and our results will be, um, messy. This
suggests that three things must be true:

 A) cp_free() must be called for either a final interrupt,
or a failure issuing a SSCH
 B) The FSM state must only be set to IDLE once cp_free()
has been called
 C) A SSCH cannot be issued while an interrupt is outstanding

It's not great that items B and C are separated in the interrupt
path by a mutex boundary where the copy into io_region occurs.
We could (and perhaps should) move them together, which would
improve things somewhat, but still doesn't address the scenario
laid out above. Even putting that work within the mutex window
during interrupt processing doesn't address things totally.

So, the patches here do two things. One to handle unsolicited
interrupts [2], which I think is pretty straightforward. Though,
it does make me want to do a more drastic rearranging of the
affected function. :)

The other warrants some discussion, since it's sort of weird.
Basically, recognize what commands we've issued, expect interrupts
for each, and prevent new SSCH's while asynchronous commands are
pending. It doesn't address interrupts from the HSCH/CSCH that
could be generated by the Channel Path Handling code [3] for an
offline CHPID yet, and it needs some TLC to make checkpatch happy.
But it's the best possibility I've seen thus far.

I used private->scsw for this because it's barely used for anything
else; at one point I had been considering removing it outright because
we have entirely too many SCSW's in this code. :) I could implement
this as three small flags in private, to simplify the code and avoid
confusion with the different fctl/actl flags in private vs schib.

It does make me wonder whether the CP_PENDING check before cp_free()
is still necessary, but that's a query for a later date.

Also, perhaps this could be accomplished with two new FSM states,
one for a HALT and one for a CLEAR. I put it aside because the
interrupt handler got a little hairy with that, but I'm still looking
at it in parallel to this discussion.

I look forward to hearing your thoughts...

[1] https://lore.kernel.org/kvm/20200124145455.51181-1-farman@linux.ibm.com/
[2] https://lore.kernel.org/kvm/9635c45f-4652-c837-d256-46f426737a5e@linux.ibm.com/
[3] https://lore.kernel.org/kvm/20200505122745.53208-1-farman@linux.ibm.com/

Eric Farman (4):
  vfio-ccw: Do not reset FSM state for unsolicited interrupts
  vfio-ccw: Utilize scsw actl to serialize start operations
  vfio-ccw: Expand SCSW usage to HALT and CLEAR
  vfio-ccw: Clean up how to react to a failed START

 drivers/s390/cio/vfio_ccw_drv.c | 10 +++++++++-
 drivers/s390/cio/vfio_ccw_fsm.c | 26 ++++++++++++++++++++++++++
 drivers/s390/cio/vfio_ccw_ops.c |  3 +--
 3 files changed, 36 insertions(+), 3 deletions(-)

Comments

Halil Pasic May 14, 2020, 1:46 p.m. UTC | #1
On Wed, 13 May 2020 16:29:30 +0200
Eric Farman <farman@linux.ibm.com> wrote:

> Hi Conny,
> 
> Back in January, I suggested a small patch [1] to try to clean up
> the handling of HSCH/CSCH interrupts, especially as it relates to
> concurrent SSCH interrupts. Here is a new attempt to address this.
> 
> There was some suggestion earlier about locking the FSM, but I'm not
> seeing any problems with that. Rather, what I'm noticing is that the
> flow between a synchronous START and asynchronous HALT/CLEAR have
> different impacts on the FSM state. Consider:
> 
>     CPU 1                           CPU 2
> 
>     SSCH (set state=CP_PENDING)
>     INTERRUPT (set state=IDLE)
>     CSCH (no change in state)
>                                     SSCH (set state=CP_PENDING)
>     INTERRUPT (set state=IDLE)
>                                     INTERRUPT (set state=IDLE)

How does the PoP view of the composite device look like 
(where composite device = vfio-ccw + host device)?

I suppose at the moment where we accept the CSCH the FC bit
indicated clear function (19) goes to set. When this happens
there are 2 possibilities: either the start (17) bit is set,
or it is not. You describe a scenario where the start bit is
not set. In that case we don't have a channel program that
is currently being processed, and any SSCH must bounce right
off without doing anything (with cc 2) as long as FC clear
is set. Please note that we are still talking about the composite
device here.

Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
happen is already wrong. We should have rejected to look at the
channel program in the first place. Because as far as I can tell
for the composite device the FC clear bit remains set until we
process the last interrupt on the CPU 1 side in your scenario. Or
am I wrong?

AFAIR I was preaching about this several times, but could never
convince the people that 'let the host ccw device sort out
concurrency' is not the way this should work.

Maybe I have got a hole in my argument somewhere. If my argument
is wrong, please do point out why!

> 
> The second interrupt on CPU 1 will call cp_free() for the START
> created by CPU 2, and our results will be, um, messy. This
> suggests that three things must be true:
> 
>  A) cp_free() must be called for either a final interrupt,
> or a failure issuing a SSCH
>  B) The FSM state must only be set to IDLE once cp_free()
> has been called
>  C) A SSCH cannot be issued while an interrupt is outstanding

So you propose to reject SSCH in the IDLE state (if an interrupt
is outstanding)? That is one silly IDLE state and FSM for sure.

> 
> It's not great that items B and C are separated in the interrupt
> path by a mutex boundary where the copy into io_region occurs.
> We could (and perhaps should) move them together, which would
> improve things somewhat, but still doesn't address the scenario
> laid out above. Even putting that work within the mutex window
> during interrupt processing doesn't address things totally.
> 
> So, the patches here do two things. One to handle unsolicited
> interrupts [2], which I think is pretty straightforward. Though,
> it does make me want to do a more drastic rearranging of the
> affected function. :)
> 
> The other warrants some discussion, since it's sort of weird.
> Basically, recognize what commands we've issued, expect interrupts
> for each, and prevent new SSCH's while asynchronous commands are
> pending. It doesn't address interrupts from the HSCH/CSCH that
> could be generated by the Channel Path Handling code [3] for an
> offline CHPID yet, and it needs some TLC to make checkpatch happy.
> But it's the best possibility I've seen thus far.
> 
> I used private->scsw for this because it's barely used for anything
> else; at one point I had been considering removing it outright because
> we have entirely too many SCSW's in this code. :) I could implement
> this as three small flags in private, to simplify the code and avoid
> confusion with the different fctl/actl flags in private vs schib.
> 

Implementing the FSM described in the PoP (which in turn
conceptually relies on atomic operations on the FC bits) is IMHO
the way to go. But we can track that info in our FSM states. In
fact our FSM states should just add sub-partitioning of states to
those states (if necessary).


> It does make me wonder whether the CP_PENDING check before cp_free()
> is still necessary, but that's a query for a later date.
> 
> Also, perhaps this could be accomplished with two new FSM states,
> one for a HALT and one for a CLEAR. I put it aside because the
> interrupt handler got a little hairy with that, but I'm still looking
> at it in parallel to this discussion.
> 

You don't necessarily need 2 new states. Just one that corresponds
to FC clear function will do.

> I look forward to hearing your thoughts...

Please see above ;)

Also why do we see the scenario you describe in the wild? I agree that
this should be taken care of in the kernel as well, but according to my
understanding QEMU is already supposed to reject the second SSCH (CPU 2)
with cc 2 because it sees that FC clear function is set. Or?

Regards,
Halil

> 
> [1] https://lore.kernel.org/kvm/20200124145455.51181-1-farman@linux.ibm.com/
> [2] https://lore.kernel.org/kvm/9635c45f-4652-c837-d256-46f426737a5e@linux.ibm.com/
> [3] https://lore.kernel.org/kvm/20200505122745.53208-1-farman@linux.ibm.com/
> 
> Eric Farman (4):
>   vfio-ccw: Do not reset FSM state for unsolicited interrupts
>   vfio-ccw: Utilize scsw actl to serialize start operations
>   vfio-ccw: Expand SCSW usage to HALT and CLEAR
>   vfio-ccw: Clean up how to react to a failed START
> 
>  drivers/s390/cio/vfio_ccw_drv.c | 10 +++++++++-
>  drivers/s390/cio/vfio_ccw_fsm.c | 26 ++++++++++++++++++++++++++
>  drivers/s390/cio/vfio_ccw_ops.c |  3 +--
>  3 files changed, 36 insertions(+), 3 deletions(-)
>
Eric Farman May 15, 2020, 1:09 p.m. UTC | #2
On 5/14/20 9:46 AM, Halil Pasic wrote:
> On Wed, 13 May 2020 16:29:30 +0200
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> Hi Conny,
>>
>> Back in January, I suggested a small patch [1] to try to clean up
>> the handling of HSCH/CSCH interrupts, especially as it relates to
>> concurrent SSCH interrupts. Here is a new attempt to address this.
>>
>> There was some suggestion earlier about locking the FSM, but I'm not
>> seeing any problems with that. Rather, what I'm noticing is that the
>> flow between a synchronous START and asynchronous HALT/CLEAR have
>> different impacts on the FSM state. Consider:
>>
>>     CPU 1                           CPU 2
>>
>>     SSCH (set state=CP_PENDING)
>>     INTERRUPT (set state=IDLE)
>>     CSCH (no change in state)
>>                                     SSCH (set state=CP_PENDING)
>>     INTERRUPT (set state=IDLE)
>>                                     INTERRUPT (set state=IDLE)
> 
> How does the PoP view of the composite device look like 
> (where composite device = vfio-ccw + host device)?

(Just want to be sure that "composite device" is a term that you're
creating, because it's not one I'm familiar with.)

In today's code, there isn't a "composite device" that contains
POPs-defined state information like we find in, for example, the host
SCHIB, but that incorporates vfio-ccw state. This series (in a far more
architecturally valid, non-RFC, form) would get us closer to that.

> 
> I suppose at the moment where we accept the CSCH the FC bit
> indicated clear function (19) goes to set. When this happens
> there are 2 possibilities: either the start (17) bit is set,
> or it is not. You describe a scenario where the start bit is
> not set. In that case we don't have a channel program that
> is currently being processed, and any SSCH must bounce right
> off without doing anything (with cc 2) as long as FC clear
> is set. Please note that we are still talking about the composite
> device here.

Correct. Though, whether the START function control bit is currently set
is immaterial to a CLEAR function; that's the biggest recovery action we
have at our disposal, and will always go through.

The point is that there is nothing to prevent the START on CPU 2 from
going through. The CLEAR does not affect the FSM today, and doesn't
record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
return code from the SSCH instruction to give us cc0 vs cc2 (or
whatever). The actual results of that will depend, since the CPU may
have presented the interrupt for the CLEAR (via the .irq callback and
thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
legal from its point of view. Since vfio-ccw may not have unstacked the
work it placed to finish up that interrupt handler means we have a problem.

> 
> Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
> happen is already wrong. We should have rejected to look at the
> channel program in the first place. Because as far as I can tell
> for the composite device the FC clear bit remains set until we
> process the last interrupt on the CPU 1 side in your scenario. Or
> am I wrong?

You're not wrong. You're agreeing with everything I've described.  :)

> 
> AFAIR I was preaching about this several times, but could never
> convince the people that 'let the host ccw device sort out
> concurrency' is not the way this should work.

I'm going to ignore this paragraph.

> 
> Maybe I have got a hole in my argument somewhere. If my argument
> is wrong, please do point out why!
> 
>>
>> The second interrupt on CPU 1 will call cp_free() for the START
>> created by CPU 2, and our results will be, um, messy. This
>> suggests that three things must be true:
>>
>>  A) cp_free() must be called for either a final interrupt,
>> or a failure issuing a SSCH
>>  B) The FSM state must only be set to IDLE once cp_free()
>> has been called
>>  C) A SSCH cannot be issued while an interrupt is outstanding
> 
> So you propose to reject SSCH in the IDLE state (if an interrupt
> is outstanding)? That is one silly IDLE state and FSM for sure.

And sending a HSCH/CSCH without affecting the FSM or a POPs-valid
structure is not silly?

> 
>>
>> It's not great that items B and C are separated in the interrupt
>> path by a mutex boundary where the copy into io_region occurs.
>> We could (and perhaps should) move them together, which would
>> improve things somewhat, but still doesn't address the scenario
>> laid out above. Even putting that work within the mutex window
>> during interrupt processing doesn't address things totally.
>>
>> So, the patches here do two things. One to handle unsolicited
>> interrupts [2], which I think is pretty straightforward. Though,
>> it does make me want to do a more drastic rearranging of the
>> affected function. :)
>>
>> The other warrants some discussion, since it's sort of weird.
>> Basically, recognize what commands we've issued, expect interrupts
>> for each, and prevent new SSCH's while asynchronous commands are
>> pending. It doesn't address interrupts from the HSCH/CSCH that
>> could be generated by the Channel Path Handling code [3] for an
>> offline CHPID yet, and it needs some TLC to make checkpatch happy.
>> But it's the best possibility I've seen thus far.
>>
>> I used private->scsw for this because it's barely used for anything
>> else; at one point I had been considering removing it outright because
>> we have entirely too many SCSW's in this code. :) I could implement
>> this as three small flags in private, to simplify the code and avoid
>> confusion with the different fctl/actl flags in private vs schib.
>>
> 
> Implementing the FSM described in the PoP (which in turn
> conceptually relies on atomic operations on the FC bits) is IMHO
> the way to go. But we can track that info in our FSM states. In
> fact our FSM states should just add sub-partitioning of states to
> those states (if necessary).

I'm not prepared to rule this out, as I originally stated, but I'm not
thrilled with this idea. Today, we have FSM events for an IO (START) and
asynchronous commands (HALT and CLEAR), and we have FSM states for the
different stages of a START operation. Making asynchronous commands
affect the FSM state isn't too big of a problem, but what happens if we
expand the asynchronous support to handle other commands, such as CANCEL
or RESUME? They don't have FC bits of their own to map into an FSM, and
(just like HALT/CLEAR) have some reliance on the fctl/actl/stctl bits of
the SCHIB.

The fact that you're talking about sub-partitioning of states, as we
have with CP_PROCESSING and CP_PENDING in the context of a START,
suggest we'd drift farther from what one actually finds in POPs and make
it harder to decipher what vfio-ccw is actually up to.

> 
> 
>> It does make me wonder whether the CP_PENDING check before cp_free()
>> is still necessary, but that's a query for a later date.
>>
>> Also, perhaps this could be accomplished with two new FSM states,
>> one for a HALT and one for a CLEAR. I put it aside because the
>> interrupt handler got a little hairy with that, but I'm still looking
>> at it in parallel to this discussion.
>>
> 
> You don't necessarily need 2 new states. Just one that corresponds
> to FC clear function will do.

I don't think so. Maybe for my example above regarding a CSCH, but we
can issue a HSCH in the same way today, and a HALT should also get a cc2
if either a HALT or CLEAR command is already in progress. Just as a
START should get cc2 for a START, HALT, or CLEAR.

> 
>> I look forward to hearing your thoughts...
> 
> Please see above ;)
> 
> Also why do we see the scenario you describe in the wild? I agree that
> this should be taken care of in the kernel as well, but according to my
> understanding QEMU is already supposed to reject the second SSCH (CPU 2)
> with cc 2 because it sees that FC clear function is set. Or?

Maybe for virtio, but for vfio this all gets passed through to the
kernel who makes that distinction. And as I've mentioned above, that's
not happening.

> 
> Regards,
> Halil
> 
>>
>> [1] https://lore.kernel.org/kvm/20200124145455.51181-1-farman@linux.ibm.com/
>> [2] https://lore.kernel.org/kvm/9635c45f-4652-c837-d256-46f426737a5e@linux.ibm.com/
>> [3] https://lore.kernel.org/kvm/20200505122745.53208-1-farman@linux.ibm.com/
>>
>> Eric Farman (4):
>>   vfio-ccw: Do not reset FSM state for unsolicited interrupts
>>   vfio-ccw: Utilize scsw actl to serialize start operations
>>   vfio-ccw: Expand SCSW usage to HALT and CLEAR
>>   vfio-ccw: Clean up how to react to a failed START
>>
>>  drivers/s390/cio/vfio_ccw_drv.c | 10 +++++++++-
>>  drivers/s390/cio/vfio_ccw_fsm.c | 26 ++++++++++++++++++++++++++
>>  drivers/s390/cio/vfio_ccw_ops.c |  3 +--
>>  3 files changed, 36 insertions(+), 3 deletions(-)
>>
>
Halil Pasic May 15, 2020, 2:55 p.m. UTC | #3
On Fri, 15 May 2020 09:09:47 -0400
Eric Farman <farman@linux.ibm.com> wrote:

> 
> 
> On 5/14/20 9:46 AM, Halil Pasic wrote:
> > On Wed, 13 May 2020 16:29:30 +0200
> > Eric Farman <farman@linux.ibm.com> wrote:
> > 
> >> Hi Conny,
> >>
> >> Back in January, I suggested a small patch [1] to try to clean up
> >> the handling of HSCH/CSCH interrupts, especially as it relates to
> >> concurrent SSCH interrupts. Here is a new attempt to address this.
> >>
> >> There was some suggestion earlier about locking the FSM, but I'm not
> >> seeing any problems with that. Rather, what I'm noticing is that the
> >> flow between a synchronous START and asynchronous HALT/CLEAR have
> >> different impacts on the FSM state. Consider:
> >>
> >>     CPU 1                           CPU 2
> >>
> >>     SSCH (set state=CP_PENDING)
> >>     INTERRUPT (set state=IDLE)
> >>     CSCH (no change in state)
> >>                                     SSCH (set state=CP_PENDING)
> >>     INTERRUPT (set state=IDLE)
> >>                                     INTERRUPT (set state=IDLE)
> > 
> > How does the PoP view of the composite device look like 
> > (where composite device = vfio-ccw + host device)?
> 
> (Just want to be sure that "composite device" is a term that you're
> creating, because it's not one I'm familiar with.)

Yes I created this term because I'm unaware of an established one, and
I could not come up with a better one. If you have something established
or better please do tell, I will start using that.

> 
> In today's code, there isn't a "composite device" that contains
> POPs-defined state information like we find in, for example, the host
> SCHIB, but that incorporates vfio-ccw state. This series (in a far more
> architecturally valid, non-RFC, form) would get us closer to that.
> 

I think it is very important to start thinking about the ccw devices
that are exposed to the guest as a "composite device" in a sense that
the (passed-through) host ccw device is wrapped by vfio-ccw. For instance
the guest sees the SCHIB exposed by the vfio-ccw wrapper (adaptation
code in qemu and the kernel module), and not the SCHIB of the host
device.

This series definitely has some progressive thoughts. It just that
IMHO we sould to be more radical about this. And yes I'm a bit
frustrated.

> > 
> > I suppose at the moment where we accept the CSCH the FC bit
> > indicated clear function (19) goes to set. When this happens
> > there are 2 possibilities: either the start (17) bit is set,
> > or it is not. You describe a scenario where the start bit is
> > not set. In that case we don't have a channel program that
> > is currently being processed, and any SSCH must bounce right
> > off without doing anything (with cc 2) as long as FC clear
> > is set. Please note that we are still talking about the composite
> > device here.
> 
> Correct. Though, whether the START function control bit is currently set
> is immaterial to a CLEAR function; that's the biggest recovery action we
> have at our disposal, and will always go through.
> 
> The point is that there is nothing to prevent the START on CPU 2 from
> going through. The CLEAR does not affect the FSM today, and doesn't
> record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
> return code from the SSCH instruction to give us cc0 vs cc2 (or
> whatever). The actual results of that will depend, since the CPU may
> have presented the interrupt for the CLEAR (via the .irq callback and
> thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
> legal from its point of view. Since vfio-ccw may not have unstacked the
> work it placed to finish up that interrupt handler means we have a problem.
> 
> > 
> > Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
> > happen is already wrong. We should have rejected to look at the
> > channel program in the first place. Because as far as I can tell
> > for the composite device the FC clear bit remains set until we
> > process the last interrupt on the CPU 1 side in your scenario. Or
> > am I wrong?
> 
> You're not wrong. You're agreeing with everything I've described.  :)
> 

I'm happy our understanding seems to converge! :)

My problem is that you tie the denial of SSCH to outstanding interrupts
("C) A SSCH cannot be issued while an interrupt is outstanding") while
the PoP says "Condition code 2 is set, and no other action is
taken, when a start, halt, or clear function is currently
in progress at the subchannel (see “Function Control
(FC)” on page 16-22)".

This may or man not be what you have actually implemented, but it is what
you describe in this cover letter. Also patches 2 and 3 do the 
serialization operate on activity controls instead of on the function
controls (as described by the PoP).

> > 
> > AFAIR I was preaching about this several times, but could never
> > convince the people that 'let the host ccw device sort out
> > concurrency' is not the way this should work.
> 
> I'm going to ignore this paragraph.
> 

;) Wise, was just frustration venting on my side.

> > 
> > Maybe I have got a hole in my argument somewhere. If my argument
> > is wrong, please do point out why!
> > 
> >>
> >> The second interrupt on CPU 1 will call cp_free() for the START
> >> created by CPU 2, and our results will be, um, messy. This
> >> suggests that three things must be true:
> >>
> >>  A) cp_free() must be called for either a final interrupt,
> >> or a failure issuing a SSCH
> >>  B) The FSM state must only be set to IDLE once cp_free()
> >> has been called
> >>  C) A SSCH cannot be issued while an interrupt is outstanding
> > 
> > So you propose to reject SSCH in the IDLE state (if an interrupt
> > is outstanding)? That is one silly IDLE state and FSM for sure.
> 
> And sending a HSCH/CSCH without affecting the FSM or a POPs-valid
> structure is not silly?
> 

It is silly! I'm frustrated because I could not convince my peers that
it is silly.

> > 
> >>
> >> It's not great that items B and C are separated in the interrupt
> >> path by a mutex boundary where the copy into io_region occurs.
> >> We could (and perhaps should) move them together, which would
> >> improve things somewhat, but still doesn't address the scenario
> >> laid out above. Even putting that work within the mutex window
> >> during interrupt processing doesn't address things totally.
> >>
> >> So, the patches here do two things. One to handle unsolicited
> >> interrupts [2], which I think is pretty straightforward. Though,
> >> it does make me want to do a more drastic rearranging of the
> >> affected function. :)
> >>
> >> The other warrants some discussion, since it's sort of weird.
> >> Basically, recognize what commands we've issued, expect interrupts
> >> for each, and prevent new SSCH's while asynchronous commands are
> >> pending. It doesn't address interrupts from the HSCH/CSCH that
> >> could be generated by the Channel Path Handling code [3] for an
> >> offline CHPID yet, and it needs some TLC to make checkpatch happy.
> >> But it's the best possibility I've seen thus far.
> >>
> >> I used private->scsw for this because it's barely used for anything
> >> else; at one point I had been considering removing it outright because
> >> we have entirely too many SCSW's in this code. :) I could implement
> >> this as three small flags in private, to simplify the code and avoid
> >> confusion with the different fctl/actl flags in private vs schib.
> >>
> > 
> > Implementing the FSM described in the PoP (which in turn
> > conceptually relies on atomic operations on the FC bits) is IMHO
> > the way to go. But we can track that info in our FSM states. In
> > fact our FSM states should just add sub-partitioning of states to
> > those states (if necessary).
> 
> I'm not prepared to rule this out, as I originally stated, but I'm not
> thrilled with this idea. Today, we have FSM events for an IO (START) and
> asynchronous commands (HALT and CLEAR), and we have FSM states for the
> different stages of a START operation. Making asynchronous commands
> affect the FSM state isn't too big of a problem, but what happens if we
> expand the asynchronous support to handle other commands, such as CANCEL
> or RESUME? They don't have FC bits of their own to map into an FSM, and
> (just like HALT/CLEAR) have some reliance on the fctl/actl/stctl bits of
> the SCHIB.
> 

Well, fctl/actl/stctl can be seen as a possible representation of states
and state transitions. For example for RESUME the PoP says
"Condition code 2 is set, and no other action is
taken, when the resume function is not applicable.
The resume function is not applicable when the sub-
channel (1) has any function other than the start
function alone specified, (2) has no function speci-
fied, (3) is resume pending, or (4) does not have sus-
pend control specified for the start function in
progress"

What I'm trying to say is, there is a state machine described in the PoP,
and there the state transitions are marked by interlocked updates of bits
in fctl/actl/stctl. I don't know how many bits relevant for the state
machine are, but I'm pretty confident they can be packaged up in a
quadword. So if we want we can represent that stuff with a state variable
of ours.

BTW the whole notion of synchronous and asynchronous commands is IMHO
broken. I tried to argue against it back then. If you read the PoP SSCH
is asynchronous as well.

> The fact that you're talking about sub-partitioning of states, as we
> have with CP_PROCESSING and CP_PENDING in the context of a START,
> suggest we'd drift farther from what one actually finds in POPs and make
> it harder to decipher what vfio-ccw is actually up to.
> 

Yes. This may or may not be justified. Because we need to do that
translation-prefetching and we need to do some cleanup that in a
non-virtualized context isn't there additional states may be beneficial
or even necessary. But if our state machine is interlacing with
the PoP state machine something is very wrong.

> > 
> > 
> >> It does make me wonder whether the CP_PENDING check before cp_free()
> >> is still necessary, but that's a query for a later date.
> >>
> >> Also, perhaps this could be accomplished with two new FSM states,
> >> one for a HALT and one for a CLEAR. I put it aside because the
> >> interrupt handler got a little hairy with that, but I'm still looking
> >> at it in parallel to this discussion.
> >>
> > 
> > You don't necessarily need 2 new states. Just one that corresponds
> > to FC clear function will do.
> 
> I don't think so. Maybe for my example above regarding a CSCH, but we
> can issue a HSCH in the same way today, and a HALT should also get a cc2
> if either a HALT or CLEAR command is already in progress. Just as a
> START should get cc2 for a START, HALT, or CLEAR.
> 

Right. I was wrong. I misremembered something and mixed stuff up. We
have separate FC bits for HALT and CLEAR in the PoP and that's for a
good reason. Sorry for the noise.

> > 
> >> I look forward to hearing your thoughts...
> > 
> > Please see above ;)
> > 
> > Also why do we see the scenario you describe in the wild? I agree that
> > this should be taken care of in the kernel as well, but according to my
> > understanding QEMU is already supposed to reject the second SSCH (CPU 2)
> > with cc 2 because it sees that FC clear function is set. Or?
> 
> Maybe for virtio, but for vfio this all gets passed through to the
> kernel who makes that distinction. And as I've mentioned above, that's
> not happening.

Let's have a look at the following qemu functions. AFAIK it is
common to vfio and virtio, or? Will prefix my inline 
comments with ###


IOInstEnding css_do_ssch(SubchDev *sch, ORB *orb)
{   
    SCHIB *schib = &sch->curr_status;

    if (~(schib->pmcw.flags) & (PMCW_FLAGS_MASK_DNV | PMCW_FLAGS_MASK_ENA)) {
        return IOINST_CC_NOT_OPERATIONAL;
    }
    
    if (schib->scsw.ctrl & SCSW_STCTL_STATUS_PEND) {
        return IOINST_CC_STATUS_PRESENT;
    }
    
    if (schib->scsw.ctrl & (SCSW_FCTL_START_FUNC |
                   SCSW_FCTL_HALT_FUNC |
                   SCSW_FCTL_CLEAR_FUNC)) {
        return IOINST_CC_BUSY;
### Here we check the FC values and give cc2 when we
### need to.
    }
    
    /* If monitoring is active, update counter. */
    if (channel_subsys.chnmon_active) {
        css_update_chnmon(sch);
    }
    sch->orb = *orb;
    sch->channel_prog = orb->cpa;
    /* Trigger the start function. */
    schib->scsw.ctrl |= (SCSW_FCTL_START_FUNC | SCSW_ACTL_START_PEND);
## Here we set the FC
    schib->scsw.flags &= ~SCSW_FLAGS_MASK_PNO;

    return do_subchannel_work(sch);
}

IOInstEnding css_do_csch(SubchDev *sch)
{
    SCHIB *schib = &sch->curr_status;

    if (~(schib->pmcw.flags) & (PMCW_FLAGS_MASK_DNV | PMCW_FLAGS_MASK_ENA)) {
        return IOINST_CC_NOT_OPERATIONAL;
    }

    /* Trigger the clear function. */
    schib->scsw.ctrl &= ~(SCSW_CTRL_MASK_FCTL | SCSW_CTRL_MASK_ACTL);
    schib->scsw.ctrl |= SCSW_FCTL_CLEAR_FUNC | SCSW_ACTL_CLEAR_PEND;
### Here we set FC for the clear

    return do_subchannel_work(sch);
}

So unless somebody (e.g. the kernel vfio-ccw) nukes the FC bits qemu
should prevent the second SSCH from your example getting to the kernel,
or?

This is why I hold the notion of a "composite device" for
important.

Regards,
Halil
Cornelia Huck May 15, 2020, 3:58 p.m. UTC | #4
[only some very high-level comments; I have not had time for this yet
and it's late on a Friday]

On Fri, 15 May 2020 16:55:39 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Fri, 15 May 2020 09:09:47 -0400
> Eric Farman <farman@linux.ibm.com> wrote:
> 
> > 
> > 
> > On 5/14/20 9:46 AM, Halil Pasic wrote:  
> > > On Wed, 13 May 2020 16:29:30 +0200
> > > Eric Farman <farman@linux.ibm.com> wrote:
> > >   
> > >> Hi Conny,
> > >>
> > >> Back in January, I suggested a small patch [1] to try to clean up
> > >> the handling of HSCH/CSCH interrupts, especially as it relates to
> > >> concurrent SSCH interrupts. Here is a new attempt to address this.
> > >>
> > >> There was some suggestion earlier about locking the FSM, but I'm not
> > >> seeing any problems with that. Rather, what I'm noticing is that the
> > >> flow between a synchronous START and asynchronous HALT/CLEAR have
> > >> different impacts on the FSM state. Consider:
> > >>
> > >>     CPU 1                           CPU 2
> > >>
> > >>     SSCH (set state=CP_PENDING)
> > >>     INTERRUPT (set state=IDLE)
> > >>     CSCH (no change in state)
> > >>                                     SSCH (set state=CP_PENDING)
> > >>     INTERRUPT (set state=IDLE)
> > >>                                     INTERRUPT (set state=IDLE)  
> > > 
> > > How does the PoP view of the composite device look like 
> > > (where composite device = vfio-ccw + host device)?  
> > 
> > (Just want to be sure that "composite device" is a term that you're
> > creating, because it's not one I'm familiar with.)  
> 
> Yes I created this term because I'm unaware of an established one, and
> I could not come up with a better one. If you have something established
> or better please do tell, I will start using that.

I don't think "composite" is a term I would use; in the end, we are
talking about a vfio-ccw device that gets some of its state from the
host device. As such, I don't think there's really anything like a "PoP
view" of it; the part that should comply with what the PoP specifies is
what gets exposed to the guest.

> 
> > 
> > In today's code, there isn't a "composite device" that contains
> > POPs-defined state information like we find in, for example, the host
> > SCHIB, but that incorporates vfio-ccw state. This series (in a far more
> > architecturally valid, non-RFC, form) would get us closer to that.
> >   
> 
> I think it is very important to start thinking about the ccw devices
> that are exposed to the guest as a "composite device" in a sense that
> the (passed-through) host ccw device is wrapped by vfio-ccw. For instance
> the guest sees the SCHIB exposed by the vfio-ccw wrapper (adaptation
> code in qemu and the kernel module), and not the SCHIB of the host
> device.

See my comments on that above.

> 
> This series definitely has some progressive thoughts. It just that
> IMHO we sould to be more radical about this. And yes I'm a bit
> frustrated.
> 
> > > 
> > > I suppose at the moment where we accept the CSCH the FC bit
> > > indicated clear function (19) goes to set. When this happens
> > > there are 2 possibilities: either the start (17) bit is set,
> > > or it is not. You describe a scenario where the start bit is
> > > not set. In that case we don't have a channel program that
> > > is currently being processed, and any SSCH must bounce right
> > > off without doing anything (with cc 2) as long as FC clear
> > > is set. Please note that we are still talking about the composite
> > > device here.  
> > 
> > Correct. Though, whether the START function control bit is currently set
> > is immaterial to a CLEAR function; that's the biggest recovery action we
> > have at our disposal, and will always go through.
> > 
> > The point is that there is nothing to prevent the START on CPU 2 from
> > going through. The CLEAR does not affect the FSM today, and doesn't
> > record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
> > return code from the SSCH instruction to give us cc0 vs cc2 (or
> > whatever). The actual results of that will depend, since the CPU may
> > have presented the interrupt for the CLEAR (via the .irq callback and
> > thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
> > legal from its point of view. Since vfio-ccw may not have unstacked the
> > work it placed to finish up that interrupt handler means we have a problem.
> >   
> > > 
> > > Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
> > > happen is already wrong. We should have rejected to look at the
> > > channel program in the first place. Because as far as I can tell
> > > for the composite device the FC clear bit remains set until we
> > > process the last interrupt on the CPU 1 side in your scenario. Or
> > > am I wrong?  
> > 
> > You're not wrong. You're agreeing with everything I've described.  :)
> >   
> 
> I'm happy our understanding seems to converge! :)
> 
> My problem is that you tie the denial of SSCH to outstanding interrupts
> ("C) A SSCH cannot be issued while an interrupt is outstanding") while
> the PoP says "Condition code 2 is set, and no other action is
> taken, when a start, halt, or clear function is currently
> in progress at the subchannel (see “Function Control
> (FC)” on page 16-22)".
> 
> This may or man not be what you have actually implemented, but it is what
> you describe in this cover letter. Also patches 2 and 3 do the 
> serialization operate on activity controls instead of on the function
> controls (as described by the PoP).

I have not really had the cycles yet to look at this series in detail,
but should our goal really be to mimic what the PoP talks about in
vfio-ccw (both kernel and userspace parts)? IMO, the important part is
that the guest sees a device that acts *towards the guest* in a way
that is compliant with the PoP; we can take different paths inside
vfio-ccw.

(...)

> BTW the whole notion of synchronous and asynchronous commands is IMHO
> broken. I tried to argue against it back then. If you read the PoP SSCH
> is asynchronous as well.

I don't see where we ever stated that SSCH was synchronous. That would
be silly. The async region is called the async region because it is
used for some async commands (HSCH/CSCH), not because it is the only
way to do async commands. (The original idea was to extend the I/O
region for HSCH/CSCH, but that turned out to be awkward.)

(...)

I hope I can find more time to look at this next week, but as it will
be a short work week for me and I'm already swamped with various other
things, I fear that you should keep your expectations low.
Halil Pasic May 15, 2020, 5:41 p.m. UTC | #5
On Fri, 15 May 2020 17:58:49 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> [only some very high-level comments; I have not had time for this yet
> and it's late on a Friday]
> 
> On Fri, 15 May 2020 16:55:39 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
> > On Fri, 15 May 2020 09:09:47 -0400
> > Eric Farman <farman@linux.ibm.com> wrote:
> > 
> > > 
> > > 
> > > On 5/14/20 9:46 AM, Halil Pasic wrote:  
> > > > On Wed, 13 May 2020 16:29:30 +0200
> > > > Eric Farman <farman@linux.ibm.com> wrote:
> > > >   
> > > >> Hi Conny,
> > > >>
> > > >> Back in January, I suggested a small patch [1] to try to clean up
> > > >> the handling of HSCH/CSCH interrupts, especially as it relates to
> > > >> concurrent SSCH interrupts. Here is a new attempt to address this.
> > > >>
> > > >> There was some suggestion earlier about locking the FSM, but I'm not
> > > >> seeing any problems with that. Rather, what I'm noticing is that the
> > > >> flow between a synchronous START and asynchronous HALT/CLEAR have
> > > >> different impacts on the FSM state. Consider:
> > > >>
> > > >>     CPU 1                           CPU 2
> > > >>
> > > >>     SSCH (set state=CP_PENDING)
> > > >>     INTERRUPT (set state=IDLE)
> > > >>     CSCH (no change in state)
> > > >>                                     SSCH (set state=CP_PENDING)
> > > >>     INTERRUPT (set state=IDLE)
> > > >>                                     INTERRUPT (set state=IDLE)  
> > > > 
> > > > How does the PoP view of the composite device look like 
> > > > (where composite device = vfio-ccw + host device)?  
> > > 
> > > (Just want to be sure that "composite device" is a term that you're
> > > creating, because it's not one I'm familiar with.)  
> > 
> > Yes I created this term because I'm unaware of an established one, and
> > I could not come up with a better one. If you have something established
> > or better please do tell, I will start using that.
> 
> I don't think "composite" is a term I would use; in the end, we are
> talking about a vfio-ccw device that gets some of its state from the
> host device. As such, I don't think there's really anything like a "PoP
> view" of it; the part that should comply with what the PoP specifies is
> what gets exposed to the guest.

I agree, what matters is the architectural compliance of the device as
exposed to the guest. I tried to refer the architectural view of the
whole device as the PoP view. The diagram above is clearly operating
on the vfio-ccw kernel module level as state=IDLE and state=CP_PENDING
is stuff on that level.

What I tried to illuminate by using the word 'composite', is that what
the guest sees is a superposition of what the QEMU code does, what the
kernel module code does and what the host device does. And that the
situation described should have been prevented prior the second
SSCH hitting the kernel module (at least according to my understanding).

Let us forget about the term composite. But we do have be aware of how
things compose. ;)

> 
> > 
> > > 
> > > In today's code, there isn't a "composite device" that contains
> > > POPs-defined state information like we find in, for example, the host
> > > SCHIB, but that incorporates vfio-ccw state. This series (in a far more
> > > architecturally valid, non-RFC, form) would get us closer to that.
> > >   
> > 
> > I think it is very important to start thinking about the ccw devices
> > that are exposed to the guest as a "composite device" in a sense that
> > the (passed-through) host ccw device is wrapped by vfio-ccw. For instance
> > the guest sees the SCHIB exposed by the vfio-ccw wrapper (adaptation
> > code in qemu and the kernel module), and not the SCHIB of the host
> > device.
> 
> See my comments on that above.
>

Sorry for muddying the water. :(
 
> > 
> > This series definitely has some progressive thoughts. It just that
> > IMHO we sould to be more radical about this. And yes I'm a bit
> > frustrated.
> > 
> > > > 
> > > > I suppose at the moment where we accept the CSCH the FC bit
> > > > indicated clear function (19) goes to set. When this happens
> > > > there are 2 possibilities: either the start (17) bit is set,
> > > > or it is not. You describe a scenario where the start bit is
> > > > not set. In that case we don't have a channel program that
> > > > is currently being processed, and any SSCH must bounce right
> > > > off without doing anything (with cc 2) as long as FC clear
> > > > is set. Please note that we are still talking about the composite
> > > > device here.  
> > > 
> > > Correct. Though, whether the START function control bit is currently set
> > > is immaterial to a CLEAR function; that's the biggest recovery action we
> > > have at our disposal, and will always go through.
> > > 
> > > The point is that there is nothing to prevent the START on CPU 2 from
> > > going through. The CLEAR does not affect the FSM today, and doesn't
> > > record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
> > > return code from the SSCH instruction to give us cc0 vs cc2 (or
> > > whatever). The actual results of that will depend, since the CPU may
> > > have presented the interrupt for the CLEAR (via the .irq callback and
> > > thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
> > > legal from its point of view. Since vfio-ccw may not have unstacked the
> > > work it placed to finish up that interrupt handler means we have a problem.
> > >   
> > > > 
> > > > Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
> > > > happen is already wrong. We should have rejected to look at the
> > > > channel program in the first place. Because as far as I can tell
> > > > for the composite device the FC clear bit remains set until we
> > > > process the last interrupt on the CPU 1 side in your scenario. Or
> > > > am I wrong?  
> > > 
> > > You're not wrong. You're agreeing with everything I've described.  :)
> > >   
> > 
> > I'm happy our understanding seems to converge! :)
> > 
> > My problem is that you tie the denial of SSCH to outstanding interrupts
> > ("C) A SSCH cannot be issued while an interrupt is outstanding") while
> > the PoP says "Condition code 2 is set, and no other action is
> > taken, when a start, halt, or clear function is currently
> > in progress at the subchannel (see “Function Control
> > (FC)” on page 16-22)".
> > 
> > This may or man not be what you have actually implemented, but it is what
> > you describe in this cover letter. Also patches 2 and 3 do the 
> > serialization operate on activity controls instead of on the function
> > controls (as described by the PoP).
> 
> I have not really had the cycles yet to look at this series in detail,
> but should our goal really be to mimic what the PoP talks about in
> vfio-ccw (both kernel and userspace parts)? IMO, the important part is
> that the guest sees a device that acts *towards the guest* in a way
> that is compliant with the PoP; we can take different paths inside
> vfio-ccw.

I agree with the principle, but I don't agree with some of the paths
taken :)

Moreover, each time we decide to not take the path that maps the PoP
in the most straightforward fashion, we better be very careful.

> 
> (...)
> 
> > BTW the whole notion of synchronous and asynchronous commands is IMHO
> > broken. I tried to argue against it back then. If you read the PoP SSCH
> > is asynchronous as well.
> 
> I don't see where we ever stated that SSCH was synchronous. That would
> be silly. The async region is called the async region because it is
> used for some async commands (HSCH/CSCH), not because it is the only
> way to do async commands. (The original idea was to extend the I/O
> region for HSCH/CSCH, but that turned out to be awkward.)

What I don't understand is why are HALT and CLEAR async commands, and
START a synchronous command, or a non-async command.

> 
> (...)
> 
> I hope I can find more time to look at this next week, but as it will
> be a short work week for me and I'm already swamped with various other
> things, I fear that you should keep your expectations low.
> 

Thanks for chiming in. And please do complain when I talk silly.
Unfortunately I'm no great communicator.

Regards,
Halil
Eric Farman May 15, 2020, 6:12 p.m. UTC | #6
On 5/15/20 10:55 AM, Halil Pasic wrote:
> On Fri, 15 May 2020 09:09:47 -0400
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>>
>>
>> On 5/14/20 9:46 AM, Halil Pasic wrote:
>>> On Wed, 13 May 2020 16:29:30 +0200
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>
>>>> Hi Conny,
>>>>
>>>> Back in January, I suggested a small patch [1] to try to clean up
>>>> the handling of HSCH/CSCH interrupts, especially as it relates to
>>>> concurrent SSCH interrupts. Here is a new attempt to address this.
>>>>
>>>> There was some suggestion earlier about locking the FSM, but I'm not
>>>> seeing any problems with that. Rather, what I'm noticing is that the
>>>> flow between a synchronous START and asynchronous HALT/CLEAR have
>>>> different impacts on the FSM state. Consider:
>>>>
>>>>     CPU 1                           CPU 2
>>>>
>>>>     SSCH (set state=CP_PENDING)
>>>>     INTERRUPT (set state=IDLE)
>>>>     CSCH (no change in state)
>>>>                                     SSCH (set state=CP_PENDING)
>>>>     INTERRUPT (set state=IDLE)
>>>>                                     INTERRUPT (set state=IDLE)
>>>
>>> How does the PoP view of the composite device look like 
>>> (where composite device = vfio-ccw + host device)?
>>
>> (Just want to be sure that "composite device" is a term that you're
>> creating, because it's not one I'm familiar with.)
> 
> Yes I created this term because I'm unaware of an established one, and
> I could not come up with a better one. If you have something established
> or better please do tell, I will start using that.
> 
>>
>> In today's code, there isn't a "composite device" that contains
>> POPs-defined state information like we find in, for example, the host
>> SCHIB, but that incorporates vfio-ccw state. This series (in a far more
>> architecturally valid, non-RFC, form) would get us closer to that.
>>
> 
> I think it is very important to start thinking about the ccw devices
> that are exposed to the guest as a "composite device" in a sense that
> the (passed-through) host ccw device is wrapped by vfio-ccw. For instance
> the guest sees the SCHIB exposed by the vfio-ccw wrapper (adaptation
> code in qemu and the kernel module), and not the SCHIB of the host
> device.

Well, I think of "ccw devices that are exposed to the guest" as
"vfio-ccw devices" ... They look/smell/sound like CCW devices to the
guest, but have some pieces emulated by QEMU and some passed down to the
host kernel (and/or the device itself). I'm not sure what a "composite
device" buys us in any other context.

> 
> This series definitely has some progressive thoughts. It just that
> IMHO we sould to be more radical about this. And yes I'm a bit
> frustrated.

"we sould to be more radical about this" ... A little help please?

> 
>>>
>>> I suppose at the moment where we accept the CSCH the FC bit
>>> indicated clear function (19) goes to set. When this happens
>>> there are 2 possibilities: either the start (17) bit is set,
>>> or it is not. You describe a scenario where the start bit is
>>> not set. In that case we don't have a channel program that
>>> is currently being processed, and any SSCH must bounce right
>>> off without doing anything (with cc 2) as long as FC clear
>>> is set. Please note that we are still talking about the composite
>>> device here.
>>
>> Correct. Though, whether the START function control bit is currently set
>> is immaterial to a CLEAR function; that's the biggest recovery action we
>> have at our disposal, and will always go through.
>>
>> The point is that there is nothing to prevent the START on CPU 2 from
>> going through. The CLEAR does not affect the FSM today, and doesn't
>> record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
>> return code from the SSCH instruction to give us cc0 vs cc2 (or
>> whatever). The actual results of that will depend, since the CPU may
>> have presented the interrupt for the CLEAR (via the .irq callback and
>> thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
>> legal from its point of view. Since vfio-ccw may not have unstacked the
>> work it placed to finish up that interrupt handler means we have a problem.
>>
>>>
>>> Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
>>> happen is already wrong. We should have rejected to look at the
>>> channel program in the first place. Because as far as I can tell
>>> for the composite device the FC clear bit remains set until we
>>> process the last interrupt on the CPU 1 side in your scenario. Or
>>> am I wrong?
>>
>> You're not wrong. You're agreeing with everything I've described.  :)
>>
> 
> I'm happy our understanding seems to converge! :)
> 
> My problem is that you tie the denial of SSCH to outstanding interrupts
> ("C) A SSCH cannot be issued while an interrupt is outstanding") while
> the PoP says "Condition code 2 is set, and no other action is
> taken, when a start, halt, or clear function is currently
> in progress at the subchannel (see “Function Control
> (FC)” on page 16-22)".
> 
> This may or man not be what you have actually implemented, but it is what
> you describe in this cover letter. Also patches 2 and 3 do the 
> serialization operate on activity controls instead of on the function
> controls (as described by the PoP).

You are conflating two issues here...

1) I do use the ACTL bits in this RFC, which is almost certainly wrong
according to the POPs. But as I suggest here in this cover letter as
well as the commit message for later patches, that's irrelevant. We're
not reflecting these particular bits (or anything else in this SCSW)
back in any way/shape/form today, and they are not used for any other
decision making. I could signal the existence of another operation by
the FSM, or via three non-architected flags, or via the (correct)
architected flags.

2) The denial of the SSCH. You quote the POPs above which says "no other
action is taken, when a start, halt, or clear function is currently in
progress at the subchannel" ... But in this case, vfio-ccw has to
supplement the role of the subchannel. It's NOT done processing the
interrupt, even though the device, subchannel, and (host) CPU think it
all is.

> 
>>>
>>> AFAIR I was preaching about this several times, but could never
>>> convince the people that 'let the host ccw device sort out
>>> concurrency' is not the way this should work.
>>
>> I'm going to ignore this paragraph.
>>
> 
> ;) Wise, was just frustration venting on my side.
> 
>>>
>>> Maybe I have got a hole in my argument somewhere. If my argument
>>> is wrong, please do point out why!
>>>
>>>>
>>>> The second interrupt on CPU 1 will call cp_free() for the START
>>>> created by CPU 2, and our results will be, um, messy. This
>>>> suggests that three things must be true:
>>>>
>>>>  A) cp_free() must be called for either a final interrupt,
>>>> or a failure issuing a SSCH
>>>>  B) The FSM state must only be set to IDLE once cp_free()
>>>> has been called
>>>>  C) A SSCH cannot be issued while an interrupt is outstanding
>>>
>>> So you propose to reject SSCH in the IDLE state (if an interrupt
>>> is outstanding)? That is one silly IDLE state and FSM for sure.
>>
>> And sending a HSCH/CSCH without affecting the FSM or a POPs-valid
>> structure is not silly?
>>
> 
> It is silly! I'm frustrated because I could not convince my peers that
> it is silly.
> 
>>>
>>>>
>>>> It's not great that items B and C are separated in the interrupt
>>>> path by a mutex boundary where the copy into io_region occurs.
>>>> We could (and perhaps should) move them together, which would
>>>> improve things somewhat, but still doesn't address the scenario
>>>> laid out above. Even putting that work within the mutex window
>>>> during interrupt processing doesn't address things totally.
>>>>
>>>> So, the patches here do two things. One to handle unsolicited
>>>> interrupts [2], which I think is pretty straightforward. Though,
>>>> it does make me want to do a more drastic rearranging of the
>>>> affected function. :)
>>>>
>>>> The other warrants some discussion, since it's sort of weird.
>>>> Basically, recognize what commands we've issued, expect interrupts
>>>> for each, and prevent new SSCH's while asynchronous commands are
>>>> pending. It doesn't address interrupts from the HSCH/CSCH that
>>>> could be generated by the Channel Path Handling code [3] for an
>>>> offline CHPID yet, and it needs some TLC to make checkpatch happy.
>>>> But it's the best possibility I've seen thus far.
>>>>
>>>> I used private->scsw for this because it's barely used for anything
>>>> else; at one point I had been considering removing it outright because
>>>> we have entirely too many SCSW's in this code. :) I could implement
>>>> this as three small flags in private, to simplify the code and avoid
>>>> confusion with the different fctl/actl flags in private vs schib.
>>>>
>>>
>>> Implementing the FSM described in the PoP (which in turn
>>> conceptually relies on atomic operations on the FC bits) is IMHO
>>> the way to go. But we can track that info in our FSM states. In
>>> fact our FSM states should just add sub-partitioning of states to
>>> those states (if necessary).
>>
>> I'm not prepared to rule this out, as I originally stated, but I'm not
>> thrilled with this idea. Today, we have FSM events for an IO (START) and
>> asynchronous commands (HALT and CLEAR), and we have FSM states for the
>> different stages of a START operation. Making asynchronous commands
>> affect the FSM state isn't too big of a problem, but what happens if we
>> expand the asynchronous support to handle other commands, such as CANCEL
>> or RESUME? They don't have FC bits of their own to map into an FSM, and
>> (just like HALT/CLEAR) have some reliance on the fctl/actl/stctl bits of
>> the SCHIB.
>>
> 
> Well, fctl/actl/stctl can be seen as a possible representation of states
> and state transitions. For example for RESUME the PoP says
> "Condition code 2 is set, and no other action is
> taken, when the resume function is not applicable.
> The resume function is not applicable when the sub-
> channel (1) has any function other than the start
> function alone specified, (2) has no function speci-
> fied, (3) is resume pending, or (4) does not have sus-
> pend control specified for the start function in
> progress"
> 
> What I'm trying to say is, there is a state machine described in the PoP,
> and there the state transitions are marked by interlocked updates of bits
> in fctl/actl/stctl. I don't know how many bits relevant for the state
> machine are, but I'm pretty confident they can be packaged up in a
> quadword. So if we want we can represent that stuff with a state variable
> of ours.
> 
> BTW the whole notion of synchronous and asynchronous commands is IMHO
> broken. I tried to argue against it back then. If you read the PoP SSCH
> is asynchronous as well.

That is my fault. I was only trying to distinguish the existing
io_region used for SSCH and the new async_region used for HSCH and CSCH.

> 
>> The fact that you're talking about sub-partitioning of states, as we
>> have with CP_PROCESSING and CP_PENDING in the context of a START,
>> suggest we'd drift farther from what one actually finds in POPs and make
>> it harder to decipher what vfio-ccw is actually up to.
>>
> 
> Yes. This may or may not be justified. Because we need to do that
> translation-prefetching and we need to do some cleanup that in a
> non-virtualized context isn't there additional states may be beneficial
> or even necessary. But if our state machine is interlacing with
> the PoP state machine something is very wrong.
> 
>>>
>>>
>>>> It does make me wonder whether the CP_PENDING check before cp_free()
>>>> is still necessary, but that's a query for a later date.
>>>>
>>>> Also, perhaps this could be accomplished with two new FSM states,
>>>> one for a HALT and one for a CLEAR. I put it aside because the
>>>> interrupt handler got a little hairy with that, but I'm still looking
>>>> at it in parallel to this discussion.
>>>>
>>>
>>> You don't necessarily need 2 new states. Just one that corresponds
>>> to FC clear function will do.
>>
>> I don't think so. Maybe for my example above regarding a CSCH, but we
>> can issue a HSCH in the same way today, and a HALT should also get a cc2
>> if either a HALT or CLEAR command is already in progress. Just as a
>> START should get cc2 for a START, HALT, or CLEAR.
>>
> 
> Right. I was wrong. I misremembered something and mixed stuff up. We
> have separate FC bits for HALT and CLEAR in the PoP and that's for a
> good reason. Sorry for the noise.
> 
>>>
>>>> I look forward to hearing your thoughts...
>>>
>>> Please see above ;)
>>>
>>> Also why do we see the scenario you describe in the wild? I agree that
>>> this should be taken care of in the kernel as well, but according to my
>>> understanding QEMU is already supposed to reject the second SSCH (CPU 2)
>>> with cc 2 because it sees that FC clear function is set. Or?
>>
>> Maybe for virtio, but for vfio this all gets passed through to the
>> kernel who makes that distinction. And as I've mentioned above, that's
>> not happening.
> 
> Let's have a look at the following qemu functions. AFAIK it is
> common to vfio and virtio, or? Will prefix my inline 

My mistake, I didn't look far enough up the callchain in my quick look
at the code.

...snip...

> 
> So unless somebody (e.g. the kernel vfio-ccw) nukes the FC bits qemu
> should prevent the second SSCH from your example getting to the kernel,
> or?

It's not so much something "nukes the FC bits" ... but rather that that
the data in the irb_area of the io_region is going to reflect what the
subchannel told us for the interrupt.

Hrm... If something is polling on TSCH instead of waiting for a tap on
the shoulder, that's gonna act weird too. Maybe the bits need to be in
io_region.irb_area proper, rather than this weird private->scsw space.

> 
> This is why I hold the notion of a "composite device" for
> important.
> 
> Regards,
> Halil
>
Eric Farman May 15, 2020, 6:19 p.m. UTC | #7
On 5/15/20 1:41 PM, Halil Pasic wrote:
> On Fri, 15 May 2020 17:58:49 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> [only some very high-level comments; I have not had time for this yet
>> and it's late on a Friday]
>>
>> On Fri, 15 May 2020 16:55:39 +0200
>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> On Fri, 15 May 2020 09:09:47 -0400
>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>
>>>>
>>>>
>>>> On 5/14/20 9:46 AM, Halil Pasic wrote:  
>>>>> On Wed, 13 May 2020 16:29:30 +0200
>>>>> Eric Farman <farman@linux.ibm.com> wrote:
>>>>>   
>>>>>> Hi Conny,
>>>>>>
>>>>>> Back in January, I suggested a small patch [1] to try to clean up
>>>>>> the handling of HSCH/CSCH interrupts, especially as it relates to
>>>>>> concurrent SSCH interrupts. Here is a new attempt to address this.
>>>>>>
>>>>>> There was some suggestion earlier about locking the FSM, but I'm not
>>>>>> seeing any problems with that. Rather, what I'm noticing is that the
>>>>>> flow between a synchronous START and asynchronous HALT/CLEAR have
>>>>>> different impacts on the FSM state. Consider:
>>>>>>
>>>>>>     CPU 1                           CPU 2
>>>>>>
>>>>>>     SSCH (set state=CP_PENDING)
>>>>>>     INTERRUPT (set state=IDLE)
>>>>>>     CSCH (no change in state)
>>>>>>                                     SSCH (set state=CP_PENDING)
>>>>>>     INTERRUPT (set state=IDLE)
>>>>>>                                     INTERRUPT (set state=IDLE)  
>>>>>
>>>>> How does the PoP view of the composite device look like 
>>>>> (where composite device = vfio-ccw + host device)?  
>>>>
>>>> (Just want to be sure that "composite device" is a term that you're
>>>> creating, because it's not one I'm familiar with.)  
>>>
>>> Yes I created this term because I'm unaware of an established one, and
>>> I could not come up with a better one. If you have something established
>>> or better please do tell, I will start using that.
>>
>> I don't think "composite" is a term I would use; in the end, we are
>> talking about a vfio-ccw device that gets some of its state from the
>> host device. As such, I don't think there's really anything like a "PoP
>> view" of it; the part that should comply with what the PoP specifies is
>> what gets exposed to the guest.
> 
> I agree, what matters is the architectural compliance of the device as
> exposed to the guest. I tried to refer the architectural view of the
> whole device as the PoP view. The diagram above is clearly operating
> on the vfio-ccw kernel module level as state=IDLE and state=CP_PENDING
> is stuff on that level.
> 
> What I tried to illuminate by using the word 'composite', is that what
> the guest sees is a superposition of what the QEMU code does, what the
> kernel module code does and what the host device does. And that the
> situation described should have been prevented prior the second
> SSCH hitting the kernel module (at least according to my understanding).
> 
> Let us forget about the term composite. But we do have be aware of how
> things compose. ;)
> 
>>
>>>
>>>>
>>>> In today's code, there isn't a "composite device" that contains
>>>> POPs-defined state information like we find in, for example, the host
>>>> SCHIB, but that incorporates vfio-ccw state. This series (in a far more
>>>> architecturally valid, non-RFC, form) would get us closer to that.
>>>>   
>>>
>>> I think it is very important to start thinking about the ccw devices
>>> that are exposed to the guest as a "composite device" in a sense that
>>> the (passed-through) host ccw device is wrapped by vfio-ccw. For instance
>>> the guest sees the SCHIB exposed by the vfio-ccw wrapper (adaptation
>>> code in qemu and the kernel module), and not the SCHIB of the host
>>> device.
>>
>> See my comments on that above.
>>
> 
> Sorry for muddying the water. :(
>  
>>>
>>> This series definitely has some progressive thoughts. It just that
>>> IMHO we sould to be more radical about this. And yes I'm a bit
>>> frustrated.
>>>
>>>>>
>>>>> I suppose at the moment where we accept the CSCH the FC bit
>>>>> indicated clear function (19) goes to set. When this happens
>>>>> there are 2 possibilities: either the start (17) bit is set,
>>>>> or it is not. You describe a scenario where the start bit is
>>>>> not set. In that case we don't have a channel program that
>>>>> is currently being processed, and any SSCH must bounce right
>>>>> off without doing anything (with cc 2) as long as FC clear
>>>>> is set. Please note that we are still talking about the composite
>>>>> device here.  
>>>>
>>>> Correct. Though, whether the START function control bit is currently set
>>>> is immaterial to a CLEAR function; that's the biggest recovery action we
>>>> have at our disposal, and will always go through.
>>>>
>>>> The point is that there is nothing to prevent the START on CPU 2 from
>>>> going through. The CLEAR does not affect the FSM today, and doesn't
>>>> record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
>>>> return code from the SSCH instruction to give us cc0 vs cc2 (or
>>>> whatever). The actual results of that will depend, since the CPU may
>>>> have presented the interrupt for the CLEAR (via the .irq callback and
>>>> thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
>>>> legal from its point of view. Since vfio-ccw may not have unstacked the
>>>> work it placed to finish up that interrupt handler means we have a problem.
>>>>   
>>>>>
>>>>> Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
>>>>> happen is already wrong. We should have rejected to look at the
>>>>> channel program in the first place. Because as far as I can tell
>>>>> for the composite device the FC clear bit remains set until we
>>>>> process the last interrupt on the CPU 1 side in your scenario. Or
>>>>> am I wrong?  
>>>>
>>>> You're not wrong. You're agreeing with everything I've described.  :)
>>>>   
>>>
>>> I'm happy our understanding seems to converge! :)
>>>
>>> My problem is that you tie the denial of SSCH to outstanding interrupts
>>> ("C) A SSCH cannot be issued while an interrupt is outstanding") while
>>> the PoP says "Condition code 2 is set, and no other action is
>>> taken, when a start, halt, or clear function is currently
>>> in progress at the subchannel (see “Function Control
>>> (FC)” on page 16-22)".
>>>
>>> This may or man not be what you have actually implemented, but it is what
>>> you describe in this cover letter. Also patches 2 and 3 do the 
>>> serialization operate on activity controls instead of on the function
>>> controls (as described by the PoP).
>>
>> I have not really had the cycles yet to look at this series in detail,
>> but should our goal really be to mimic what the PoP talks about in
>> vfio-ccw (both kernel and userspace parts)? IMO, the important part is
>> that the guest sees a device that acts *towards the guest* in a way
>> that is compliant with the PoP; we can take different paths inside
>> vfio-ccw.
> 
> I agree with the principle, but I don't agree with some of the paths
> taken :)

Hey, I didn't agree with your FSM locking either, and then marked this
as an RFC.  :)

> 
> Moreover, each time we decide to not take the path that maps the PoP
> in the most straightforward fashion, we better be very careful.
> 
>>
>> (...)
>>
>>> BTW the whole notion of synchronous and asynchronous commands is IMHO
>>> broken. I tried to argue against it back then. If you read the PoP SSCH
>>> is asynchronous as well.
>>
>> I don't see where we ever stated that SSCH was synchronous. That would
>> be silly. The async region is called the async region because it is
>> used for some async commands (HSCH/CSCH), not because it is the only
>> way to do async commands. (The original idea was to extend the I/O
>> region for HSCH/CSCH, but that turned out to be awkward.)
> 
> What I don't understand is why are HALT and CLEAR async commands, and
> START a synchronous command, or a non-async command.

As I mentioned in my other response, I introduced this confusion. What I
wanted to convey was that it's a distinction between the io_region (for
SSCH) and the async_region (HSCH/CSCH) ... Writing to the former affects
the vfio-ccw FSM, and the writing to the latter does not. That is all I
wished to convey. I did not intend to convey how the operations are
architected in POPs nor how they work in hardware (or vfio-ccw, minus
the regions being used).

> 
>>
>> (...)
>>
>> I hope I can find more time to look at this next week, but as it will
>> be a short work week for me and I'm already swamped with various other
>> things, I fear that you should keep your expectations low.

These next handful of days are not great for me either, so no rush. It's
just the next thing on my list after Channel Path Handling.

>>
> 
> Thanks for chiming in. And please do complain when I talk silly.
> Unfortunately I'm no great communicator.
> 
> Regards,
> Halil
>
Halil Pasic May 15, 2020, 6:37 p.m. UTC | #8
On Fri, 15 May 2020 14:12:05 -0400
Eric Farman <farman@linux.ibm.com> wrote:

> >>> Also why do we see the scenario you describe in the wild? I agree that
> >>> this should be taken care of in the kernel as well, but according to my
> >>> understanding QEMU is already supposed to reject the second SSCH (CPU 2)
> >>> with cc 2 because it sees that FC clear function is set. Or?  
> >>
> >> Maybe for virtio, but for vfio this all gets passed through to the
> >> kernel who makes that distinction. And as I've mentioned above, that's
> >> not happening.  
> > 
> > Let's have a look at the following qemu functions. AFAIK it is
> > common to vfio and virtio, or? Will prefix my inline   
> 
> My mistake, I didn't look far enough up the callchain in my quick look
> at the code.
> 
> ...snip...
> 

No problem. I'm glad I was at least little helpful.

> > 
> > So unless somebody (e.g. the kernel vfio-ccw) nukes the FC bits qemu
> > should prevent the second SSCH from your example getting to the kernel,
> > or?  
> 
> It's not so much something "nukes the FC bits" ... but rather that that
> the data in the irb_area of the io_region is going to reflect what the
> subchannel told us for the interrupt.

This is why the word composition came into my mind. If the HW subchannel
has FC clear, but QEMU subchannel does not the way things compose (or
superpose) is fishy.

> 
> Hrm... If something is polling on TSCH instead of waiting for a tap on
> the shoulder, that's gonna act weird too. Maybe the bits need to be in
> io_region.irb_area proper, rather than this weird private->scsw space.

Do we agree that the scenario you described with that diagram should not
have hit kernel in the first place, because if things were correct QEMU
should have fenced the second SSCH?

I think you do, but want to be sure. If not, then we need to meditate
some more on this.

I do tend to think that the kernel part is not supposed to rely on
userspace playing nice. Especially when it comes to integrity and
correctness. I can't tell  just yet if this is something we must
or just can catch in the kernel module. I'm for catching it regardless,
but I'm even more for everything working as it is supposed. :)

Regards,
Halil
Halil Pasic May 15, 2020, 7:35 p.m. UTC | #9
On Fri, 15 May 2020 14:12:05 -0400
Eric Farman <farman@linux.ibm.com> wrote:

> 
> 
> On 5/15/20 10:55 AM, Halil Pasic wrote:
> > On Fri, 15 May 2020 09:09:47 -0400
> > Eric Farman <farman@linux.ibm.com> wrote:
> > 
> >>
> >>
> >> On 5/14/20 9:46 AM, Halil Pasic wrote:
> >>> On Wed, 13 May 2020 16:29:30 +0200
> >>> Eric Farman <farman@linux.ibm.com> wrote:
> >>>
> >>>> Hi Conny,
> >>>>
> >>>> Back in January, I suggested a small patch [1] to try to clean up
> >>>> the handling of HSCH/CSCH interrupts, especially as it relates to
> >>>> concurrent SSCH interrupts. Here is a new attempt to address this.
> >>>>
> >>>> There was some suggestion earlier about locking the FSM, but I'm not
> >>>> seeing any problems with that. Rather, what I'm noticing is that the
> >>>> flow between a synchronous START and asynchronous HALT/CLEAR have
> >>>> different impacts on the FSM state. Consider:
> >>>>
> >>>>     CPU 1                           CPU 2
> >>>>
> >>>>     SSCH (set state=CP_PENDING)
> >>>>     INTERRUPT (set state=IDLE)
> >>>>     CSCH (no change in state)
> >>>>                                     SSCH (set state=CP_PENDING)
> >>>>     INTERRUPT (set state=IDLE)
> >>>>                                     INTERRUPT (set state=IDLE)
> >>>
> >>> How does the PoP view of the composite device look like 
> >>> (where composite device = vfio-ccw + host device)?
> >>
> >> (Just want to be sure that "composite device" is a term that you're
> >> creating, because it's not one I'm familiar with.)
> > 
> > Yes I created this term because I'm unaware of an established one, and
> > I could not come up with a better one. If you have something established
> > or better please do tell, I will start using that.
> > 
> >>
> >> In today's code, there isn't a "composite device" that contains
> >> POPs-defined state information like we find in, for example, the host
> >> SCHIB, but that incorporates vfio-ccw state. This series (in a far more
> >> architecturally valid, non-RFC, form) would get us closer to that.
> >>
> > 
> > I think it is very important to start thinking about the ccw devices
> > that are exposed to the guest as a "composite device" in a sense that
> > the (passed-through) host ccw device is wrapped by vfio-ccw. For instance
> > the guest sees the SCHIB exposed by the vfio-ccw wrapper (adaptation
> > code in qemu and the kernel module), and not the SCHIB of the host
> > device.
> 
> Well, I think of "ccw devices that are exposed to the guest" as
> "vfio-ccw devices" ... They look/smell/sound like CCW devices to the
> guest, but have some pieces emulated by QEMU and some passed down to the
> host kernel (and/or the device itself). I'm not sure what a "composite
> device" buys us in any other context.

Sorry I will try to use the "vfio-ccw device" as a terminology for
this from now on. I wasn't looking at this for a while. I was worried
that it could be confused to something with a narrower scope (like the
mdev, or the bits in QEMU). Terminology is difficult also because we
technically actually do  subchannel pass-through. Please bear with
me when I get lost.

> 
> > 
> > This series definitely has some progressive thoughts. It just that
> > IMHO we sould to be more radical about this. And yes I'm a bit
> > frustrated.
> 
> "we sould to be more radical about this" ... A little help please?
> 

Make sure already the QEMU part catches this (see below). Make sure that
we don't have a hodge-podge of FMS and non-FSM states and state
transitions, and that the userspace/kernel interface is sane and
documented. 

I mean, not maintaining a copy of the SCHIB in QEMU was a design option,
and doing it was a design decision AFAIR.

Don't misunderstand me, I'm not blaming you for anything. Just trying
to bring in my perspective. 

> > 
> >>>
> >>> I suppose at the moment where we accept the CSCH the FC bit
> >>> indicated clear function (19) goes to set. When this happens
> >>> there are 2 possibilities: either the start (17) bit is set,
> >>> or it is not. You describe a scenario where the start bit is
> >>> not set. In that case we don't have a channel program that
> >>> is currently being processed, and any SSCH must bounce right
> >>> off without doing anything (with cc 2) as long as FC clear
> >>> is set. Please note that we are still talking about the composite
> >>> device here.
> >>
> >> Correct. Though, whether the START function control bit is currently set
> >> is immaterial to a CLEAR function; that's the biggest recovery action we
> >> have at our disposal, and will always go through.
> >>
> >> The point is that there is nothing to prevent the START on CPU 2 from
> >> going through. The CLEAR does not affect the FSM today, and doesn't
> >> record a FC CLEAR bit within vfio-ccw, and so we're only relying on the
> >> return code from the SSCH instruction to give us cc0 vs cc2 (or
> >> whatever). The actual results of that will depend, since the CPU may
> >> have presented the interrupt for the CLEAR (via the .irq callback and
> >> thus FSM VFIO_CCW_EVENT_INTERRUPT) and thus a new START is perfectly
> >> legal from its point of view. Since vfio-ccw may not have unstacked the
> >> work it placed to finish up that interrupt handler means we have a problem.
> >>
> >>>
> >>> Thus in my reading CPU2 making the IDLE -> CP_PENDING transition
> >>> happen is already wrong. We should have rejected to look at the
> >>> channel program in the first place. Because as far as I can tell
> >>> for the composite device the FC clear bit remains set until we
> >>> process the last interrupt on the CPU 1 side in your scenario. Or
> >>> am I wrong?
> >>
> >> You're not wrong. You're agreeing with everything I've described.  :)
> >>
> > 
> > I'm happy our understanding seems to converge! :)
> > 
> > My problem is that you tie the denial of SSCH to outstanding interrupts
> > ("C) A SSCH cannot be issued while an interrupt is outstanding") while
> > the PoP says "Condition code 2 is set, and no other action is
> > taken, when a start, halt, or clear function is currently
> > in progress at the subchannel (see “Function Control
> > (FC)” on page 16-22)".
> > 
> > This may or man not be what you have actually implemented, but it is what
> > you describe in this cover letter. Also patches 2 and 3 do the 
> > serialization operate on activity controls instead of on the function
> > controls (as described by the PoP).
> 
> You are conflating two issues here...
> 
> 1) I do use the ACTL bits in this RFC, which is almost certainly wrong
> according to the POPs. But as I suggest here in this cover letter as
> well as the commit message for later patches, that's irrelevant. We're
> not reflecting these particular bits (or anything else in this SCSW)
> back in any way/shape/form today, and they are not used for any other
> decision making. I could signal the existence of another operation by
> the FSM, or via three non-architected flags, or via the (correct)
> architected flags.

I got that. But please consider that code is read by humans. And if the
human sees ACTL he is likely to think ACTL.

> 
> 2) The denial of the SSCH. You quote the POPs above which says "no other
> action is taken, when a start, halt, or clear function is currently in
> progress at the subchannel" ... But in this case, vfio-ccw has to
> supplement the role of the subchannel. It's NOT done processing the
> interrupt, even though the device, subchannel, and (host) CPU think it
> all is.

I think we are talking about the same. The words of the PoP need to
be interpreted with respect to the subchannel as seen by the guest.

Each entity in the stack is basically manipulating a subchannel. The
guest certainly, the API exposed to the userspace (QEMU) is
essentially also about manipulatiting a subchannel, and the kernel
manipulates the passed through subchannel (and the passed through dasd
sits behind the passed through subchannel).

What is the relationship between these at any point in time 
is the thing that needs to be well understood (and well documented).

[..]

> >>> Implementing the FSM described in the PoP (which in turn
> >>> conceptually relies on atomic operations on the FC bits) is IMHO
> >>> the way to go. But we can track that info in our FSM states. In
> >>> fact our FSM states should just add sub-partitioning of states to
> >>> those states (if necessary).
> >>
> >> I'm not prepared to rule this out, as I originally stated, but I'm not
> >> thrilled with this idea. Today, we have FSM events for an IO (START) and
> >> asynchronous commands (HALT and CLEAR), and we have FSM states for the
> >> different stages of a START operation. Making asynchronous commands
> >> affect the FSM state isn't too big of a problem, but what happens if we
> >> expand the asynchronous support to handle other commands, such as CANCEL
> >> or RESUME? They don't have FC bits of their own to map into an FSM, and
> >> (just like HALT/CLEAR) have some reliance on the fctl/actl/stctl bits of
> >> the SCHIB.
> >>
> > 
> > Well, fctl/actl/stctl can be seen as a possible representation of states
> > and state transitions. For example for RESUME the PoP says
> > "Condition code 2 is set, and no other action is
> > taken, when the resume function is not applicable.
> > The resume function is not applicable when the sub-
> > channel (1) has any function other than the start
> > function alone specified, (2) has no function speci-
> > fied, (3) is resume pending, or (4) does not have sus-
> > pend control specified for the start function in
> > progress"
> > 
> > What I'm trying to say is, there is a state machine described in the PoP,
> > and there the state transitions are marked by interlocked updates of bits
> > in fctl/actl/stctl. I don't know how many bits relevant for the state
> > machine are, but I'm pretty confident they can be packaged up in a
> > quadword. So if we want we can represent that stuff with a state variable
> > of ours.
> > 
> > BTW the whole notion of synchronous and asynchronous commands is IMHO
> > broken. I tried to argue against it back then. If you read the PoP SSCH
> > is asynchronous as well.
> 
> That is my fault. I was only trying to distinguish the existing
> io_region used for SSCH and the new async_region used for HSCH and CSCH.
> 

Not your fault. It is mine. I never understood these name and they
make me nervous.

Regards,
Halil
Cornelia Huck May 18, 2020, 4:09 p.m. UTC | #10
On Wed, 13 May 2020 16:29:30 +0200
Eric Farman <farman@linux.ibm.com> wrote:

> Hi Conny,
> 
> Back in January, I suggested a small patch [1] to try to clean up
> the handling of HSCH/CSCH interrupts, especially as it relates to
> concurrent SSCH interrupts. Here is a new attempt to address this.
> 
> There was some suggestion earlier about locking the FSM, but I'm not
> seeing any problems with that. Rather, what I'm noticing is that the
> flow between a synchronous START and asynchronous HALT/CLEAR have
> different impacts on the FSM state. Consider:
> 
>     CPU 1                           CPU 2
> 
>     SSCH (set state=CP_PENDING)
>     INTERRUPT (set state=IDLE)
>     CSCH (no change in state)
>                                     SSCH (set state=CP_PENDING)

This is the transition I do not understand. When we get a request via
the I/O area, we go to CP_PROCESSING and start doing translations.
However, we only transition to CP_PENDING if we actually do a SSCH with
cc 0 -- which shouldn't be possible in the flow you outline... unless
it really is something that can be taken care of with locking (state
machine transitioning due to an interrupt without locking, so we go to
IDLE without other parts noticing.)

>     INTERRUPT (set state=IDLE)
>                                     INTERRUPT (set state=IDLE)

But taking a step back (and ignoring your series and the discussion,
sorry about that):

We need to do something (creating a local translation of the guest's
channel program) that does not have any relation to the process in the
architecture at all, but is only something that needs to be done
because of what vfio-ccw is trying to do (issuing a channel program on
behalf of another entity.) Trying to sort that out by poking at actl
and fctl bits does not seem like the best way; especially as keeping
the bits up-to-date via STSCH is an exercise in futility.

What about the following (and yes, I had suggested something vaguely in
that direction before):

- Detach the cp from the subchannel (or better, remove the 1:1
  relationship). By that I mean building the cp as a separately
  allocated structure (maybe embedding a kref, but that might not be
  needed), and appending it to a list after SSCH with cc=0. Discard it
  if cc!=0.
- Remove the CP_PENDING state. The state is either IDLE after any
  successful SSCH/HSCH/CSCH, or a new state in that case. But no
  special state for SSCH.
- A successful CSCH removes the first queued request, if any.
- A final interrupt removes the first queued request, if any.

Thoughts?
Eric Farman May 18, 2020, 9:57 p.m. UTC | #11
On 5/18/20 12:09 PM, Cornelia Huck wrote:
> On Wed, 13 May 2020 16:29:30 +0200
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> Hi Conny,
>>
>> Back in January, I suggested a small patch [1] to try to clean up
>> the handling of HSCH/CSCH interrupts, especially as it relates to
>> concurrent SSCH interrupts. Here is a new attempt to address this.
>>
>> There was some suggestion earlier about locking the FSM, but I'm not
>> seeing any problems with that. Rather, what I'm noticing is that the
>> flow between a synchronous START and asynchronous HALT/CLEAR have
>> different impacts on the FSM state. Consider:
>>
>>     CPU 1                           CPU 2
>>
>>     SSCH (set state=CP_PENDING)
>>     INTERRUPT (set state=IDLE)
>>     CSCH (no change in state)
>>                                     SSCH (set state=CP_PENDING)
> 
> This is the transition I do not understand. When we get a request via
> the I/O area, we go to CP_PROCESSING and start doing translations.
> However, we only transition to CP_PENDING if we actually do a SSCH with
> cc 0 -- which shouldn't be possible in the flow you outline... unless
> it really is something that can be taken care of with locking (state
> machine transitioning due to an interrupt without locking, so we go to
> IDLE without other parts noticing.)

I'm only going by what the (existing and my temporary) tea leaves in
s390dbf are telling us. :)

> 
>>     INTERRUPT (set state=IDLE)

Part of the problem is that this is actually comprised of these elements:

if (irb_is_final && state == CP_PENDING)
	cp_free()

lock io_mutex
copy irb to io_region
unlock io_mutex

if (irb_is_final)
	state = IDLE

The CP_PENDING check will protect us if a SSCH is still being built at
the time we execute this code. But if we got to CP_PENDING first
(between fsm_irq() stacking to the workqueue and us unstacking
vfio_ccw_sch_io_todo()), we would free an unrelated operation. (This was
the scenario in the first version of my fix back in January.)

We can't add a CP_PENDING check after the io_mutex barrier, because if a
second SSCH is being processed, we will hang on the lock acquisition and
will DEFINITELY be in CP_PENDING state when we come back. But by that
point, we will have skipped freeing the (now active) CP but are back in
an IDLE state.


>>                                     INTERRUPT (set state=IDLE)
> 
> But taking a step back (and ignoring your series and the discussion,
> sorry about that):

No apologies necessary.

> 
> We need to do something (creating a local translation of the guest's
> channel program) that does not have any relation to the process in the
> architecture at all, but is only something that needs to be done
> because of what vfio-ccw is trying to do (issuing a channel program on
> behalf of another entity.) Trying to sort that out by poking at actl
> and fctl bits does not seem like the best way; especially as keeping
> the bits up-to-date via STSCH is an exercise in futility.

I am coming to strongly agree with this sentiment.

> 
> What about the following (and yes, I had suggested something vaguely in
> that direction before):
> 
> - Detach the cp from the subchannel (or better, remove the 1:1
>   relationship). By that I mean building the cp as a separately
>   allocated structure (maybe embedding a kref, but that might not be
>   needed), and appending it to a list after SSCH with cc=0. Discard it
>   if cc!=0.
> - Remove the CP_PENDING state. The state is either IDLE after any
>   successful SSCH/HSCH/CSCH, or a new state in that case. But no
>   special state for SSCH.
> - A successful CSCH removes the first queued request, if any.
> - A final interrupt removes the first queued request, if any.
> 
> Thoughts?
> 

I'm cautiously optimistic, for exactly the reason I mention above. If we
always expect to be in IDLE state once an interrupt arrives, we can just
rely on determining if the interrupt is in relation to an actual
operation we're waiting on. I'll give this a try and report back.
Eric Farman May 18, 2020, 10:01 p.m. UTC | #12
On 5/15/20 2:37 PM, Halil Pasic wrote:
> On Fri, 15 May 2020 14:12:05 -0400
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>>>>> Also why do we see the scenario you describe in the wild? I agree that
>>>>> this should be taken care of in the kernel as well, but according to my
>>>>> understanding QEMU is already supposed to reject the second SSCH (CPU 2)
>>>>> with cc 2 because it sees that FC clear function is set. Or?  
>>>>
>>>> Maybe for virtio, but for vfio this all gets passed through to the
>>>> kernel who makes that distinction. And as I've mentioned above, that's
>>>> not happening.  
>>>
>>> Let's have a look at the following qemu functions. AFAIK it is
>>> common to vfio and virtio, or? Will prefix my inline   
>>
>> My mistake, I didn't look far enough up the callchain in my quick look
>> at the code.
>>
>> ...snip...
>>
> 
> No problem. I'm glad I was at least little helpful.
> 
>>>
>>> So unless somebody (e.g. the kernel vfio-ccw) nukes the FC bits qemu
>>> should prevent the second SSCH from your example getting to the kernel,
>>> or?  
>>
>> It's not so much something "nukes the FC bits" ... but rather that that
>> the data in the irb_area of the io_region is going to reflect what the
>> subchannel told us for the interrupt.
> 
> This is why the word composition came into my mind. If the HW subchannel
> has FC clear, but QEMU subchannel does not the way things compose (or
> superpose) is fishy.
> 
>>
>> Hrm... If something is polling on TSCH instead of waiting for a tap on
>> the shoulder, that's gonna act weird too. Maybe the bits need to be in
>> io_region.irb_area proper, rather than this weird private->scsw space.
> 
> Do we agree that the scenario you described with that diagram should not
> have hit kernel in the first place, because if things were correct QEMU
> should have fenced the second SSCH?
> 
> I think you do, but want to be sure. If not, then we need to meditate
> some more on this.

I think I do too.  :)  I'll meditate on this a bit later, because...

> 
> I do tend to think that the kernel part is not supposed to rely on
> userspace playing nice.

...this is important, and I'd rather get the kernel buttoned up first
before sorting out QEMU.

 Especially when it comes to integrity and
> correctness. I can't tell  just yet if this is something we must
> or just can catch in the kernel module. I'm for catching it regardless,
> but I'm even more for everything working as it is supposed. :)
> 
> Regards,
> Halil
>
Halil Pasic May 18, 2020, 10:09 p.m. UTC | #13
On Mon, 18 May 2020 18:09:03 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 13 May 2020 16:29:30 +0200
> Eric Farman <farman@linux.ibm.com> wrote:
> 
> > Hi Conny,
> > 
> > Back in January, I suggested a small patch [1] to try to clean up
> > the handling of HSCH/CSCH interrupts, especially as it relates to
> > concurrent SSCH interrupts. Here is a new attempt to address this.
> > 
> > There was some suggestion earlier about locking the FSM, but I'm not
> > seeing any problems with that. Rather, what I'm noticing is that the
> > flow between a synchronous START and asynchronous HALT/CLEAR have
> > different impacts on the FSM state. Consider:
> > 
> >     CPU 1                           CPU 2
> > 
> >     SSCH (set state=CP_PENDING)
> >     INTERRUPT (set state=IDLE)
> >     CSCH (no change in state)
> >                                     SSCH (set state=CP_PENDING)
> 
> This is the transition I do not understand. When we get a request via
> the I/O area, we go to CP_PROCESSING and start doing translations.
> However, we only transition to CP_PENDING if we actually do a SSCH with
> cc 0 -- which shouldn't be possible in the flow you outline... unless
> it really is something that can be taken care of with locking (state
> machine transitioning due to an interrupt without locking, so we go to
> IDLE without other parts noticing.)

I argued, that the second SSCH is to be caught by QEMU. So I think
we are kind of on the same page, and yet when it comes to details
we are not.

The details: We have multiple non-atomic things going on
* the clear function FC gets set at the host subchannel 
* the clear function completes and the subchannel becomes status pending
* an interrupt is delivered that indicates the subchannel event
* the interrupt handler gets invoked
* STSCH does its thing
* state is set to IDLE

So theoretically, between STSCH is done (and cleared FC clear bit)
and state=IDLE an SSCH can go through.


> 
> >     INTERRUPT (set state=IDLE)
> >                                     INTERRUPT (set state=IDLE)
> 
> But taking a step back (and ignoring your series and the discussion,
> sorry about that):
> 
> We need to do something (creating a local translation of the guest's
> channel program) that does not have any relation to the process in the
> architecture at all, but is only something that needs to be done
> because of what vfio-ccw is trying to do (issuing a channel program on
> behalf of another entity.) 

I violently disagree with this point. Looking at the whole vfio-ccw
device the translation is part of the execution of the channel program,
more specifically it fits in as prefetching. Thus it needs to happen
with the FC start bit set. Before FC start is set the subchannel is
not allowed to process (including look at) the channel program. At least
that is what I remember.

> Trying to sort that out by poking at actl
> and fctl bits does not seem like the best way; especially as keeping
> the bits up-to-date via STSCH is an exercise in futility.

I disagree. A single subchannel is processing at most one channel
program at any given point in time. Or am I reading the PoP wrong?

> 
> What about the following (and yes, I had suggested something vaguely in
> that direction before):
> 
> - Detach the cp from the subchannel (or better, remove the 1:1
>   relationship). By that I mean building the cp as a separately
>   allocated structure (maybe embedding a kref, but that might not be
>   needed), and appending it to a list after SSCH with cc=0. Discard it
>   if cc!=0.
> - Remove the CP_PENDING state. The state is either IDLE after any
>   successful SSCH/HSCH/CSCH, or a new state in that case. But no
>   special state for SSCH.
> - A successful CSCH removes the first queued request, if any.
> - A final interrupt removes the first queued request, if any.
> 
> Thoughts?
> 

See above. IMHO the second SSCH is to be rejected by QEMU. I've
explained this in more detail in my previous mail.

Regards,
Halil
Cornelia Huck May 19, 2020, 11:23 a.m. UTC | #14
On Mon, 18 May 2020 17:57:39 -0400
Eric Farman <farman@linux.ibm.com> wrote:

> On 5/18/20 12:09 PM, Cornelia Huck wrote:
> > On Wed, 13 May 2020 16:29:30 +0200
> > Eric Farman <farman@linux.ibm.com> wrote:
> >   
> >> Hi Conny,
> >>
> >> Back in January, I suggested a small patch [1] to try to clean up
> >> the handling of HSCH/CSCH interrupts, especially as it relates to
> >> concurrent SSCH interrupts. Here is a new attempt to address this.
> >>
> >> There was some suggestion earlier about locking the FSM, but I'm not
> >> seeing any problems with that. Rather, what I'm noticing is that the
> >> flow between a synchronous START and asynchronous HALT/CLEAR have
> >> different impacts on the FSM state. Consider:
> >>
> >>     CPU 1                           CPU 2
> >>
> >>     SSCH (set state=CP_PENDING)
> >>     INTERRUPT (set state=IDLE)
> >>     CSCH (no change in state)
> >>                                     SSCH (set state=CP_PENDING)  
> > 
> > This is the transition I do not understand. When we get a request via
> > the I/O area, we go to CP_PROCESSING and start doing translations.
> > However, we only transition to CP_PENDING if we actually do a SSCH with
> > cc 0 -- which shouldn't be possible in the flow you outline... unless
> > it really is something that can be taken care of with locking (state
> > machine transitioning due to an interrupt without locking, so we go to
> > IDLE without other parts noticing.)  
> 
> I'm only going by what the (existing and my temporary) tea leaves in
> s390dbf are telling us. :)

/me makes a note to try tea leaves :)

> 
> >   
> >>     INTERRUPT (set state=IDLE)  
> 
> Part of the problem is that this is actually comprised of these elements:
> 
> if (irb_is_final && state == CP_PENDING)
> 	cp_free()
> 
> lock io_mutex
> copy irb to io_region
> unlock io_mutex
> 
> if (irb_is_final)
> 	state = IDLE
> 
> The CP_PENDING check will protect us if a SSCH is still being built at
> the time we execute this code. But if we got to CP_PENDING first
> (between fsm_irq() stacking to the workqueue and us unstacking
> vfio_ccw_sch_io_todo()), we would free an unrelated operation. (This was
> the scenario in the first version of my fix back in January.)
> 
> We can't add a CP_PENDING check after the io_mutex barrier, because if a
> second SSCH is being processed, we will hang on the lock acquisition and
> will DEFINITELY be in CP_PENDING state when we come back. But by that
> point, we will have skipped freeing the (now active) CP but are back in
> an IDLE state.

That's all very ugly :(

> 
> 
> >>                                     INTERRUPT (set state=IDLE)  
> > 
> > But taking a step back (and ignoring your series and the discussion,
> > sorry about that):  
> 
> No apologies necessary.
> 
> > 
> > We need to do something (creating a local translation of the guest's
> > channel program) that does not have any relation to the process in the
> > architecture at all, but is only something that needs to be done
> > because of what vfio-ccw is trying to do (issuing a channel program on
> > behalf of another entity.) Trying to sort that out by poking at actl
> > and fctl bits does not seem like the best way; especially as keeping
> > the bits up-to-date via STSCH is an exercise in futility.  
> 
> I am coming to strongly agree with this sentiment.

Thank you for making me feel like I'm not completely out in the weeds :)

> 
> > 
> > What about the following (and yes, I had suggested something vaguely in
> > that direction before):
> > 
> > - Detach the cp from the subchannel (or better, remove the 1:1
> >   relationship). By that I mean building the cp as a separately
> >   allocated structure (maybe embedding a kref, but that might not be
> >   needed), and appending it to a list after SSCH with cc=0. Discard it
> >   if cc!=0.
> > - Remove the CP_PENDING state. The state is either IDLE after any
> >   successful SSCH/HSCH/CSCH, or a new state in that case. But no
> >   special state for SSCH.
> > - A successful CSCH removes the first queued request, if any.
> > - A final interrupt removes the first queued request, if any.
> > 
> > Thoughts?
> >   
> 
> I'm cautiously optimistic, for exactly the reason I mention above. If we
> always expect to be in IDLE state once an interrupt arrives, we can just
> rely on determining if the interrupt is in relation to an actual
> operation we're waiting on. I'll give this a try and report back.

Great, good luck!
Cornelia Huck May 19, 2020, 11:36 a.m. UTC | #15
On Tue, 19 May 2020 00:09:43 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Mon, 18 May 2020 18:09:03 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:

> > But taking a step back (and ignoring your series and the discussion,
> > sorry about that):
> > 
> > We need to do something (creating a local translation of the guest's
> > channel program) that does not have any relation to the process in the
> > architecture at all, but is only something that needs to be done
> > because of what vfio-ccw is trying to do (issuing a channel program on
> > behalf of another entity.)   
> 
> I violently disagree with this point. Looking at the whole vfio-ccw
> device the translation is part of the execution of the channel program,
> more specifically it fits in as prefetching. Thus it needs to happen
> with the FC start bit set. Before FC start is set the subchannel is
> not allowed to process (including look at) the channel program. At least
> that is what I remember.

I fear we really have to agree to disagree here. The PoP describes how
a SSCH etc. has to be done and what reaction to expect. It does not
cover the 'SSCH on behalf of someone else' pattern: only what we can
expect from that second SSCH, and what the guest can expect from us. In
particular, the PoP does not specify anything about how a hypervisor is
supposed to handle I/O from its guests (and why should it?)

> 
> > Trying to sort that out by poking at actl
> > and fctl bits does not seem like the best way; especially as keeping
> > the bits up-to-date via STSCH is an exercise in futility.  
> 
> I disagree. A single subchannel is processing at most one channel
> program at any given point in time. Or am I reading the PoP wrong?

The hypervisor cannot know the exact status of the subchannel. It only
knows the state of the subchannel at the time it issued its last STSCH.
Anything else it needs to track is the hypervisor's business, and
ideally, it should track that in its own control structures. (I know we
muck around with the control bits today; but maybe that's not the best
idea.)

> 
> > 
> > What about the following (and yes, I had suggested something vaguely in
> > that direction before):
> > 
> > - Detach the cp from the subchannel (or better, remove the 1:1
> >   relationship). By that I mean building the cp as a separately
> >   allocated structure (maybe embedding a kref, but that might not be
> >   needed), and appending it to a list after SSCH with cc=0. Discard it
> >   if cc!=0.
> > - Remove the CP_PENDING state. The state is either IDLE after any
> >   successful SSCH/HSCH/CSCH, or a new state in that case. But no
> >   special state for SSCH.
> > - A successful CSCH removes the first queued request, if any.
> > - A final interrupt removes the first queued request, if any.
> > 
> > Thoughts?
> >   
> 
> See above. IMHO the second SSCH is to be rejected by QEMU. I've
> explained this in more detail in my previous mail.

I don't think we should rely on whatever QEMU is or isn't doing. We
should not get all tangled up if userspace is doing weird stuff.
Halil Pasic May 19, 2020, 12:10 p.m. UTC | #16
On Tue, 19 May 2020 13:36:06 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 19 May 2020 00:09:43 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
> > On Mon, 18 May 2020 18:09:03 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > > But taking a step back (and ignoring your series and the discussion,
> > > sorry about that):
> > > 
> > > We need to do something (creating a local translation of the guest's
> > > channel program) that does not have any relation to the process in the
> > > architecture at all, but is only something that needs to be done
> > > because of what vfio-ccw is trying to do (issuing a channel program on
> > > behalf of another entity.)   
> > 
> > I violently disagree with this point. Looking at the whole vfio-ccw
> > device the translation is part of the execution of the channel program,
> > more specifically it fits in as prefetching. Thus it needs to happen
> > with the FC start bit set. Before FC start is set the subchannel is
> > not allowed to process (including look at) the channel program. At least
> > that is what I remember.
> 
> I fear we really have to agree to disagree here. 

So you say the subchannel is allowed to poke around before FC start is
set? I'm not sure what do you disagree with.

> The PoP describes how
> a SSCH etc. has to be done and what reaction to expect. It does not
> cover the 'SSCH on behalf of someone else' pattern: only what we can
> expect from that second SSCH, and what the guest can expect from us.

Yes. The PoP says that the guest can expect from us to not process the
submitted channel program if the subchannel is already busy. For this
whether the ccw device is fully emulated, or if it is a pass through
with 'SSCH on behalf' it does not matter. From the guest perspective
the 'SSCH on behalf' is just an implementation detail, right?

> In
> particular, the PoP does not specify anything about how a hypervisor is
> supposed to handle I/O from its guests (and why should it?)
>

Except that the exposed interface (towards the guest OS) needs to conform
to the pop. Including the point I cited.
 
> > 
> > > Trying to sort that out by poking at actl
> > > and fctl bits does not seem like the best way; especially as keeping
> > > the bits up-to-date via STSCH is an exercise in futility.  
> > 
> > I disagree. A single subchannel is processing at most one channel
> > program at any given point in time. Or am I reading the PoP wrong?
> 
> The hypervisor cannot know the exact status of the subchannel. It only
> knows the state of the subchannel at the time it issued its last STSCH.

You didn't respond to 'one cp at any given time is in PoP'. Do we agree
about that at least?

What can the hypervisor know about the host subchannel and its status
is a different issue. I would argue that it actually knows enough but
we need to get the very basics straight first.

But this point the hypervisor does not need to know about the exact
state of the host subchannel, it needs to know about the state of the
interface it is exposing (to the guest, or to the userspace).

> Anything else it needs to track is the hypervisor's business, and
> ideally, it should track that in its own control structures.

We agree on this. And this is what I'm asking for, and also what Eric
did. Although I don't agree with the details. He misused the actl bits.
According to him these were unused and thus the hypervisor's own control
structure. He himself stated that this may not be the best way to do it.

> (I know we
> muck around with the control bits today; but maybe that's not the best
> idea.)
> 

What control bits do you mean?

> > 
> > > 
> > > What about the following (and yes, I had suggested something vaguely in
> > > that direction before):
> > > 
> > > - Detach the cp from the subchannel (or better, remove the 1:1
> > >   relationship). By that I mean building the cp as a separately
> > >   allocated structure (maybe embedding a kref, but that might not be
> > >   needed), and appending it to a list after SSCH with cc=0. Discard it
> > >   if cc!=0.
> > > - Remove the CP_PENDING state. The state is either IDLE after any
> > >   successful SSCH/HSCH/CSCH, or a new state in that case. But no
> > >   special state for SSCH.
> > > - A successful CSCH removes the first queued request, if any.
> > > - A final interrupt removes the first queued request, if any.
> > > 
> > > Thoughts?
> > >   
> > 
> > See above. IMHO the second SSCH is to be rejected by QEMU. I've
> > explained this in more detail in my previous mail.
> 
> I don't think we should rely on whatever QEMU is or isn't doing. We
> should not get all tangled up if userspace is doing weird stuff.

I agree 100%. And I was a vocal advocate of this all the time. We
need to avoid undefined behavior in the kernel regardless of what
userspace is doing.

But we can (and should) have expectations towards the userspace,
and refuse to do any further work, if we detect that userspace is
misbehaving.

Regards,
Halil


>
Cornelia Huck May 26, 2020, 9:55 a.m. UTC | #17
On Wed, 13 May 2020 16:29:30 +0200
Eric Farman <farman@linux.ibm.com> wrote:

> There was some suggestion earlier about locking the FSM, but I'm not
> seeing any problems with that. Rather, what I'm noticing is that the
> flow between a synchronous START and asynchronous HALT/CLEAR have
> different impacts on the FSM state. Consider:
> 
>     CPU 1                           CPU 2
> 
>     SSCH (set state=CP_PENDING)
>     INTERRUPT (set state=IDLE)
>     CSCH (no change in state)
>                                     SSCH (set state=CP_PENDING)
>     INTERRUPT (set state=IDLE)
>                                     INTERRUPT (set state=IDLE)

A different question (not related to how we want to fix this): How
easily can you trigger this bug? Is this during normal testing with a
bit of I/O stress, or do you have a special test case?
Eric Farman May 26, 2020, 11:08 a.m. UTC | #18
On 5/26/20 5:55 AM, Cornelia Huck wrote:
> On Wed, 13 May 2020 16:29:30 +0200
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> There was some suggestion earlier about locking the FSM, but I'm not
>> seeing any problems with that. Rather, what I'm noticing is that the
>> flow between a synchronous START and asynchronous HALT/CLEAR have
>> different impacts on the FSM state. Consider:
>>
>>     CPU 1                           CPU 2
>>
>>     SSCH (set state=CP_PENDING)
>>     INTERRUPT (set state=IDLE)
>>     CSCH (no change in state)
>>                                     SSCH (set state=CP_PENDING)
>>     INTERRUPT (set state=IDLE)
>>                                     INTERRUPT (set state=IDLE)
> 
> A different question (not related to how we want to fix this): How
> easily can you trigger this bug? Is this during normal testing with a
> bit of I/O stress, or do you have a special test case?
> 

I have hit this with "normal testing with a bit of I/O stress" but it's
been maddeningly slow to repro (invariably when I'm not running with any
detailed traces enabled). So I expedite the process with the channel
path handling code, and this script running on the host:

while True:
    tempChpid = random.choice(chpids)
    tempFunction = random.choice(["-c", "-v"])

    doChzdev(tempFunction, "0", tempChpid)
    doSleep()

    doChzdev(tempFunction, "1", tempChpid)
    doSleep()