Message ID | 20210511195631.3995081-1-farman@linux.ibm.com (mailing list archive) |
---|---|
Headers | show |
Series | vfio-ccw: Fix interrupt handling for HALT/CLEAR | expand |
On Tue, 11 May 2021 21:56:28 +0200 Eric Farman <farman@linux.ibm.com> wrote: > Hi Conny, Matt, Halil, > > Here's one (last?) update to my proposal for handling the collision > between interrupts for START SUBCHANNEL and HALT/CLEAR SUBCHANNEL. > > Only change here is to include Conny's suggestions on patch 3. > > Thanks, I believe these changes are beneficial, although I don't understand everything about them. In that sense I'm happy with the these getting merged. Let me also spend some words answering the unasked question, what I'm not understanding about these. Not understanding how the problem stated in the cover letter of v4 is actually resolved is certainly the most important one. Let me cite the relevant part of it (your cover letter already contains a link to the full version). """ CPU 1 CPU 2 1 CLEAR SUBCHANNEL 2 fsm_irq() 3 START SUBCHANNEL 4 vfio_ccw_sch_io_todo() 5 fsm_irq() 6 vfio_ccw_sch_io_todo() From the channel subsystem's point of view the CLEAR SUBCHANNEL (step 1) is complete once step 2 is called, as the Interrupt Response Block (IRB) has been presented and the TEST SUBCHANNEL was driven by the cio layer. Thus, the START SUBCHANNEL (step 3) is submitted [1] and gets a cc=0 to indicate the I/O was accepted. However, step 2 stacks the bulk of the actual work onto a workqueue for when the subchannel lock is NOT held, and is unqueued at step 4. That code misidentifies the data in the IRB as being associated with the newly active I/O, and may release memory that is actively in use by the channel subsystem and/or device. Eww. """ The last sentence clearly states "may release memory that is actively used by ... the device", and I understood it refers to the invocation of cp_free() from vfio_ccw_sch_io_todo(). Patch 3 of this series does not change the conditions under which cp_free() is called. Looking at the cited diagram, since patch 3 changes things in vfio_ccw_sch_io_todo() it probably ain't affecting steps 1-3 and I understood the description so that bad free happens in step 4. My guess is that your change from patch 3 somehow via the fsm prevents the SSCH on CPU 2 (using the diagram) from being executed if it actually happens to be after vfio_ccw_sch_io_todo(). And patch 1 is supposed to prevent the SSCH on CPU2 from being executed in the depicted case because if there is a cp to free, then we would bail out form if we see it while processing the new IO request. In any case, I don't want to hold this up any further. Regards, Halil
On Thu, 2021-05-13 at 03:05 +0200, Halil Pasic wrote: > On Tue, 11 May 2021 21:56:28 +0200 > Eric Farman <farman@linux.ibm.com> wrote: > > > Hi Conny, Matt, Halil, > > > > Here's one (last?) update to my proposal for handling the collision > > between interrupts for START SUBCHANNEL and HALT/CLEAR SUBCHANNEL. > > > > Only change here is to include Conny's suggestions on patch 3. > > > > Thanks, > > I believe these changes are beneficial, although I don't understand > everything about them. In that sense I'm happy with the these getting > merged. > > Let me also spend some words answering the unasked question, what I'm > not understanding about these. > > Not understanding how the problem stated in the cover letter of v4 is > actually resolved is certainly the most important one. Per our phone call last week, one of Conny's suggestions from that particular version was related to vfio_ccw_sch_io_todo() and was giving me some difficulties. We all agreed that I should send what I had, and leave the other corner case(s) to be addressed later along with the broader serialization topic throughout the driver. That is still my intention, but I suspect that's where you are going here... (I realize I said "last?" at the top here. Poor decision on my part.) > Let me cite > the relevant part of it (your cover letter already contains a link to > the full version). > > """ > > CPU 1 CPU 2 > 1 CLEAR SUBCHANNEL > 2 fsm_irq() > 3 START SUBCHANNEL > 4 vfio_ccw_sch_io_todo() > 5 fsm_irq() > 6 vfio_ccw_sch_io_todo() > > From the channel subsystem's point of view the CLEAR SUBCHANNEL (step > 1) > is complete once step 2 is called, as the Interrupt Response Block > (IRB) > has been presented and the TEST SUBCHANNEL was driven by the cio > layer. > Thus, the START SUBCHANNEL (step 3) is submitted [1] and gets a cc=0 > to > indicate the I/O was accepted. However, step 2 stacks the bulk of the > actual work onto a workqueue for when the subchannel lock is NOT > held, > and is unqueued at step 4. That code misidentifies the data in the > IRB > as being associated with the newly active I/O, and may release memory > that is actively in use by the channel subsystem and/or device. Eww. > """ > > The last sentence clearly states "may release memory that is actively > used by ... the device", and I understood it refers to the invocation > of cp_free() from vfio_ccw_sch_io_todo(). Patch 3 of this series does > not change the conditions under which cp_free() is called. Correct. > > Looking at the cited diagram, since patch 3 changes things in > vfio_ccw_sch_io_todo() it probably ain't affecting steps 1-3 and > I understood the description so that bad free happens in step 4. You are correct that patch 3 touches vfio_ccw_sch_io_todo(), but it is not addressing the possibility of a bad free described in the old cover letter. The commit message for patch 3 describes pretty clearly the scenario in question. > > My guess is that your change from patch 3 somehow via the fsm > prevents > the SSCH on CPU 2 (using the diagram) from being executed if it > actually > happens to be after vfio_ccw_sch_io_todo(). That's an incorrect guess. The code in vfio_ccw_sch_io_todo() today says "If another CPU is building an I/O (FSM is CP_PROCESSING), or there is no CPU building an I/O (FSM is IDLE), then skip the cp_free() call." The change in patch 3 says that in that situation, it should also not adjust the FSM state because the interrupt being handled on CPU1 was unrelated (maybe it was for a HALT/CLEAR, maybe it was an unsolicited interrupt). The SSCH on CPU2 will still go on as expected. > And patch 1 is supposed to > prevent the SSCH on CPU2 from being executed in the depicted case > because > if there is a cp to free, then we would bail out form if we see it > while processing the new IO request. Not really. It's the FSM's job to prevent a second SSCH, and route to fsm_io_retry() or fsm_io_busy() as appropriate. But the scenario described by patch 3 in this series would leave the cp initialized, while also resetting the FSM back to IDLE. As such, the FSM was free to allow another SSCH in, which would then re-initialize the cp and orphan the existing (active) cp resources. With the application of patch 3, that concern isn't present, so the change in patch 1 is really a NOP. But it allows for consistency in how the cp_*() functions are working, and a safety valve should this situation show up another way. (We'll get trace data that says cp_init() bailed out, rather than going on as if nothing were wrong.) > > In any case, I don't want to hold this up any further. > Thanks for that. You are correct that there's still a potential issue here, in the handoff between fsm_irq() and vfio_ccw_sch_io_todo(), and another fsm_io_request() that would arrive between those points. But it's not anything that we haven't already discussed, and will hopefully begin discussing in the next couple of weeks. Thanks, Eric > Regards, > Halil
On Thu, 13 May 2021 14:33:20 -0400 Eric Farman <farman@linux.ibm.com> wrote: > > > > In any case, I don't want to hold this up any further. > > > > Thanks for that. You are correct that there's still a potential issue > here, in the handoff between fsm_irq() and vfio_ccw_sch_io_todo(), and > another fsm_io_request() that would arrive between those points. But > it's not anything that we haven't already discussed, and will hopefully > begin discussing in the next couple of weeks. Thanks for all the explanations and your patience. I know, I can be difficult when I'm at discomfort due to dissonances in my mental model of a certain problem or a certain solution. Will try to carve out some time to at least have a look at those as well. Have a nice weekend! Halil
On Tue, 11 May 2021 21:56:28 +0200 Eric Farman <farman@linux.ibm.com> wrote: > Hi Conny, Matt, Halil, > > Here's one (last?) update to my proposal for handling the collision > between interrupts for START SUBCHANNEL and HALT/CLEAR SUBCHANNEL. > > Only change here is to include Conny's suggestions on patch 3. > > Thanks, > Eric > > Changelog: > v5->v6: > - Add a block comment and rename variable in patch 3 [CH] > - Drop RFC tag [EF] > > v4->v5: > - Applied Conny's r-b to patches 1 and 3 > - Dropped patch 2 and 4 > - Use a "finished" flag in the interrupt completion path > > Previous versions: > v5: https://lore.kernel.org/kvm/20210510205646.1845844-1-farman@linux.ibm.com/ > v4: https://lore.kernel.org/kvm/20210413182410.1396170-1-farman@linux.ibm.com/ > v3: https://lore.kernel.org/kvm/20200616195053.99253-1-farman@linux.ibm.com/ > v2: https://lore.kernel.org/kvm/20200513142934.28788-1-farman@linux.ibm.com/ > v1: https://lore.kernel.org/kvm/20200124145455.51181-1-farman@linux.ibm.com/ > > Eric Farman (3): > vfio-ccw: Check initialized flag in cp_init() > vfio-ccw: Reset FSM state to IDLE inside FSM > vfio-ccw: Serialize FSM IDLE state with I/O completion > > drivers/s390/cio/vfio_ccw_cp.c | 4 ++++ > drivers/s390/cio/vfio_ccw_drv.c | 12 ++++++++++-- > drivers/s390/cio/vfio_ccw_fsm.c | 1 + > drivers/s390/cio/vfio_ccw_ops.c | 2 -- > 4 files changed, 15 insertions(+), 4 deletions(-) > Thanks, applied.