[2/4] xenbus: limit when state is forced to closed
diff mbox series

Message ID 20191205140123.3817-3-pdurrant@amazon.com
State Superseded
Headers show
Series
  • xen-blkback: support live update
Related show

Commit Message

Paul Durrant Dec. 5, 2019, 2:01 p.m. UTC
Only force state to closed in the case when the toolstack may need to
clean up. This can be detected by checking whether the state in xenstore
has been set to closing prior to device removal.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
 drivers/xen/xenbus/xenbus_probe.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Roger Pau Monne Dec. 9, 2019, 11:39 a.m. UTC | #1
On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> Only force state to closed in the case when the toolstack may need to
> clean up. This can be detected by checking whether the state in xenstore
> has been set to closing prior to device removal.

I'm not sure I see the point of this, I would expect that a failure to
probe or the removal of the device would leave the xenbus state as
closed, which is consistent with the actual driver state.

Can you explain what's the benefit of leaving a device without a
driver in such unknown state?

Thanks, Roger.
Juergen Gross Dec. 9, 2019, 11:55 a.m. UTC | #2
On 09.12.19 12:39, Roger Pau Monné wrote:
> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
>> Only force state to closed in the case when the toolstack may need to
>> clean up. This can be detected by checking whether the state in xenstore
>> has been set to closing prior to device removal.
> 
> I'm not sure I see the point of this, I would expect that a failure to
> probe or the removal of the device would leave the xenbus state as
> closed, which is consistent with the actual driver state.
> 
> Can you explain what's the benefit of leaving a device without a
> driver in such unknown state?

And more concerning: did you check that no frontend/backend is
relying on the closed state to be visible without closing having been
set before?


Juergen
Paul Durrant Dec. 9, 2019, 12:01 p.m. UTC | #3
> -----Original Message-----
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: 09 December 2019 11:39
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> > Only force state to closed in the case when the toolstack may need to
> > clean up. This can be detected by checking whether the state in xenstore
> > has been set to closing prior to device removal.
> 
> I'm not sure I see the point of this, I would expect that a failure to
> probe or the removal of the device would leave the xenbus state as
> closed, which is consistent with the actual driver state.
> 
> Can you explain what's the benefit of leaving a device without a
> driver in such unknown state?
> 

If probe fails then I think it should leave the state alone. If the state is moved to closed then basically you just killed that connection to the guest (as the frontend will normally close down when it sees this change) so, if the probe failure was due to a bug in blkback or, e.g., a transient resource issue then it's game over as far as that guest goes.
The ultimate goal here is PV backend re-load that is completely transparent to the guest. Modifying anything in xenstore compromises that so we need to be careful.

  Paul
Paul Durrant Dec. 9, 2019, 12:03 p.m. UTC | #4
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 09 December 2019 11:55
> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
> <pdurrant@amazon.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On 09.12.19 12:39, Roger Pau Monné wrote:
> > On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> >> Only force state to closed in the case when the toolstack may need to
> >> clean up. This can be detected by checking whether the state in
> xenstore
> >> has been set to closing prior to device removal.
> >
> > I'm not sure I see the point of this, I would expect that a failure to
> > probe or the removal of the device would leave the xenbus state as
> > closed, which is consistent with the actual driver state.
> >
> > Can you explain what's the benefit of leaving a device without a
> > driver in such unknown state?
> 
> And more concerning: did you check that no frontend/backend is
> relying on the closed state to be visible without closing having been
> set before?

Blkfront doesn't seem to mind and I believe the Windows PV drivers cope, but I don't really understand the comment since this patch is actually removing a case where the backend transitions directly to closed.

  Paul

> 
> 
> Juergen
Juergen Gross Dec. 9, 2019, 12:08 p.m. UTC | #5
On 09.12.19 13:03, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: 09 December 2019 11:55
>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
>> <pdurrant@amazon.com>
>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>> <boris.ostrovsky@oracle.com>
>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
>> closed
>>
>> On 09.12.19 12:39, Roger Pau Monné wrote:
>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
>>>> Only force state to closed in the case when the toolstack may need to
>>>> clean up. This can be detected by checking whether the state in
>> xenstore
>>>> has been set to closing prior to device removal.
>>>
>>> I'm not sure I see the point of this, I would expect that a failure to
>>> probe or the removal of the device would leave the xenbus state as
>>> closed, which is consistent with the actual driver state.
>>>
>>> Can you explain what's the benefit of leaving a device without a
>>> driver in such unknown state?
>>
>> And more concerning: did you check that no frontend/backend is
>> relying on the closed state to be visible without closing having been
>> set before?
> 
> Blkfront doesn't seem to mind and I believe the Windows PV drivers cope, but I don't really understand the comment since this patch is actually removing a case where the backend transitions directly to closed.

I'm not speaking of blkfront/blkback only, but of net, tpm, scsi, pvcall
etc. frontends/backends. After all you are modifying a function common
to all PV driver pairs.

You are removing a state switc to "closed" in case the state was _not_
"closing" before. So any PV driver reacting to "closed" of the other end
in case the previous state might not have been "closing" before is at
risk to misbehave with your patch.


Juergen
Paul Durrant Dec. 9, 2019, 12:19 p.m. UTC | #6
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 09 December 2019 12:09
> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> <roger.pau@citrix.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On 09.12.19 13:03, Durrant, Paul wrote:
> >> -----Original Message-----
> >> From: Jürgen Groß <jgross@suse.com>
> >> Sent: 09 December 2019 11:55
> >> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
> >> <pdurrant@amazon.com>
> >> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> Stefano
> >> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >> <boris.ostrovsky@oracle.com>
> >> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
> to
> >> closed
> >>
> >> On 09.12.19 12:39, Roger Pau Monné wrote:
> >>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> >>>> Only force state to closed in the case when the toolstack may need to
> >>>> clean up. This can be detected by checking whether the state in
> >> xenstore
> >>>> has been set to closing prior to device removal.
> >>>
> >>> I'm not sure I see the point of this, I would expect that a failure to
> >>> probe or the removal of the device would leave the xenbus state as
> >>> closed, which is consistent with the actual driver state.
> >>>
> >>> Can you explain what's the benefit of leaving a device without a
> >>> driver in such unknown state?
> >>
> >> And more concerning: did you check that no frontend/backend is
> >> relying on the closed state to be visible without closing having been
> >> set before?
> >
> > Blkfront doesn't seem to mind and I believe the Windows PV drivers cope,
> but I don't really understand the comment since this patch is actually
> removing a case where the backend transitions directly to closed.
> 
> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi, pvcall
> etc. frontends/backends. After all you are modifying a function common
> to all PV driver pairs.
> 
> You are removing a state switc to "closed" in case the state was _not_
> "closing" before.

Yes, which AFAIK is against the intention of the generic PV protocol such that it ever existed anyway.

> So any PV driver reacting to "closed" of the other end
> in case the previous state might not have been "closing" before is at
> risk to misbehave with your patch.

Well, they will see nothing now. If the state was not closing, it gets left alone, so the frontend shouldn't do anything. The only risk that I can see is that some frontend/backend pair needed a direct 4 -> 6 transition to support 'unbind' before but AFAIK nothing has ever supported that, and blk and net crash'n'burn if you try that on upstream as it stands. A clean unplug would always set state to 5 first, since that's part of the unplug protocol.

  Paul

> 
> Juergen
Roger Pau Monne Dec. 9, 2019, 12:25 p.m. UTC | #7
On Mon, Dec 09, 2019 at 12:01:38PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Roger Pau Monné <roger.pau@citrix.com>
> > Sent: 09 December 2019 11:39
> > To: Durrant, Paul <pdurrant@amazon.com>
> > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> > closed
> > 
> > On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> > > Only force state to closed in the case when the toolstack may need to
> > > clean up. This can be detected by checking whether the state in xenstore
> > > has been set to closing prior to device removal.
> > 
> > I'm not sure I see the point of this, I would expect that a failure to
> > probe or the removal of the device would leave the xenbus state as
> > closed, which is consistent with the actual driver state.
> > 
> > Can you explain what's the benefit of leaving a device without a
> > driver in such unknown state?
> > 
> 
> If probe fails then I think it should leave the state alone. If the
> state is moved to closed then basically you just killed that
> connection to the guest (as the frontend will normally close down
> when it sees this change) so, if the probe failure was due to a bug
> in blkback or, e.g., a transient resource issue then it's game over
> as far as that guest goes.

But the connection can be restarted by switching the backend to the
init state again.

> The ultimate goal here is PV backend re-load that is completely transparent to the guest. Modifying anything in xenstore compromises that so we need to be careful.

That's a fine goal, but not switching to closed state in
xenbus_dev_remove seems wrong, as you have actually left the frontend
without a matching backend and with the state not set to closed.

Ie: that would be fine if you explicitly state this is some kind of
internal blkback reload, but not for the general case where blkback
has been unbound. I think we need someway to difference a blkback
reload vs a unbound.

Thanks, Roger.
Paul Durrant Dec. 9, 2019, 12:40 p.m. UTC | #8
> -----Original Message-----
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: 09 December 2019 12:26
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On Mon, Dec 09, 2019 at 12:01:38PM +0000, Durrant, Paul wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monné <roger.pau@citrix.com>
> > > Sent: 09 December 2019 11:39
> > > To: Durrant, Paul <pdurrant@amazon.com>
> > > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> Juergen
> > > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> forced to
> > > closed
> > >
> > > On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> > > > Only force state to closed in the case when the toolstack may need
> to
> > > > clean up. This can be detected by checking whether the state in
> xenstore
> > > > has been set to closing prior to device removal.
> > >
> > > I'm not sure I see the point of this, I would expect that a failure to
> > > probe or the removal of the device would leave the xenbus state as
> > > closed, which is consistent with the actual driver state.
> > >
> > > Can you explain what's the benefit of leaving a device without a
> > > driver in such unknown state?
> > >
> >
> > If probe fails then I think it should leave the state alone. If the
> > state is moved to closed then basically you just killed that
> > connection to the guest (as the frontend will normally close down
> > when it sees this change) so, if the probe failure was due to a bug
> > in blkback or, e.g., a transient resource issue then it's game over
> > as far as that guest goes.
> 
> But the connection can be restarted by switching the backend to the
> init state again.

Too late. The frontend saw closed and you already lost.

> 
> > The ultimate goal here is PV backend re-load that is completely
> transparent to the guest. Modifying anything in xenstore compromises that
> so we need to be careful.
> 
> That's a fine goal, but not switching to closed state in
> xenbus_dev_remove seems wrong, as you have actually left the frontend
> without a matching backend and with the state not set to closed.
> 

Why is this a problem? With this series fully applied a (block) backend can come and go without needing to change the state. Relying on guests to DTRT is not a sustainable option for a cloud deployment.

> Ie: that would be fine if you explicitly state this is some kind of
> internal blkback reload, but not for the general case where blkback
> has been unbound. I think we need someway to difference a blkback
> reload vs a unbound.
> 

Why do we need that though? Why is it advantageous for a backend to go to closed. No PV backends cope with an unbind as-is, and a toolstack initiated unplug will always set state to 5 anyway. So TBH any state transition done directly in the xenbus code looks wrong to me anyway (but appears to be a necessary evil to keep the toolstack working in the event it spawns a backend where there is actually to driver present, or it doesn't come online).

  Paul
Juergen Gross Dec. 9, 2019, 1:38 p.m. UTC | #9
On 09.12.19 13:19, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: 09 December 2019 12:09
>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
>> <roger.pau@citrix.com>
>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>> <boris.ostrovsky@oracle.com>
>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
>> closed
>>
>> On 09.12.19 13:03, Durrant, Paul wrote:
>>>> -----Original Message-----
>>>> From: Jürgen Groß <jgross@suse.com>
>>>> Sent: 09 December 2019 11:55
>>>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
>>>> <pdurrant@amazon.com>
>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
>> Stefano
>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com>
>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
>> to
>>>> closed
>>>>
>>>> On 09.12.19 12:39, Roger Pau Monné wrote:
>>>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
>>>>>> Only force state to closed in the case when the toolstack may need to
>>>>>> clean up. This can be detected by checking whether the state in
>>>> xenstore
>>>>>> has been set to closing prior to device removal.
>>>>>
>>>>> I'm not sure I see the point of this, I would expect that a failure to
>>>>> probe or the removal of the device would leave the xenbus state as
>>>>> closed, which is consistent with the actual driver state.
>>>>>
>>>>> Can you explain what's the benefit of leaving a device without a
>>>>> driver in such unknown state?
>>>>
>>>> And more concerning: did you check that no frontend/backend is
>>>> relying on the closed state to be visible without closing having been
>>>> set before?
>>>
>>> Blkfront doesn't seem to mind and I believe the Windows PV drivers cope,
>> but I don't really understand the comment since this patch is actually
>> removing a case where the backend transitions directly to closed.
>>
>> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi, pvcall
>> etc. frontends/backends. After all you are modifying a function common
>> to all PV driver pairs.
>>
>> You are removing a state switc to "closed" in case the state was _not_
>> "closing" before.
> 
> Yes, which AFAIK is against the intention of the generic PV protocol such that it ever existed anyway.

While this might be the case we should _not_ break any guests
running now. So this kind of reasoning is dangerous.

> 
>> So any PV driver reacting to "closed" of the other end
>> in case the previous state might not have been "closing" before is at
>> risk to misbehave with your patch.
> 
> Well, they will see nothing now. If the state was not closing, it gets left alone, so the frontend shouldn't do anything. The only risk that I can see is that some frontend/backend pair needed a direct 4 -> 6 transition to support 'unbind' before but AFAIK nothing has ever supported that, and blk and net crash'n'burn if you try that on upstream as it stands. A clean unplug would always set state to 5 first, since that's part of the unplug protocol.

That was my question: are you sure all current and previous
guest frontends and backends are handling unplug this way?

Not "should handle", but "do handle".


Juergen
Paul Durrant Dec. 9, 2019, 2:06 p.m. UTC | #10
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 09 December 2019 13:39
> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> <roger.pau@citrix.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On 09.12.19 13:19, Durrant, Paul wrote:
> >> -----Original Message-----
> >> From: Jürgen Groß <jgross@suse.com>
> >> Sent: 09 December 2019 12:09
> >> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> >> <roger.pau@citrix.com>
> >> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> Stefano
> >> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >> <boris.ostrovsky@oracle.com>
> >> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
> to
> >> closed
> >>
> >> On 09.12.19 13:03, Durrant, Paul wrote:
> >>>> -----Original Message-----
> >>>> From: Jürgen Groß <jgross@suse.com>
> >>>> Sent: 09 December 2019 11:55
> >>>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
> >>>> <pdurrant@amazon.com>
> >>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> >> Stefano
> >>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >>>> <boris.ostrovsky@oracle.com>
> >>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> forced
> >> to
> >>>> closed
> >>>>
> >>>> On 09.12.19 12:39, Roger Pau Monné wrote:
> >>>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> >>>>>> Only force state to closed in the case when the toolstack may need
> to
> >>>>>> clean up. This can be detected by checking whether the state in
> >>>> xenstore
> >>>>>> has been set to closing prior to device removal.
> >>>>>
> >>>>> I'm not sure I see the point of this, I would expect that a failure
> to
> >>>>> probe or the removal of the device would leave the xenbus state as
> >>>>> closed, which is consistent with the actual driver state.
> >>>>>
> >>>>> Can you explain what's the benefit of leaving a device without a
> >>>>> driver in such unknown state?
> >>>>
> >>>> And more concerning: did you check that no frontend/backend is
> >>>> relying on the closed state to be visible without closing having been
> >>>> set before?
> >>>
> >>> Blkfront doesn't seem to mind and I believe the Windows PV drivers
> cope,
> >> but I don't really understand the comment since this patch is actually
> >> removing a case where the backend transitions directly to closed.
> >>
> >> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi,
> pvcall
> >> etc. frontends/backends. After all you are modifying a function common
> >> to all PV driver pairs.
> >>
> >> You are removing a state switc to "closed" in case the state was _not_
> >> "closing" before.
> >
> > Yes, which AFAIK is against the intention of the generic PV protocol
> such that it ever existed anyway.
> 
> While this might be the case we should _not_ break any guests
> running now. So this kind of reasoning is dangerous.
> 
> >
> >> So any PV driver reacting to "closed" of the other end
> >> in case the previous state might not have been "closing" before is at
> >> risk to misbehave with your patch.
> >
> > Well, they will see nothing now. If the state was not closing, it gets
> left alone, so the frontend shouldn't do anything. The only risk that I
> can see is that some frontend/backend pair needed a direct 4 -> 6
> transition to support 'unbind' before but AFAIK nothing has ever supported
> that, and blk and net crash'n'burn if you try that on upstream as it
> stands. A clean unplug would always set state to 5 first, since that's
> part of the unplug protocol.
> 
> That was my question: are you sure all current and previous
> guest frontends and backends are handling unplug this way?
> 
> Not "should handle", but "do handle".

That depends on the toolstack. IIUC the only 'supported' toolstack is xl/libxl, which will set 'state' to 5 and 'online' to 0 to initiate an unplug.

  Paul

> 
> 
> Juergen
Juergen Gross Dec. 9, 2019, 2:09 p.m. UTC | #11
On 09.12.19 15:06, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: 09 December 2019 13:39
>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
>> <roger.pau@citrix.com>
>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>> <boris.ostrovsky@oracle.com>
>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
>> closed
>>
>> On 09.12.19 13:19, Durrant, Paul wrote:
>>>> -----Original Message-----
>>>> From: Jürgen Groß <jgross@suse.com>
>>>> Sent: 09 December 2019 12:09
>>>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
>>>> <roger.pau@citrix.com>
>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
>> Stefano
>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com>
>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
>> to
>>>> closed
>>>>
>>>> On 09.12.19 13:03, Durrant, Paul wrote:
>>>>>> -----Original Message-----
>>>>>> From: Jürgen Groß <jgross@suse.com>
>>>>>> Sent: 09 December 2019 11:55
>>>>>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
>>>>>> <pdurrant@amazon.com>
>>>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
>>>> Stefano
>>>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>>>>>> <boris.ostrovsky@oracle.com>
>>>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
>> forced
>>>> to
>>>>>> closed
>>>>>>
>>>>>> On 09.12.19 12:39, Roger Pau Monné wrote:
>>>>>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
>>>>>>>> Only force state to closed in the case when the toolstack may need
>> to
>>>>>>>> clean up. This can be detected by checking whether the state in
>>>>>> xenstore
>>>>>>>> has been set to closing prior to device removal.
>>>>>>>
>>>>>>> I'm not sure I see the point of this, I would expect that a failure
>> to
>>>>>>> probe or the removal of the device would leave the xenbus state as
>>>>>>> closed, which is consistent with the actual driver state.
>>>>>>>
>>>>>>> Can you explain what's the benefit of leaving a device without a
>>>>>>> driver in such unknown state?
>>>>>>
>>>>>> And more concerning: did you check that no frontend/backend is
>>>>>> relying on the closed state to be visible without closing having been
>>>>>> set before?
>>>>>
>>>>> Blkfront doesn't seem to mind and I believe the Windows PV drivers
>> cope,
>>>> but I don't really understand the comment since this patch is actually
>>>> removing a case where the backend transitions directly to closed.
>>>>
>>>> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi,
>> pvcall
>>>> etc. frontends/backends. After all you are modifying a function common
>>>> to all PV driver pairs.
>>>>
>>>> You are removing a state switc to "closed" in case the state was _not_
>>>> "closing" before.
>>>
>>> Yes, which AFAIK is against the intention of the generic PV protocol
>> such that it ever existed anyway.
>>
>> While this might be the case we should _not_ break any guests
>> running now. So this kind of reasoning is dangerous.
>>
>>>
>>>> So any PV driver reacting to "closed" of the other end
>>>> in case the previous state might not have been "closing" before is at
>>>> risk to misbehave with your patch.
>>>
>>> Well, they will see nothing now. If the state was not closing, it gets
>> left alone, so the frontend shouldn't do anything. The only risk that I
>> can see is that some frontend/backend pair needed a direct 4 -> 6
>> transition to support 'unbind' before but AFAIK nothing has ever supported
>> that, and blk and net crash'n'burn if you try that on upstream as it
>> stands. A clean unplug would always set state to 5 first, since that's
>> part of the unplug protocol.
>>
>> That was my question: are you sure all current and previous
>> guest frontends and backends are handling unplug this way?
>>
>> Not "should handle", but "do handle".
> 
> That depends on the toolstack. IIUC the only 'supported' toolstack is xl/libxl, which will set 'state' to 5 and 'online' to 0 to initiate an unplug.

I guess libvirt/libxl is doing the same?

At least at SUSE we still have some customers running xend based
Xen installations with recent Linux or Windows guests.


Juergen
Paul Durrant Dec. 9, 2019, 2:23 p.m. UTC | #12
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 09 December 2019 14:10
> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> <roger.pau@citrix.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On 09.12.19 15:06, Durrant, Paul wrote:
> >> -----Original Message-----
> >> From: Jürgen Groß <jgross@suse.com>
> >> Sent: 09 December 2019 13:39
> >> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> >> <roger.pau@citrix.com>
> >> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> Stefano
> >> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >> <boris.ostrovsky@oracle.com>
> >> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
> to
> >> closed
> >>
> >> On 09.12.19 13:19, Durrant, Paul wrote:
> >>>> -----Original Message-----
> >>>> From: Jürgen Groß <jgross@suse.com>
> >>>> Sent: 09 December 2019 12:09
> >>>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> >>>> <roger.pau@citrix.com>
> >>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> >> Stefano
> >>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >>>> <boris.ostrovsky@oracle.com>
> >>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> forced
> >> to
> >>>> closed
> >>>>
> >>>> On 09.12.19 13:03, Durrant, Paul wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: Jürgen Groß <jgross@suse.com>
> >>>>>> Sent: 09 December 2019 11:55
> >>>>>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
> >>>>>> <pdurrant@amazon.com>
> >>>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> >>>> Stefano
> >>>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >>>>>> <boris.ostrovsky@oracle.com>
> >>>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> >> forced
> >>>> to
> >>>>>> closed
> >>>>>>
> >>>>>> On 09.12.19 12:39, Roger Pau Monné wrote:
> >>>>>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> >>>>>>>> Only force state to closed in the case when the toolstack may
> need
> >> to
> >>>>>>>> clean up. This can be detected by checking whether the state in
> >>>>>> xenstore
> >>>>>>>> has been set to closing prior to device removal.
> >>>>>>>
> >>>>>>> I'm not sure I see the point of this, I would expect that a
> failure
> >> to
> >>>>>>> probe or the removal of the device would leave the xenbus state as
> >>>>>>> closed, which is consistent with the actual driver state.
> >>>>>>>
> >>>>>>> Can you explain what's the benefit of leaving a device without a
> >>>>>>> driver in such unknown state?
> >>>>>>
> >>>>>> And more concerning: did you check that no frontend/backend is
> >>>>>> relying on the closed state to be visible without closing having
> been
> >>>>>> set before?
> >>>>>
> >>>>> Blkfront doesn't seem to mind and I believe the Windows PV drivers
> >> cope,
> >>>> but I don't really understand the comment since this patch is
> actually
> >>>> removing a case where the backend transitions directly to closed.
> >>>>
> >>>> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi,
> >> pvcall
> >>>> etc. frontends/backends. After all you are modifying a function
> common
> >>>> to all PV driver pairs.
> >>>>
> >>>> You are removing a state switc to "closed" in case the state was
> _not_
> >>>> "closing" before.
> >>>
> >>> Yes, which AFAIK is against the intention of the generic PV protocol
> >> such that it ever existed anyway.
> >>
> >> While this might be the case we should _not_ break any guests
> >> running now. So this kind of reasoning is dangerous.
> >>
> >>>
> >>>> So any PV driver reacting to "closed" of the other end
> >>>> in case the previous state might not have been "closing" before is at
> >>>> risk to misbehave with your patch.
> >>>
> >>> Well, they will see nothing now. If the state was not closing, it gets
> >> left alone, so the frontend shouldn't do anything. The only risk that I
> >> can see is that some frontend/backend pair needed a direct 4 -> 6
> >> transition to support 'unbind' before but AFAIK nothing has ever
> supported
> >> that, and blk and net crash'n'burn if you try that on upstream as it
> >> stands. A clean unplug would always set state to 5 first, since that's
> >> part of the unplug protocol.
> >>
> >> That was my question: are you sure all current and previous
> >> guest frontends and backends are handling unplug this way?
> >>
> >> Not "should handle", but "do handle".
> >
> > That depends on the toolstack. IIUC the only 'supported' toolstack is
> xl/libxl, which will set 'state' to 5 and 'online' to 0 to initiate an
> unplug.
> 
> I guess libvirt/libxl is doing the same?
> 

The unplug mechansism is all in libxl AFAICT, so it should be identical.

> At least at SUSE we still have some customers running xend based
> Xen installations with recent Linux or Windows guests.
> 

Is that something the upstream code can/should support though? I'd be surprised if xend is actually doing anything different to libxl since I've been coding the Windows PV drivers to trigger off the combined closing/online transition for as long as I can remember.

  Paul
Roger Pau Monne Dec. 9, 2019, 2:28 p.m. UTC | #13
On Mon, Dec 09, 2019 at 12:40:47PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Roger Pau Monné <roger.pau@citrix.com>
> > Sent: 09 December 2019 12:26
> > To: Durrant, Paul <pdurrant@amazon.com>
> > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> > closed
> > 
> > On Mon, Dec 09, 2019 at 12:01:38PM +0000, Durrant, Paul wrote:
> > > > -----Original Message-----
> > > > From: Roger Pau Monné <roger.pau@citrix.com>
> > > > Sent: 09 December 2019 11:39
> > > > To: Durrant, Paul <pdurrant@amazon.com>
> > > > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> > Juergen
> > > > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > > > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> > forced to
> > > > closed
> > > >
> > > > On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> > > > > Only force state to closed in the case when the toolstack may need
> > to
> > > > > clean up. This can be detected by checking whether the state in
> > xenstore
> > > > > has been set to closing prior to device removal.
> > > >
> > > > I'm not sure I see the point of this, I would expect that a failure to
> > > > probe or the removal of the device would leave the xenbus state as
> > > > closed, which is consistent with the actual driver state.
> > > >
> > > > Can you explain what's the benefit of leaving a device without a
> > > > driver in such unknown state?
> > > >
> > >
> > > If probe fails then I think it should leave the state alone. If the
> > > state is moved to closed then basically you just killed that
> > > connection to the guest (as the frontend will normally close down
> > > when it sees this change) so, if the probe failure was due to a bug
> > > in blkback or, e.g., a transient resource issue then it's game over
> > > as far as that guest goes.
> > 
> > But the connection can be restarted by switching the backend to the
> > init state again.
> 
> Too late. The frontend saw closed and you already lost.
> 
> > 
> > > The ultimate goal here is PV backend re-load that is completely
> > transparent to the guest. Modifying anything in xenstore compromises that
> > so we need to be careful.
> > 
> > That's a fine goal, but not switching to closed state in
> > xenbus_dev_remove seems wrong, as you have actually left the frontend
> > without a matching backend and with the state not set to closed.
> > 
> 
> Why is this a problem? With this series fully applied a (block) backend can come and go without needing to change the state. Relying on guests to DTRT is not a sustainable option for a cloud deployment.
> 
> > Ie: that would be fine if you explicitly state this is some kind of
> > internal blkback reload, but not for the general case where blkback
> > has been unbound. I think we need someway to difference a blkback
> > reload vs a unbound.
> > 
> 
> Why do we need that though? Why is it advantageous for a backend to go to closed. No PV backends cope with an unbind as-is, and a toolstack initiated unplug will always set state to 5 anyway. So TBH any state transition done directly in the xenbus code looks wrong to me anyway (but appears to be a necessary evil to keep the toolstack working in the event it spawns a backend where there is actually to driver present, or it doesn't come online).

IMO the normal flow for unbind would be to attempt to close open
connections and then remove the driver: leaving frontends connected
without any attached backends is not correct, and will just block the
guest frontend until requests start timing out.

I can see the reasoning for doing that for the purpose of updating a
blkback module without guests noticing, but I would prefer that
leaving connections open was an option that could be given when
unbinding (or maybe a driver option in sysfs?), so that the default
behaviour would be to try to close everything when unbinding if
possible.

Thanks, Roger.
Juergen Gross Dec. 9, 2019, 2:41 p.m. UTC | #14
On 09.12.19 15:23, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: 09 December 2019 14:10
>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
>> <roger.pau@citrix.com>
>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>> <boris.ostrovsky@oracle.com>
>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
>> closed
>>
>> On 09.12.19 15:06, Durrant, Paul wrote:
>>>> -----Original Message-----
>>>> From: Jürgen Groß <jgross@suse.com>
>>>> Sent: 09 December 2019 13:39
>>>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
>>>> <roger.pau@citrix.com>
>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
>> Stefano
>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com>
>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
>> to
>>>> closed
>>>>
>>>> On 09.12.19 13:19, Durrant, Paul wrote:
>>>>>> -----Original Message-----
>>>>>> From: Jürgen Groß <jgross@suse.com>
>>>>>> Sent: 09 December 2019 12:09
>>>>>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
>>>>>> <roger.pau@citrix.com>
>>>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
>>>> Stefano
>>>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>>>>>> <boris.ostrovsky@oracle.com>
>>>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
>> forced
>>>> to
>>>>>> closed
>>>>>>
>>>>>> On 09.12.19 13:03, Durrant, Paul wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Jürgen Groß <jgross@suse.com>
>>>>>>>> Sent: 09 December 2019 11:55
>>>>>>>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
>>>>>>>> <pdurrant@amazon.com>
>>>>>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
>>>>>> Stefano
>>>>>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
>>>>>>>> <boris.ostrovsky@oracle.com>
>>>>>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
>>>> forced
>>>>>> to
>>>>>>>> closed
>>>>>>>>
>>>>>>>> On 09.12.19 12:39, Roger Pau Monné wrote:
>>>>>>>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
>>>>>>>>>> Only force state to closed in the case when the toolstack may
>> need
>>>> to
>>>>>>>>>> clean up. This can be detected by checking whether the state in
>>>>>>>> xenstore
>>>>>>>>>> has been set to closing prior to device removal.
>>>>>>>>>
>>>>>>>>> I'm not sure I see the point of this, I would expect that a
>> failure
>>>> to
>>>>>>>>> probe or the removal of the device would leave the xenbus state as
>>>>>>>>> closed, which is consistent with the actual driver state.
>>>>>>>>>
>>>>>>>>> Can you explain what's the benefit of leaving a device without a
>>>>>>>>> driver in such unknown state?
>>>>>>>>
>>>>>>>> And more concerning: did you check that no frontend/backend is
>>>>>>>> relying on the closed state to be visible without closing having
>> been
>>>>>>>> set before?
>>>>>>>
>>>>>>> Blkfront doesn't seem to mind and I believe the Windows PV drivers
>>>> cope,
>>>>>> but I don't really understand the comment since this patch is
>> actually
>>>>>> removing a case where the backend transitions directly to closed.
>>>>>>
>>>>>> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi,
>>>> pvcall
>>>>>> etc. frontends/backends. After all you are modifying a function
>> common
>>>>>> to all PV driver pairs.
>>>>>>
>>>>>> You are removing a state switc to "closed" in case the state was
>> _not_
>>>>>> "closing" before.
>>>>>
>>>>> Yes, which AFAIK is against the intention of the generic PV protocol
>>>> such that it ever existed anyway.
>>>>
>>>> While this might be the case we should _not_ break any guests
>>>> running now. So this kind of reasoning is dangerous.
>>>>
>>>>>
>>>>>> So any PV driver reacting to "closed" of the other end
>>>>>> in case the previous state might not have been "closing" before is at
>>>>>> risk to misbehave with your patch.
>>>>>
>>>>> Well, they will see nothing now. If the state was not closing, it gets
>>>> left alone, so the frontend shouldn't do anything. The only risk that I
>>>> can see is that some frontend/backend pair needed a direct 4 -> 6
>>>> transition to support 'unbind' before but AFAIK nothing has ever
>> supported
>>>> that, and blk and net crash'n'burn if you try that on upstream as it
>>>> stands. A clean unplug would always set state to 5 first, since that's
>>>> part of the unplug protocol.
>>>>
>>>> That was my question: are you sure all current and previous
>>>> guest frontends and backends are handling unplug this way?
>>>>
>>>> Not "should handle", but "do handle".
>>>
>>> That depends on the toolstack. IIUC the only 'supported' toolstack is
>> xl/libxl, which will set 'state' to 5 and 'online' to 0 to initiate an
>> unplug.
>>
>> I guess libvirt/libxl is doing the same?
>>
> 
> The unplug mechansism is all in libxl AFAICT, so it should be identical.
> 
>> At least at SUSE we still have some customers running xend based
>> Xen installations with recent Linux or Windows guests.
>>
> 
> Is that something the upstream code can/should support though? I'd be surprised if xend is actually doing anything different to libxl since I've been coding the Windows PV drivers to trigger off the combined closing/online transition for as long as I can remember.

I'd rather not have to carry a private patch for new Linux kernel to be
able to run on those hosts.

AFAIK you at Amazon have some quite old Xen installations, too. How are
you handling that (assuming the customer is updating the kernel to a
recent version in his guest)?


Juergen
Paul Durrant Dec. 9, 2019, 2:41 p.m. UTC | #15
> -----Original Message-----
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: 09 December 2019 14:29
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On Mon, Dec 09, 2019 at 12:40:47PM +0000, Durrant, Paul wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monné <roger.pau@citrix.com>
> > > Sent: 09 December 2019 12:26
> > > To: Durrant, Paul <pdurrant@amazon.com>
> > > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> Juergen
> > > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> forced to
> > > closed
> > >
> > > On Mon, Dec 09, 2019 at 12:01:38PM +0000, Durrant, Paul wrote:
> > > > > -----Original Message-----
> > > > > From: Roger Pau Monné <roger.pau@citrix.com>
> > > > > Sent: 09 December 2019 11:39
> > > > > To: Durrant, Paul <pdurrant@amazon.com>
> > > > > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> > > Juergen
> > > > > Gross <jgross@suse.com>; Stefano Stabellini
> <sstabellini@kernel.org>;
> > > > > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > > > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> > > forced to
> > > > > closed
> > > > >
> > > > > On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> > > > > > Only force state to closed in the case when the toolstack may
> need
> > > to
> > > > > > clean up. This can be detected by checking whether the state in
> > > xenstore
> > > > > > has been set to closing prior to device removal.
> > > > >
> > > > > I'm not sure I see the point of this, I would expect that a
> failure to
> > > > > probe or the removal of the device would leave the xenbus state as
> > > > > closed, which is consistent with the actual driver state.
> > > > >
> > > > > Can you explain what's the benefit of leaving a device without a
> > > > > driver in such unknown state?
> > > > >
> > > >
> > > > If probe fails then I think it should leave the state alone. If the
> > > > state is moved to closed then basically you just killed that
> > > > connection to the guest (as the frontend will normally close down
> > > > when it sees this change) so, if the probe failure was due to a bug
> > > > in blkback or, e.g., a transient resource issue then it's game over
> > > > as far as that guest goes.
> > >
> > > But the connection can be restarted by switching the backend to the
> > > init state again.
> >
> > Too late. The frontend saw closed and you already lost.
> >
> > >
> > > > The ultimate goal here is PV backend re-load that is completely
> > > transparent to the guest. Modifying anything in xenstore compromises
> that
> > > so we need to be careful.
> > >
> > > That's a fine goal, but not switching to closed state in
> > > xenbus_dev_remove seems wrong, as you have actually left the frontend
> > > without a matching backend and with the state not set to closed.
> > >
> >
> > Why is this a problem? With this series fully applied a (block) backend
> can come and go without needing to change the state. Relying on guests to
> DTRT is not a sustainable option for a cloud deployment.
> >
> > > Ie: that would be fine if you explicitly state this is some kind of
> > > internal blkback reload, but not for the general case where blkback
> > > has been unbound. I think we need someway to difference a blkback
> > > reload vs a unbound.
> > >
> >
> > Why do we need that though? Why is it advantageous for a backend to go
> to closed. No PV backends cope with an unbind as-is, and a toolstack
> initiated unplug will always set state to 5 anyway. So TBH any state
> transition done directly in the xenbus code looks wrong to me anyway (but
> appears to be a necessary evil to keep the toolstack working in the event
> it spawns a backend where there is actually to driver present, or it
> doesn't come online).
> 
> IMO the normal flow for unbind would be to attempt to close open
> connections and then remove the driver: leaving frontends connected
> without any attached backends is not correct, and will just block the
> guest frontend until requests start timing out.
> 
> I can see the reasoning for doing that for the purpose of updating a
> blkback module without guests noticing, but I would prefer that
> leaving connections open was an option that could be given when
> unbinding (or maybe a driver option in sysfs?), so that the default
> behaviour would be to try to close everything when unbinding if
> possible.

Well unbind is pretty useless now IMO since bind doesn't work, and a transition straight to closed is just plain wrong anyway. But, we could have a flag that the backend driver sets to say that it supports transparent re-bind that gates this code. Would that make you feel more comfortable?

If you want unbind to actually do a proper unplug then that's extra work and not really something I want to tackle (and re-bind would still need to be toolstack initiated as something would have to re-create the xenstore area).

  Paul

> 
> Thanks, Roger.
Paul Durrant Dec. 9, 2019, 2:43 p.m. UTC | #16
> -----Original Message-----
> From: Jürgen Groß <jgross@suse.com>
> Sent: 09 December 2019 14:41
> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> <roger.pau@citrix.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Stefano
> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On 09.12.19 15:23, Durrant, Paul wrote:
> >> -----Original Message-----
> >> From: Jürgen Groß <jgross@suse.com>
> >> Sent: 09 December 2019 14:10
> >> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> >> <roger.pau@citrix.com>
> >> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> Stefano
> >> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >> <boris.ostrovsky@oracle.com>
> >> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced
> to
> >> closed
> >>
> >> On 09.12.19 15:06, Durrant, Paul wrote:
> >>>> -----Original Message-----
> >>>> From: Jürgen Groß <jgross@suse.com>
> >>>> Sent: 09 December 2019 13:39
> >>>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> >>>> <roger.pau@citrix.com>
> >>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> >> Stefano
> >>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >>>> <boris.ostrovsky@oracle.com>
> >>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> forced
> >> to
> >>>> closed
> >>>>
> >>>> On 09.12.19 13:19, Durrant, Paul wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: Jürgen Groß <jgross@suse.com>
> >>>>>> Sent: 09 December 2019 12:09
> >>>>>> To: Durrant, Paul <pdurrant@amazon.com>; Roger Pau Monné
> >>>>>> <roger.pau@citrix.com>
> >>>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> >>>> Stefano
> >>>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >>>>>> <boris.ostrovsky@oracle.com>
> >>>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> >> forced
> >>>> to
> >>>>>> closed
> >>>>>>
> >>>>>> On 09.12.19 13:03, Durrant, Paul wrote:
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Jürgen Groß <jgross@suse.com>
> >>>>>>>> Sent: 09 December 2019 11:55
> >>>>>>>> To: Roger Pau Monné <roger.pau@citrix.com>; Durrant, Paul
> >>>>>>>> <pdurrant@amazon.com>
> >>>>>>>> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> >>>>>> Stefano
> >>>>>>>> Stabellini <sstabellini@kernel.org>; Boris Ostrovsky
> >>>>>>>> <boris.ostrovsky@oracle.com>
> >>>>>>>> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> >>>> forced
> >>>>>> to
> >>>>>>>> closed
> >>>>>>>>
> >>>>>>>> On 09.12.19 12:39, Roger Pau Monné wrote:
> >>>>>>>>> On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> >>>>>>>>>> Only force state to closed in the case when the toolstack may
> >> need
> >>>> to
> >>>>>>>>>> clean up. This can be detected by checking whether the state in
> >>>>>>>> xenstore
> >>>>>>>>>> has been set to closing prior to device removal.
> >>>>>>>>>
> >>>>>>>>> I'm not sure I see the point of this, I would expect that a
> >> failure
> >>>> to
> >>>>>>>>> probe or the removal of the device would leave the xenbus state
> as
> >>>>>>>>> closed, which is consistent with the actual driver state.
> >>>>>>>>>
> >>>>>>>>> Can you explain what's the benefit of leaving a device without a
> >>>>>>>>> driver in such unknown state?
> >>>>>>>>
> >>>>>>>> And more concerning: did you check that no frontend/backend is
> >>>>>>>> relying on the closed state to be visible without closing having
> >> been
> >>>>>>>> set before?
> >>>>>>>
> >>>>>>> Blkfront doesn't seem to mind and I believe the Windows PV drivers
> >>>> cope,
> >>>>>> but I don't really understand the comment since this patch is
> >> actually
> >>>>>> removing a case where the backend transitions directly to closed.
> >>>>>>
> >>>>>> I'm not speaking of blkfront/blkback only, but of net, tpm, scsi,
> >>>> pvcall
> >>>>>> etc. frontends/backends. After all you are modifying a function
> >> common
> >>>>>> to all PV driver pairs.
> >>>>>>
> >>>>>> You are removing a state switc to "closed" in case the state was
> >> _not_
> >>>>>> "closing" before.
> >>>>>
> >>>>> Yes, which AFAIK is against the intention of the generic PV protocol
> >>>> such that it ever existed anyway.
> >>>>
> >>>> While this might be the case we should _not_ break any guests
> >>>> running now. So this kind of reasoning is dangerous.
> >>>>
> >>>>>
> >>>>>> So any PV driver reacting to "closed" of the other end
> >>>>>> in case the previous state might not have been "closing" before is
> at
> >>>>>> risk to misbehave with your patch.
> >>>>>
> >>>>> Well, they will see nothing now. If the state was not closing, it
> gets
> >>>> left alone, so the frontend shouldn't do anything. The only risk that
> I
> >>>> can see is that some frontend/backend pair needed a direct 4 -> 6
> >>>> transition to support 'unbind' before but AFAIK nothing has ever
> >> supported
> >>>> that, and blk and net crash'n'burn if you try that on upstream as it
> >>>> stands. A clean unplug would always set state to 5 first, since
> that's
> >>>> part of the unplug protocol.
> >>>>
> >>>> That was my question: are you sure all current and previous
> >>>> guest frontends and backends are handling unplug this way?
> >>>>
> >>>> Not "should handle", but "do handle".
> >>>
> >>> That depends on the toolstack. IIUC the only 'supported' toolstack is
> >> xl/libxl, which will set 'state' to 5 and 'online' to 0 to initiate an
> >> unplug.
> >>
> >> I guess libvirt/libxl is doing the same?
> >>
> >
> > The unplug mechansism is all in libxl AFAICT, so it should be identical.
> >
> >> At least at SUSE we still have some customers running xend based
> >> Xen installations with recent Linux or Windows guests.
> >>
> >
> > Is that something the upstream code can/should support though? I'd be
> surprised if xend is actually doing anything different to libxl since I've
> been coding the Windows PV drivers to trigger off the combined
> closing/online transition for as long as I can remember.
> 
> I'd rather not have to carry a private patch for new Linux kernel to be
> able to run on those hosts.
> 
> AFAIK you at Amazon have some quite old Xen installations, too. How are
> you handling that (assuming the customer is updating the kernel to a
> recent version in his guest)?

I'm not aware of any problems running with xend (but I'm not I the loop on everything). I think it is still a reasonably safe assumption that xend initiated unplug cleanly and doesn't rely on a 4 -> 6 transition.

  Paul

> 
> 
> Juergen
Roger Pau Monne Dec. 9, 2019, 3:13 p.m. UTC | #17
On Mon, Dec 09, 2019 at 02:41:40PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Roger Pau Monné <roger.pau@citrix.com>
> > Sent: 09 December 2019 14:29
> > To: Durrant, Paul <pdurrant@amazon.com>
> > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> > closed
> > 
> > On Mon, Dec 09, 2019 at 12:40:47PM +0000, Durrant, Paul wrote:
> > > > -----Original Message-----
> > > > From: Roger Pau Monné <roger.pau@citrix.com>
> > > > Sent: 09 December 2019 12:26
> > > > To: Durrant, Paul <pdurrant@amazon.com>
> > > > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> > Juergen
> > > > Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> > > > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> > forced to
> > > > closed
> > > >
> > > > On Mon, Dec 09, 2019 at 12:01:38PM +0000, Durrant, Paul wrote:
> > > > > > -----Original Message-----
> > > > > > From: Roger Pau Monné <roger.pau@citrix.com>
> > > > > > Sent: 09 December 2019 11:39
> > > > > > To: Durrant, Paul <pdurrant@amazon.com>
> > > > > > Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org;
> > > > Juergen
> > > > > > Gross <jgross@suse.com>; Stefano Stabellini
> > <sstabellini@kernel.org>;
> > > > > > Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > > > > Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is
> > > > forced to
> > > > > > closed
> > > > > >
> > > > > > On Thu, Dec 05, 2019 at 02:01:21PM +0000, Paul Durrant wrote:
> > > > > > > Only force state to closed in the case when the toolstack may
> > need
> > > > to
> > > > > > > clean up. This can be detected by checking whether the state in
> > > > xenstore
> > > > > > > has been set to closing prior to device removal.
> > > > > >
> > > > > > I'm not sure I see the point of this, I would expect that a
> > failure to
> > > > > > probe or the removal of the device would leave the xenbus state as
> > > > > > closed, which is consistent with the actual driver state.
> > > > > >
> > > > > > Can you explain what's the benefit of leaving a device without a
> > > > > > driver in such unknown state?
> > > > > >
> > > > >
> > > > > If probe fails then I think it should leave the state alone. If the
> > > > > state is moved to closed then basically you just killed that
> > > > > connection to the guest (as the frontend will normally close down
> > > > > when it sees this change) so, if the probe failure was due to a bug
> > > > > in blkback or, e.g., a transient resource issue then it's game over
> > > > > as far as that guest goes.
> > > >
> > > > But the connection can be restarted by switching the backend to the
> > > > init state again.
> > >
> > > Too late. The frontend saw closed and you already lost.
> > >
> > > >
> > > > > The ultimate goal here is PV backend re-load that is completely
> > > > transparent to the guest. Modifying anything in xenstore compromises
> > that
> > > > so we need to be careful.
> > > >
> > > > That's a fine goal, but not switching to closed state in
> > > > xenbus_dev_remove seems wrong, as you have actually left the frontend
> > > > without a matching backend and with the state not set to closed.
> > > >
> > >
> > > Why is this a problem? With this series fully applied a (block) backend
> > can come and go without needing to change the state. Relying on guests to
> > DTRT is not a sustainable option for a cloud deployment.
> > >
> > > > Ie: that would be fine if you explicitly state this is some kind of
> > > > internal blkback reload, but not for the general case where blkback
> > > > has been unbound. I think we need someway to difference a blkback
> > > > reload vs a unbound.
> > > >
> > >
> > > Why do we need that though? Why is it advantageous for a backend to go
> > to closed. No PV backends cope with an unbind as-is, and a toolstack
> > initiated unplug will always set state to 5 anyway. So TBH any state
> > transition done directly in the xenbus code looks wrong to me anyway (but
> > appears to be a necessary evil to keep the toolstack working in the event
> > it spawns a backend where there is actually to driver present, or it
> > doesn't come online).
> > 
> > IMO the normal flow for unbind would be to attempt to close open
> > connections and then remove the driver: leaving frontends connected
> > without any attached backends is not correct, and will just block the
> > guest frontend until requests start timing out.
> > 
> > I can see the reasoning for doing that for the purpose of updating a
> > blkback module without guests noticing, but I would prefer that
> > leaving connections open was an option that could be given when
> > unbinding (or maybe a driver option in sysfs?), so that the default
> > behaviour would be to try to close everything when unbinding if
> > possible.
> 
> Well unbind is pretty useless now IMO since bind doesn't work, and a transition straight to closed is just plain wrong anyway.

Why do you claim that a straight transition into the closed state is
wrong?

I don't see any such mention in blkif.h, which also doesn't contain
any guidelines regarding closing state transitions, so unless
otherwise stated somewhere else transitions into closed can happen
from any state IMO.

> But, we could have a flag that the backend driver sets to say that it supports transparent re-bind that gates this code. Would that make you feel more comfortable?

Having an option to leave state untouched when unbinding would be fine
for me, otherwise state should be set to closed when unbinding. I
don't think there's anything else that needs to be done in this
regard, the cleanup should be exactly the same the only difference
being the setting of all the active backends to closed state.

> If you want unbind to actually do a proper unplug then that's extra work and not really something I want to tackle (and re-bind would still need to be toolstack initiated as something would have to re-create the xenstore area).

Why do you say the xenstore area would need to be recreated?

Setting state to closed shouldn't cause any cleanup of the xenstore
area, as that should already happen for example when using pvgrub
since in that case grub itself disconnects and already causes a
transition to closed and a re-attachment afterwards by the guest
kernel.

Thanks, Roger.
Paul Durrant Dec. 9, 2019, 4:26 p.m. UTC | #18
> -----Original Message-----
[snip]
> >
> > Well unbind is pretty useless now IMO since bind doesn't work, and a
> transition straight to closed is just plain wrong anyway.
> 
> Why do you claim that a straight transition into the closed state is
> wrong?

It's badly documented, I agree, but have a look at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/xen-netback/xenbus.c#n480. Connected -> Closed is not a valid transition, and I don't think it was ever intended to be.

> 
> I don't see any such mention in blkif.h, which also doesn't contain
> any guidelines regarding closing state transitions, so unless
> otherwise stated somewhere else transitions into closed can happen
> from any state IMO.
> 

They can, but it is even more poorly documented what should be done in this case.

> > But, we could have a flag that the backend driver sets to say that it
> supports transparent re-bind that gates this code. Would that make you
> feel more comfortable?
> 
> Having an option to leave state untouched when unbinding would be fine
> for me, otherwise state should be set to closed when unbinding. I
> don't think there's anything else that needs to be done in this
> regard, the cleanup should be exactly the same the only difference
> being the setting of all the active backends to closed state.
> 

Ok, I'll add such a flag and define it for blkback only, in patch #4 i.e. when it actually gains the ability to rebind.

> > If you want unbind to actually do a proper unplug then that's extra work
> and not really something I want to tackle (and re-bind would still need to
> be toolstack initiated as something would have to re-create the xenstore
> area).
> 
> Why do you say the xenstore area would need to be recreated?
> 
> Setting state to closed shouldn't cause any cleanup of the xenstore
> area, as that should already happen for example when using pvgrub
> since in that case grub itself disconnects and already causes a
> transition to closed and a re-attachment afterwards by the guest
> kernel.
> 

For some reason, when I originally tested, the xenstore area disappeared. I checked again and it did not this time. I just ended up with a frontend stuck in state 5 (because it is the system disk and won't go offline) trying to talk to a non-existent backend. Upon re-bind the backend goes into state 5 (because it sees the 5 in the frontend) and leaves the guest wedged.

  Paul
Roger Pau Monne Dec. 9, 2019, 5:17 p.m. UTC | #19
On Mon, Dec 09, 2019 at 04:26:15PM +0000, Durrant, Paul wrote:
> > > If you want unbind to actually do a proper unplug then that's extra work
> > and not really something I want to tackle (and re-bind would still need to
> > be toolstack initiated as something would have to re-create the xenstore
> > area).
> > 
> > Why do you say the xenstore area would need to be recreated?
> > 
> > Setting state to closed shouldn't cause any cleanup of the xenstore
> > area, as that should already happen for example when using pvgrub
> > since in that case grub itself disconnects and already causes a
> > transition to closed and a re-attachment afterwards by the guest
> > kernel.
> > 
> 
> For some reason, when I originally tested, the xenstore area disappeared. I checked again and it did not this time. I just ended up with a frontend stuck in state 5 (because it is the system disk and won't go offline) trying to talk to a non-existent backend. Upon re-bind the backend goes into state 5 (because it sees the 5 in the frontend) and leaves the guest wedged.

Likely blkfront should go back to init state, but anyway, that's not
something that needs fixing as part of this series.

Thanks, Roger.
Paul Durrant Dec. 9, 2019, 5:23 p.m. UTC | #20
> -----Original Message-----
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: 09 December 2019 17:18
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: linux-kernel@vger.kernel.org; xen-devel@lists.xenproject.org; Juergen
> Gross <jgross@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Subject: Re: [Xen-devel] [PATCH 2/4] xenbus: limit when state is forced to
> closed
> 
> On Mon, Dec 09, 2019 at 04:26:15PM +0000, Durrant, Paul wrote:
> > > > If you want unbind to actually do a proper unplug then that's extra
> work
> > > and not really something I want to tackle (and re-bind would still
> need to
> > > be toolstack initiated as something would have to re-create the
> xenstore
> > > area).
> > >
> > > Why do you say the xenstore area would need to be recreated?
> > >
> > > Setting state to closed shouldn't cause any cleanup of the xenstore
> > > area, as that should already happen for example when using pvgrub
> > > since in that case grub itself disconnects and already causes a
> > > transition to closed and a re-attachment afterwards by the guest
> > > kernel.
> > >
> >
> > For some reason, when I originally tested, the xenstore area
> disappeared. I checked again and it did not this time. I just ended up
> with a frontend stuck in state 5 (because it is the system disk and won't
> go offline) trying to talk to a non-existent backend. Upon re-bind the
> backend goes into state 5 (because it sees the 5 in the frontend) and
> leaves the guest wedged.
> 
> Likely blkfront should go back to init state, but anyway, that's not
> something that needs fixing as part of this series.
> 

Ok, cool.

I am wondering though whether we ought to suppress bind/unbind for drivers that don't whitelist themselves (through the new xenbus_driver flag that I'll add). It's somewhat misleading that the nodes are there but don't necessarily work.

  Paul

Patch
diff mbox series

diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index a10311c348b9..c54a53da0106 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -255,7 +255,6 @@  int xenbus_dev_probe(struct device *_dev)
 	module_put(drv->driver.owner);
 fail:
 	xenbus_dev_error(dev, err, "xenbus_dev_probe on %s", dev->nodename);
-	xenbus_switch_state(dev, XenbusStateClosed);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_probe);
@@ -264,6 +263,7 @@  int xenbus_dev_remove(struct device *_dev)
 {
 	struct xenbus_device *dev = to_xenbus_device(_dev);
 	struct xenbus_driver *drv = to_xenbus_driver(_dev->driver);
+	enum xenbus_state state;
 
 	DPRINTK("%s", dev->nodename);
 
@@ -276,7 +276,14 @@  int xenbus_dev_remove(struct device *_dev)
 
 	free_otherend_details(dev);
 
-	xenbus_switch_state(dev, XenbusStateClosed);
+	/*
+	 * If the toolstack had force the device state to closing then set
+	 * the state to closed now to allow it to be cleaned up.
+	 */
+	state = xenbus_read_driver_state(dev->nodename);
+	if (state == XenbusStateClosing)
+		xenbus_switch_state(dev, XenbusStateClosed);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_remove);