Message ID: 20190109112728.9214-1-xieyongji@baidu.com
Series: vhost-user-blk: Add support for backend reconnecting
On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> From: Xie Yongji <xieyongji@baidu.com>
>
> This patchset is aimed at supporting QEMU reconnecting to the
> vhost-user-blk backend after the backend crashes or restarts.
>
> [...]
>
> Patch 2 introduces two new messages, VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD, to support providing shared
> memory to the backend.

Can you describe the problem that the inflight I/O shared memory region
solves?

Thanks,
Stefan
Hi

On Wed, Jan 9, 2019 at 3:28 PM <elohimes@gmail.com> wrote:
>
> From: Xie Yongji <xieyongji@baidu.com>
>
> [...]
>
> Xie Yongji (7):
>   char-socket: Enable "nowait" option on client sockets

This patch breaks make check.

It would be nice to add a test for the new nowait behaviour.
On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> Can you describe the problem that the inflight I/O shared memory region
> solves?

The backend needs to find the inflight I/O requests and replay them
after a restart. At the moment we can only get used_idx from the used
ring, and that is not enough: descriptors are not always processed in
the same order in which they were made available. A simple example:
https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
shared memory region to track inflight I/O.

Thanks,
Yongji
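To make the replay idea above concrete, here is a minimal sketch of
per-queue inflight tracking in shared memory. The names, sizes, and
layout are illustrative assumptions, not the format this series
actually defines:

    /* Illustrative sketch only -- the real layout is defined by the
     * patches/spec.  One region per virtqueue, sized for the maximum
     * number of descriptors, lives in memory shared between QEMU and
     * the backend so it survives a backend restart. */
    #include <stdint.h>

    #define VQ_MAX_SIZE 1024  /* assumed maximum virtqueue size */

    typedef struct QueueInflightRegion {
        /* One slot per descriptor head; nonzero means the request was
         * fetched from the avail ring but not yet marked used. */
        uint8_t inflight[VQ_MAX_SIZE];
    } QueueInflightRegion;

    /* Backend marks a head inflight when it pops it from the avail
     * ring... */
    static inline void mark_inflight(QueueInflightRegion *r, uint16_t head)
    {
        r->inflight[head] = 1;
    }

    /* ...and clears it after pushing the used element. */
    static inline void mark_done(QueueInflightRegion *r, uint16_t head)
    {
        r->inflight[head] = 0;
    }

    /* After reconnect, the new backend scans the region and resubmits
     * every head still marked inflight. */
    static inline void replay_inflight(QueueInflightRegion *r,
                                       void (*resubmit)(uint16_t head))
    {
        for (uint16_t head = 0; head < VQ_MAX_SIZE; head++) {
            if (r->inflight[head]) {
                resubmit(head);
            }
        }
    }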
On Thu, 10 Jan 2019 at 18:39, Marc-André Lureau
<marcandre.lureau@gmail.com> wrote:
>
> > Xie Yongji (7):
> >   char-socket: Enable "nowait" option on client sockets
>
> This patch breaks make check.
>
> It would be nice to add a test for the new nowait behaviour.

Oh, sorry. I'll fix this issue and add a test for the "nowait" option.
Thank you!

Thanks,
Yongji
On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> The backend needs to find the inflight I/O requests and replay them
> after a restart. At the moment we can only get used_idx from the used
> ring, and that is not enough: descriptors are not always processed in
> the same order in which they were made available. A simple example:
> https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
> shared memory region to track inflight I/O.

The inflight shared memory region is something that needs to be
discussed in detail to make sure it is correct not just for vhost-blk
but also for other vhost-user devices in the future.

Please expand the protocol documentation so that someone can implement
this feature without looking at the code. Explain the reason for the
inflight shared memory region and how exactly it is used.

After a quick look at the shared memory region documentation I have a
few questions:

1. The avail ring only contains "head" indices, each of which may chain
   non-head descriptors. Does the inflight memory region track only the
   heads?

2. Does the inflight shared memory region preserve avail ring order?
   For example, if the guest submitted 5 2 4, will the new vhost-user
   backend keep that order or does it see 2 4 5?

3. What are the exact memory region size requirements? Imagine a device
   with a large virtqueue, or a device with multiple virtqueues of
   different sizes. Can your structure handle those cases?

I'm concerned that this approach to device recovery is invasive and
hard to test. Instead I would use VIRTIO's Device Status Field
DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
necessary. This is more disruptive - drivers either have to resubmit or
fail I/O with EIO - but it's also simple and more likely to work
correctly (it only needs to be implemented correctly in the guest
driver, not in the many available vhost-user backend implementations).

Stefan
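For reference, the device-model side of the DEVICE_NEEDS_RESET
alternative is small in QEMU terms. A sketch, assuming virtio_error()'s
current semantics (it marks the device broken and sets the NEEDS_RESET
status bit for VIRTIO 1.0 guests); the function name and error message
here are illustrative, and as the follow-up notes, guest drivers do not
yet act on the bit:

    /* Sketch of the device-side half of this suggestion, within the
     * QEMU source tree.  virtio_error() marks the device broken and
     * sets VIRTIO_CONFIG_S_NEEDS_RESET for VIRTIO 1.0 guests. */
    #include "qemu/osdep.h"
    #include "hw/virtio/virtio.h"

    static void vhost_user_blk_flag_needs_reset(VirtIODevice *vdev)
    {
        /* Guest sees NEEDS_RESET in the Device Status Field and is
         * expected to reset the device, resubmitting or failing the
         * inflight I/O itself. */
        virtio_error(vdev, "vhost-user-blk: backend connection lost");
    }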
On Fri, Jan 11, 2019 at 03:53:42PM +0000, Stefan Hajnoczi wrote:
> I'm concerned that this approach to device recovery is invasive and
> hard to test. Instead I would use VIRTIO's Device Status Field
> DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> necessary. This is more disruptive - drivers either have to resubmit
> or fail I/O with EIO - but it's also simple and more likely to work
> correctly (it only needs to be implemented correctly in the guest
> driver, not in the many available vhost-user backend implementations).
>
> Stefan

Unfortunately drivers don't support DEVICE_NEEDS_RESET yet. I'll be
happy to accept patches, but this means we can't depend on
DEVICE_NEEDS_RESET for basic functionality like reconnect.

And given virtio 1.0 has been there for a while now, I suspect the only
way to start using it is by adding a new feature flag. Unfortunate but
true.
On Fri, 11 Jan 2019 at 23:53, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> The inflight shared memory region is something that needs to be
> discussed in detail to make sure it is correct not just for vhost-blk
> but also for other vhost-user devices in the future.
>
> Please expand the protocol documentation so that someone can implement
> this feature without looking at the code. Explain the reason for the
> inflight shared memory region and how exactly it is used.

OK, will do it in v5.

> 1. The avail ring only contains "head" indices, each of which may chain
>    non-head descriptors. Does the inflight memory region track only the
>    heads?

Yes, we only track the heads in the inflight region.

> 2. Does the inflight shared memory region preserve avail ring order?
>    For example, if the guest submitted 5 2 4, will the new vhost-user
>    backend keep that order or does it see 2 4 5?

We don't support resubmitting I/O in order yet. But I think we could
add a timestamp for each inflight descriptor in the shared memory to
achieve that.

> 3. What are the exact memory region size requirements? Imagine a device
>    with a large virtqueue, or a device with multiple virtqueues of
>    different sizes. Can your structure handle those cases?

Each available virtqueue has its own region in the inflight memory. The
size of each region is fixed and large enough to handle the maximum
number of elements in one virtqueue. There may be a little waste, but I
think it makes the structure clearer.

> I'm concerned that this approach to device recovery is invasive and
> hard to test. Instead I would use VIRTIO's Device Status Field
> DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> necessary. This is more disruptive - drivers either have to resubmit
> or fail I/O with EIO - but it's also simple and more likely to work
> correctly (it only needs to be implemented correctly in the guest
> driver, not in the many available vhost-user backend implementations).

So you mean adding a way to notify the guest to resubmit inflight I/O.
I think it's a good idea. But wouldn't it be more flexible to implement
this in the backend? We could support old guests, and it would be
easier to fix bugs or add features.

Thanks,
Yongji
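A hedged sketch of the sizing rule described here -- one fixed-size
region per virtqueue, dimensioned for the queue's maximum size. The
per-descriptor entry (carrying the timestamp mentioned above) is
illustrative, not the layout the series defines:

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative per-descriptor entry; a monotonic counter or
     * timestamp would let a restarted backend replay in avail-ring
     * order, per question 2 above. */
    typedef struct DescStateInflight {
        uint8_t  inflight;   /* popped from avail, not yet used */
        uint64_t counter;    /* stamp taken when the head was popped */
    } DescStateInflight;

    /* One region per virtqueue, each sized for the maximum queue
     * size; smaller queues waste a little space but keep the layout
     * uniform. */
    static size_t inflight_region_size(unsigned num_queues,
                                       unsigned queue_size)
    {
        return (size_t)num_queues * queue_size * sizeof(DescStateInflight);
    }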
On Sat, Jan 12, 2019 at 12:50:12PM +0800, Yongji Xie wrote:
> > 1. The avail ring only contains "head" indices, each of which may
> >    chain non-head descriptors. Does the inflight memory region track
> >    only the heads?
>
> Yes, we only track the heads in the inflight region.

Okay, thanks. That is useful information to include in the
specification.

> > 2. Does the inflight shared memory region preserve avail ring order?
> >    For example, if the guest submitted 5 2 4, will the new vhost-user
> >    backend keep that order or does it see 2 4 5?
>
> We don't support resubmitting I/O in order yet. But I think we could
> add a timestamp for each inflight descriptor in the shared memory to
> achieve that.

Great, the reason I think that feature is interesting is that other
device types may need to preserve order. It depends on the exact
meaning of the device's requests...

> So you mean adding a way to notify the guest to resubmit inflight I/O.
> I think it's a good idea. But wouldn't it be more flexible to implement
> this in the backend? We could support old guests, and it would be
> easier to fix bugs or add features.

There are trade-offs with either approach. In the long run I think it's
beneficial to minimize non-trivial logic in vhost-user backends. There
will be more vhost-user backend implementations and therefore more bugs
if we put the logic into the backend. This is why I think a simple
mechanism for marking the device as needing a reset will be more
reliable and less trouble.

Stefan
On Mon, 14 Jan 2019 at 18:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> Okay, thanks. That is useful information to include in the
> specification.

OK, will do it.

> Great, the reason I think that feature is interesting is that other
> device types may need to preserve order. It depends on the exact
> meaning of the device's requests...

Yes, this would be a useful feature.

> There are trade-offs with either approach. In the long run I think it's
> beneficial to minimize non-trivial logic in vhost-user backends. There
> will be more vhost-user backend implementations and therefore more bugs
> if we put the logic into the backend. This is why I think a simple
> mechanism for marking the device as needing a reset will be more
> reliable and less trouble.

I agree. So is it possible to implement both? In the long run, updating
the guest driver to support this is better. At that point, the logic in
the backend can be kept only to support legacy guest drivers.

Thanks,
Yongji
On Mon, Jan 14, 2019 at 06:55:40PM +0800, Yongji Xie wrote:
> I agree. So is it possible to implement both? In the long run, updating
> the guest driver to support this is better. At that point, the logic in
> the backend can be kept only to support legacy guest drivers.

Yes, in theory both approaches could be available.

Stefan
From: Xie Yongji <xieyongji@baidu.com>

This patchset is aimed at supporting QEMU reconnecting to the
vhost-user-blk backend after the backend crashes or restarts.

Patch 1 uses the existing wait/nowait options to make QEMU not
connect on client sockets during initialization of the chardev.

Patch 2 introduces two new messages, VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD, to support providing shared
memory to the backend.

Patches 3 and 4 are the corresponding libvhost-user patches of
patch 2: they make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD.

Patch 5 allows vhost-user-blk to use the two new messages to
get/set the inflight buffer from/to the backend.

Patch 6 supports vhost-user-blk reconnecting to the backend when
the connection is closed.

Patch 7 introduces VHOST_USER_PROTOCOL_F_SLAVE_SHMFD in the
vhost-user-blk backend, which is used to tell QEMU that we now
support reconnecting.

To use it, we could start qemu with:

  qemu-system-x86_64 \
      -chardev socket,id=char0,path=/path/vhost.socket,nowait,reconnect=1 \
      -device vhost-user-blk-pci,chardev=char0

and start the vhost-user-blk backend with:

  vhost-user-blk -b /path/file -s /path/vhost.socket

Then we can restart vhost-user-blk at any time while the VM is running.

V3 to V4:
- Drop messages VHOST_USER_GET_SHM_SIZE and VHOST_USER_SET_SHM_FD
- Introduce two new messages VHOST_USER_GET_INFLIGHT_FD
  and VHOST_USER_SET_INFLIGHT_FD
- Allocate the inflight buffer in the backend rather than in qemu
- Document a recommended format for the inflight buffer

V2 to V3:
- Use the existing wait/nowait options to control connection on
  client sockets instead of introducing a "disconnected" option
- Support the case that the vhost-user backend restarts during
  initialization of the vhost-user-blk device

V1 to V2:
- Introduce a "disconnected" option for chardev instead of reusing
  the "wait" option
- Support the case that QEMU starts before the vhost-user backend
- Drop message VHOST_USER_SET_VRING_INFLIGHT
- Introduce two new messages VHOST_USER_GET_SHM_SIZE
  and VHOST_USER_SET_SHM_FD

Xie Yongji (7):
  char-socket: Enable "nowait" option on client sockets
  vhost-user: Support transferring inflight buffer between qemu and
    backend
  libvhost-user: Introduce vu_queue_map_desc()
  libvhost-user: Support tracking inflight I/O in shared memory
  vhost-user-blk: Add support to get/set inflight buffer
  vhost-user-blk: Add support to reconnect backend
  contrib/vhost-user-blk: enable inflight I/O tracking

 Makefile                                |   2 +-
 chardev/char-socket.c                   |  56 ++--
 contrib/libvhost-user/libvhost-user.c   | 346 ++++++++++++++++++++----
 contrib/libvhost-user/libvhost-user.h   |  29 ++
 contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
 docs/interop/vhost-user.txt             |  60 ++++
 hw/block/vhost-user-blk.c               | 223 ++++++++++++---
 hw/virtio/vhost-user.c                  | 108 ++++++++
 hw/virtio/vhost.c                       | 108 ++++++++
 include/hw/virtio/vhost-backend.h       |   9 +
 include/hw/virtio/vhost-user-blk.h      |   5 +
 include/hw/virtio/vhost.h               |  19 ++
 qapi/char.json                          |   3 +-
 qemu-options.hx                         |   9 +-
 14 files changed, 851 insertions(+), 129 deletions(-)
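The cover letter does not show the payload of the two new messages; a
hedged sketch of what such a message pair might carry is below. The
field names are illustrative -- the authoritative layout is in the
docs/interop/vhost-user.txt update in patch 2:

    #include <stdint.h>

    /* Illustrative payload for VHOST_USER_GET_INFLIGHT_FD /
     * VHOST_USER_SET_INFLIGHT_FD.  With GET, QEMU asks the backend to
     * allocate the inflight region and return an fd plus this payload;
     * with SET, QEMU hands the saved fd and payload back to a
     * restarted backend so it can resume tracking. */
    typedef struct VhostUserInflight {
        uint64_t mmap_size;    /* size of the shared inflight region */
        uint64_t mmap_offset;  /* offset of the region within the fd */
        uint16_t num_queues;   /* number of virtqueues covered */
        uint16_t queue_size;   /* maximum virtqueue size */
    } VhostUserInflight;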