[v2,0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

Message ID: cover.1633376313.git.qemu_oss@crudebyte.com

Message

Christian Schoenebeck Oct. 4, 2021, 7:38 p.m. UTC
At the moment the maximum transfer size with virtio is limited to 4M
(1024 * PAGE_SIZE). This series raises this limit to its theoretical
maximum of 128M (32k pages) according to the virtio specs:

https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

Maintainers: if you don't care about allowing users to go beyond 4M, then no
action is required on your side. This series preserves the old limit of 1k
for your device by using VIRTQUEUE_LEGACY_MAX_SIZE on your end.

If you do want to support 128M, however, then replace
VIRTQUEUE_LEGACY_MAX_SIZE with VIRTQUEUE_MAX_SIZE on your end (see patch 3,
which makes 9pfs the first virtio user to support it, as sketched below) and
make sure that your device actually supports this new transfer size limit.
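
For 9pfs (patch 3) the opt-in essentially boils down to passing
VIRTQUEUE_MAX_SIZE when initializing the device, roughly like this (using the
additional queue size argument introduced by patch 1 of this series):

    virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                VIRTQUEUE_MAX_SIZE);  /* instead of VIRTQUEUE_LEGACY_MAX_SIZE */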

Changes v1 -> v2:

  * Instead of simply raising VIRTQUEUE_MAX_SIZE to 32k for all virtio
    users, preserve the old value of 1k for all virtio users unless they
    explicitly opted in:
    https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00056.html

Christian Schoenebeck (3):
  virtio: turn VIRTQUEUE_MAX_SIZE into a variable
  virtio: increase VIRTQUEUE_MAX_SIZE to 32k
  virtio-9p-device: switch to 32k max. transfer size

 hw/9pfs/virtio-9p-device.c     |  3 ++-
 hw/block/vhost-user-blk.c      |  6 +++---
 hw/block/virtio-blk.c          |  7 ++++---
 hw/char/virtio-serial-bus.c    |  2 +-
 hw/display/virtio-gpu-base.c   |  2 +-
 hw/input/virtio-input.c        |  2 +-
 hw/net/virtio-net.c            | 25 ++++++++++++------------
 hw/scsi/virtio-scsi.c          |  2 +-
 hw/virtio/vhost-user-fs.c      |  6 +++---
 hw/virtio/vhost-user-i2c.c     |  3 ++-
 hw/virtio/vhost-vsock-common.c |  2 +-
 hw/virtio/virtio-balloon.c     |  4 ++--
 hw/virtio/virtio-crypto.c      |  3 ++-
 hw/virtio/virtio-iommu.c       |  2 +-
 hw/virtio/virtio-mem.c         |  2 +-
 hw/virtio/virtio-mmio.c        |  4 ++--
 hw/virtio/virtio-pmem.c        |  2 +-
 hw/virtio/virtio-rng.c         |  3 ++-
 hw/virtio/virtio.c             | 35 +++++++++++++++++++++++-----------
 include/hw/virtio/virtio.h     | 25 ++++++++++++++++++++++--
 20 files changed, 90 insertions(+), 50 deletions(-)

Comments

David Hildenbrand Oct. 5, 2021, 7:38 a.m. UTC | #1
On 04.10.21 21:38, Christian Schoenebeck wrote:
> At the moment the maximum transfer size with virtio is limited to 4M
> (1024 * PAGE_SIZE). This series raises this limit to its maximum
> theoretical possible transfer size of 128M (32k pages) according to the
> virtio specs:
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 

I'm missing the "why do we care". Can you comment on that?
Christian Schoenebeck Oct. 5, 2021, 11:10 a.m. UTC | #2
On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> On 04.10.21 21:38, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
> I'm missing the "why do we care". Can you comment on that?

The primary motivation is the possibility of improved performance, e.g. in the
case of 9pfs, people can raise the maximum transfer size with the Linux 9p
client's 'msize' option on the guest side (and only on the guest side,
actually). If the guest performs large chunk I/O, e.g. consider something
"useful" like this one on the guest side:

  time cat large_file_on_9pfs.dat > /dev/null

Then there is a noticeable performance increase with higher transfer size
values. That performance gain continues with rising transfer size values, but
the improvement obviously shrinks as transfer sizes grow, as with similar
concepts in general (cache sizes, etc.).

A secondary motivation is described in reason (2) of patch 2: if the transfer
size is configurable on the guest side (as is the case with the 9pfs 'msize'
option), then there is the unpleasant side effect that the current virtio
limit of 4M is invisible to the guest; this value of 4M is simply an arbitrary
limit set on the QEMU side in the past (probably just implementation motivated
on the QEMU side at that point), i.e. it is not a limit specified by the
virtio protocol, nor is this limit made known to the guest via the virtio
protocol at all. The consequence with 9pfs is that if the user tries to go
higher than 4M, the system simply hangs with this QEMU error:

  virtio: too many write descriptors in indirect table
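
For context: that 4M figure simply reflects QEMU's current hard-coded
descriptor limit; quoting the definition from memory here, so treat it as
illustrative only:

  /* include/hw/virtio/virtio.h, before this series */
  #define VIRTQUEUE_MAX_SIZE 1024   /* 1024 descriptors * 4k pages = 4M */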

Whether this is an issue for an individual virtio user depends on whether
that virtio user already enforced its own limit of <= 4M on its side.

Best regards,
Christian Schoenebeck
Michael S. Tsirkin Oct. 5, 2021, 11:19 a.m. UTC | #3
On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > > x1-240006
> > I'm missing the "why do we care". Can you comment on that?
> 
> Primary motivation is the possibility of improved performance, e.g. in case of 
> 9pfs, people can raise the maximum transfer size with the Linux 9p client's 
> 'msize' option on guest side (and only on guest side actually). If guest 
> performs large chunk I/O, e.g. consider something "useful" like this one on 
> guest side:
> 
>   time cat large_file_on_9pfs.dat > /dev/null
> 
> Then there is a noticable performance increase with higher transfer size 
> values. That performance gain is continuous with rising transfer size values, 
> but the performance increase obviously shrinks with rising transfer sizes as 
> well, as with similar concepts in general like cache sizes, etc.
> 
> Then a secondary motivation is described in reason (2) of patch 2: if the 
> transfer size is configurable on guest side (like it is the case with the 9pfs 
> 'msize' option), then there is the unpleasant side effect that the current 
> virtio limit of 4M is invisible to guest; as this value of 4M is simply an 
> arbitrarily limit set on QEMU side in the past (probably just implementation 
> motivated on QEMU side at that point), i.e. it is not a limit specified by the 
> virtio protocol,

According to the spec it's specified, sure enough: vq size limits the
size of indirect descriptors too.
However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
do not enforce it in the driver ...

> nor is this limit be made aware to guest via virtio protocol 
> at all. The consequence with 9pfs would be if user tries to go higher than 4M, 
> then the system would simply hang with this QEMU error:
> 
>   virtio: too many write descriptors in indirect table
> 
> Now whether this is an issue or not for individual virtio users, depends on 
> whether the individual virtio user already had its own limitation <= 4M 
> enforced on its side.
> 
> Best regards,
> Christian Schoenebeck
>
Christian Schoenebeck Oct. 5, 2021, 11:43 a.m. UTC | #4
On Dienstag, 5. Oktober 2021 13:19:43 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> > > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > the
> > > > virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.h
> > > > tml#
> > > > x1-240006
> > > 
> > > I'm missing the "why do we care". Can you comment on that?
> > 
> > Primary motivation is the possibility of improved performance, e.g. in
> > case of 9pfs, people can raise the maximum transfer size with the Linux
> > 9p client's 'msize' option on guest side (and only on guest side
> > actually). If guest performs large chunk I/O, e.g. consider something
> > "useful" like this one on> 
> > guest side:
> >   time cat large_file_on_9pfs.dat > /dev/null
> > 
> > Then there is a noticable performance increase with higher transfer size
> > values. That performance gain is continuous with rising transfer size
> > values, but the performance increase obviously shrinks with rising
> > transfer sizes as well, as with similar concepts in general like cache
> > sizes, etc.
> > 
> > Then a secondary motivation is described in reason (2) of patch 2: if the
> > transfer size is configurable on guest side (like it is the case with the
> > 9pfs 'msize' option), then there is the unpleasant side effect that the
> > current virtio limit of 4M is invisible to guest; as this value of 4M is
> > simply an arbitrarily limit set on QEMU side in the past (probably just
> > implementation motivated on QEMU side at that point), i.e. it is not a
> > limit specified by the virtio protocol,
> 
> According to the spec it's specified, sure enough: vq size limits the
> size of indirect descriptors too.

In the virtio specs the only hard limit that I see is the aforementioned 32k:

"Queue Size corresponds to the maximum number of buffers in the virtqueue. 
Queue Size value is always a power of 2. The maximum Queue Size value is 
32768. This value is specified in a bus-specific way."

> However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
> do not enforce it in the driver ...

Then there is the current queue size (which is probably what you mean here),
which is transmitted to the guest with whatever value virtio was initialized
with.

In the case of the 9p client, however, the virtio queue size is first
initialized with some hard-coded value when the 9p driver is loaded on the
Linux guest side; then, when a 9pfs is mounted later by the guest, the mount
may include the 'msize' option to raise the transfer size, and that's the
problem. I don't see any way for the guest to learn that it cannot go above
that 4M transfer size.

> > nor is this limit be made aware to guest via virtio protocol
> > at all. The consequence with 9pfs would be if user tries to go higher than
> > 4M,> 
> > then the system would simply hang with this QEMU error:
> >   virtio: too many write descriptors in indirect table
> > 
> > Now whether this is an issue or not for individual virtio users, depends
> > on
> > whether the individual virtio user already had its own limitation <= 4M
> > enforced on its side.
> > 
> > Best regards,
> > Christian Schoenebeck
Stefan Hajnoczi Oct. 7, 2021, 5:23 a.m. UTC | #5
On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> At the moment the maximum transfer size with virtio is limited to 4M
> (1024 * PAGE_SIZE). This series raises this limit to its maximum
> theoretical possible transfer size of 128M (32k pages) according to the
> virtio specs:
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

Hi Christian,
I took a quick look at the code:

- The Linux 9p driver restricts descriptor chains to 128 elements
  (net/9p/trans_virtio.c:VIRTQUEUE_NUM)

- The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
  with EINVAL when called with more than IOV_MAX iovecs
  (hw/9pfs/9p.c:v9fs_read())

Unless I misunderstood the code, neither side can take advantage of the
new 32k descriptor chain limit?

Thanks,
Stefan
Christian Schoenebeck Oct. 7, 2021, 12:51 p.m. UTC | #6
On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > x1-240006
> Hi Christian,
> I took a quick look at the code:
> 
> - The Linux 9p driver restricts descriptor chains to 128 elements
>   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)

Yes, that's the limitation that I am about to remove (WIP); current kernel 
patches:
https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

> - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
>   with EINVAL when called with more than IOV_MAX iovecs
>   (hw/9pfs/9p.c:v9fs_read())

Hmm, which makes me wonder why I never encountered this error during testing.

Most people will use the 9p QEMU 'local' fs driver backend in practice, so
that v9fs_read() call would translate for most people to this implementation
on the QEMU side (hw/9pfs/9p-local.c):

static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
                            const struct iovec *iov,
                            int iovcnt, off_t offset)
{
#ifdef CONFIG_PREADV
    return preadv(fs->fd, iov, iovcnt, offset);
#else
    int err = lseek(fs->fd, offset, SEEK_SET);
    if (err == -1) {
        return err;
    } else {
        return readv(fs->fd, iov, iovcnt);
    }
#endif
}

> Unless I misunderstood the code, neither side can take advantage of the
> new 32k descriptor chain limit?
> 
> Thanks,
> Stefan

I need to check that when I have some more time. One possible explanation 
might be that preadv() already has this wrapped into a loop in its 
implementation to circumvent a limit like IOV_MAX. It might be another "it 
works, but not portable" issue, but not sure.

There are still a bunch of other issues I have to resolve. If you look at
net/9p/client.c on kernel side, you'll notice that it basically does this ATM

    kmalloc(msize);

for every 9p request. So not only does it allocate much more memory for every
request than actually required (i.e. if 9pfs was mounted with msize=8M, a 9p
request that would actually just need 1k would nevertheless allocate 8M), but
it also allocates more than PAGE_SIZE, which obviously may fail at any time.

With those kernel patches above, and QEMU patched with this series as well, I
can go above 4M msize now, and the test system runs stably if 9pfs is mounted
with an msize that is not "too high". If I try to mount 9pfs with a very high
msize, the kmalloc() issue described above kicks in and causes an immediate
kernel oops when mounting. So that's a high-priority issue that I still need
to resolve.

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Oct. 7, 2021, 3:42 p.m. UTC | #7
On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > > x1-240006
> > Hi Christian,
> > I took a quick look at the code:
> > 
> > - The Linux 9p driver restricts descriptor chains to 128 elements
> >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> 
> Yes, that's the limitation that I am about to remove (WIP); current kernel 
> patches:
> https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

I haven't read the patches yet but I'm concerned that today the driver
is pretty well-behaved and this new patch series introduces a spec
violation. Not fixing existing spec violations is okay, but adding new
ones is a red flag. I think we need to figure out a clean solution.

> > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> >   with EINVAL when called with more than IOV_MAX iovecs
> >   (hw/9pfs/9p.c:v9fs_read())
> 
> Hmm, which makes me wonder why I never encountered this error during testing.
> 
> Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> that v9fs_read() call would translate for most people to this implementation 
> on QEMU side (hw/9p/9p-local.c):
> 
> static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
>                             const struct iovec *iov,
>                             int iovcnt, off_t offset)
> {
> #ifdef CONFIG_PREADV
>     return preadv(fs->fd, iov, iovcnt, offset);
> #else
>     int err = lseek(fs->fd, offset, SEEK_SET);
>     if (err == -1) {
>         return err;
>     } else {
>         return readv(fs->fd, iov, iovcnt);
>     }
> #endif
> }
> 
> > Unless I misunderstood the code, neither side can take advantage of the
> > new 32k descriptor chain limit?
> > 
> > Thanks,
> > Stefan
> 
> I need to check that when I have some more time. One possible explanation 
> might be that preadv() already has this wrapped into a loop in its 
> implementation to circumvent a limit like IOV_MAX. It might be another "it 
> works, but not portable" issue, but not sure.
>
> There are still a bunch of other issues I have to resolve. If you look at
> net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> 
>     kmalloc(msize);
> 
> for every 9p request. So not only does it allocate much more memory for every 
> request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> but also it allocates > PAGE_SIZE, which obviously may fail at any time.

The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.

I saw zerocopy code in the 9p guest driver but didn't investigate when
it's used. Maybe that should be used for large requests (file
reads/writes)? virtio-blk/scsi don't memcpy data into a new buffer, they
directly access page cache or O_DIRECT pinned pages.

Stefan
Greg Kurz Oct. 8, 2021, 7:25 a.m. UTC | #8
On Thu, 7 Oct 2021 16:42:49 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to the
> > > > virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#
> > > > x1-240006
> > > Hi Christian,
> > > I took a quick look at the code:
> > > 


Hi,

Thanks Stefan for sharing virtio expertise and helping Christian!

> > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > 
> > Yes, that's the limitation that I am about to remove (WIP); current kernel 
> > patches:
> > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> 
> I haven't read the patches yet but I'm concerned that today the driver
> is pretty well-behaved and this new patch series introduces a spec
> violation. Not fixing existing spec violations is okay, but adding new
> ones is a red flag. I think we need to figure out a clean solution.
> 
> > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > >   with EINVAL when called with more than IOV_MAX iovecs
> > >   (hw/9pfs/9p.c:v9fs_read())
> > 
> > Hmm, which makes me wonder why I never encountered this error during testing.
> > 
> > Most people will use the 9p qemu 'local' fs driver backend in practice, so 
> > that v9fs_read() call would translate for most people to this implementation 
> > on QEMU side (hw/9p/9p-local.c):
> > 
> > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> >                             const struct iovec *iov,
> >                             int iovcnt, off_t offset)
> > {
> > #ifdef CONFIG_PREADV
> >     return preadv(fs->fd, iov, iovcnt, offset);
> > #else
> >     int err = lseek(fs->fd, offset, SEEK_SET);
> >     if (err == -1) {
> >         return err;
> >     } else {
> >         return readv(fs->fd, iov, iovcnt);
> >     }
> > #endif
> > }
> > 
> > > Unless I misunderstood the code, neither side can take advantage of the
> > > new 32k descriptor chain limit?
> > > 
> > > Thanks,
> > > Stefan
> > 
> > I need to check that when I have some more time. One possible explanation 
> > might be that preadv() already has this wrapped into a loop in its 
> > implementation to circumvent a limit like IOV_MAX. It might be another "it 
> > works, but not portable" issue, but not sure.
> >
> > There are still a bunch of other issues I have to resolve. If you look at
> > net/9p/client.c on kernel side, you'll notice that it basically does this ATM
> > 
> >     kmalloc(msize);
> > 

Note that this is done twice: once for the T message (client request) and once
for the R message (server answer). The 9p driver could adjust the size of the
T message to what's really needed instead of allocating the full msize. The R
message size is not known in advance, though.

> > for every 9p request. So not only does it allocate much more memory for every 
> > request than actually required (i.e. say 9pfs was mounted with msize=8M, then 
> > a 9p request that actually would just need 1k would nevertheless allocate 8M), 
> > but also it allocates > PAGE_SIZE, which obviously may fail at any time.
> 
> The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> 
> I saw zerocopy code in the 9p guest driver but didn't investigate when
> it's used. Maybe that should be used for large requests (file
> reads/writes)?

This is the case already: zero-copy is only used for reads/writes/readdir
if the requested size is 1k or more.

Also you'll note that in this case, the 9p driver doesn't allocate msize
for the T/R messages but only 4k, which is largely enough to hold the
header.

	/*
	 * We allocate a inline protocol data of only 4k bytes.
	 * The actual content is passed in zero-copy fashion.
	 */
	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);

and

/* size of header for zero copy read/write */
#define P9_ZC_HDR_SZ 4096

A huge msize only makes sense for Twrite, Rread and Rreaddir because
of the amount of data they convey. All other messages certainly fit
in a couple of kilobytes only (sorry, don't remember the numbers).

A first change should be to allocate MIN(XXX, msize) for the
regular non-zc case, where XXX could be a reasonable fixed
value (8k?). In the case of T messages, it is even possible
to adjust the size to what's exactly needed, ala snprintf(NULL).
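
A minimal sketch of that idea might look like this (purely illustrative and
untested, not from any posted patch; the function name and the 8k cap are
invented here, the p9_fcall fields are the ones net/9p/client.c already uses):

#include <linux/minmax.h>
#include <linux/slab.h>
#include <net/9p/client.h>

/* assumed "reasonable fixed value" for all non-zero-copy messages */
#define P9_NON_ZC_SZ 8192

/* 'c' is kept only to mirror p9_fcall_init()'s signature */
static int p9_fcall_init_capped(struct p9_client *c, struct p9_fcall *fc,
                                int alloc_msize)
{
    /* Twrite/Rread/Rreaddir payloads go zero-copy anyway; every other
     * message type fits into a few kilobytes, so don't allocate the
     * full msize for them.
     */
    alloc_msize = min(alloc_msize, P9_NON_ZC_SZ);

    fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
    if (!fc->sdata)
        return -ENOMEM;
    fc->cache = NULL;
    fc->capacity = alloc_msize;
    return 0;
}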

> virtio-blk/scsi don't memcpy data into a new buffer, they
> directly access page cache or O_DIRECT pinned pages.
> 
> Stefan

Cheers,

--
Greg
Christian Schoenebeck Oct. 8, 2021, 2:24 p.m. UTC | #9
On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> On Thu, 7 Oct 2021 16:42:49 +0100
> 
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > > the
> > > > > virtio specs:
> > > > > 
> > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01
> > > > > .html#
> > > > > x1-240006
> > > > 
> > > > Hi Christian,
> 
> > > > I took a quick look at the code:
> Hi,
> 
> Thanks Stefan for sharing virtio expertise and helping Christian !
> 
> > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > 
> > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > 
> > > Yes, that's the limitation that I am about to remove (WIP); current
> > > kernel
> > > patches:
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.
> > > com/> 
> > I haven't read the patches yet but I'm concerned that today the driver
> > is pretty well-behaved and this new patch series introduces a spec
> > violation. Not fixing existing spec violations is okay, but adding new
> > ones is a red flag. I think we need to figure out a clean solution.

Nobody has reviewed the kernel patches yet. My main concern therefore is
actually that the kernel patches are already too complex, because the current
situation is that only Dominique is handling 9p patches on the kernel side,
and he barely has time for 9p anymore.

Another reason for me to catch up on reading current kernel code and stepping 
in as reviewer of 9p on kernel side ASAP, independent of this issue.

As for the current kernel patches' complexity: I can certainly drop patch 7
entirely, as it is probably just overkill. Patch 4 is then the biggest chunk;
I have to see if I can simplify it, and whether it would make sense to squash
it with patch 3.

> > 
> > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > > > 
> > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > >   (hw/9pfs/9p.c:v9fs_read())
> > > 
> > > Hmm, which makes me wonder why I never encountered this error during
> > > testing.
> > > 
> > > Most people will use the 9p qemu 'local' fs driver backend in practice,
> > > so
> > > that v9fs_read() call would translate for most people to this
> > > implementation on QEMU side (hw/9p/9p-local.c):
> > > 
> > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > 
> > >                             const struct iovec *iov,
> > >                             int iovcnt, off_t offset)
> > > 
> > > {
> > > #ifdef CONFIG_PREADV
> > > 
> > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > 
> > > #else
> > > 
> > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > >     if (err == -1) {
> > >     
> > >         return err;
> > >     
> > >     } else {
> > >     
> > >         return readv(fs->fd, iov, iovcnt);
> > >     
> > >     }
> > > 
> > > #endif
> > > }
> > > 
> > > > Unless I misunderstood the code, neither side can take advantage of
> > > > the
> > > > new 32k descriptor chain limit?
> > > > 
> > > > Thanks,
> > > > Stefan
> > > 
> > > I need to check that when I have some more time. One possible
> > > explanation
> > > might be that preadv() already has this wrapped into a loop in its
> > > implementation to circumvent a limit like IOV_MAX. It might be another
> > > "it
> > > works, but not portable" issue, but not sure.
> > > 
> > > There are still a bunch of other issues I have to resolve. If you look
> > > at
> > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > this ATM> > 
> > >     kmalloc(msize);
> 
> Note that this is done twice : once for the T message (client request) and
> once for the R message (server answer). The 9p driver could adjust the size
> of the T message to what's really needed instead of allocating the full
> msize. R message size is not known though.

Would it make sense to add a second virtio ring, dedicated to server
responses, to solve this? IIRC the 9p server already calculates appropriate
exact sizes for each response type, so the server could just push exactly the
space that is really needed for its responses.

> > > for every 9p request. So not only does it allocate much more memory for
> > > every request than actually required (i.e. say 9pfs was mounted with
> > > msize=8M, then a 9p request that actually would just need 1k would
> > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > obviously may fail at any time.> 
> > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.

Huh, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper as
a quick & dirty test, but it crashed in the same way as kmalloc() with large
msize values, immediately on mounting:

diff --git a/net/9p/client.c b/net/9p/client.c
index a75034fa249b..cfe300a4b6ca 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client 
*clnt)
 static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
                         int alloc_msize)
 {
-       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
+       if (false) {
                fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
                fc->cache = c->fcall_cache;
        } else {
-               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
+               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
                fc->cache = NULL;
        }
-       if (!fc->sdata)
+       if (!fc->sdata) {
+               pr_info("%s !fc->sdata", __func__);
                return -ENOMEM;
+       }
        fc->capacity = alloc_msize;
        return 0;
 }

I'll try to look at this at the weekend; I would have expected this hack to
bypass the issue.

> > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > it's used. Maybe that should be used for large requests (file
> > reads/writes)?
> 
> This is the case already : zero-copy is only used for reads/writes/readdir
> if the requested size is 1k or more.
> 
> Also you'll note that in this case, the 9p driver doesn't allocate msize
> for the T/R messages but only 4k, which is largely enough to hold the
> header.
> 
> 	/*
> 	 * We allocate a inline protocol data of only 4k bytes.
> 	 * The actual content is passed in zero-copy fashion.
> 	 */
> 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> 
> and
> 
> /* size of header for zero copy read/write */
> #define P9_ZC_HDR_SZ 4096
> 
> A huge msize only makes sense for Twrite, Rread and Rreaddir because
> of the amount of data they convey. All other messages certainly fit
> in a couple of kilobytes only (sorry, don't remember the numbers).
> 
> A first change should be to allocate MIN(XXX, msize) for the
> regular non-zc case, where XXX could be a reasonable fixed
> value (8k?). In the case of T messages, it is even possible
> to adjust the size to what's exactly needed, ala snprintf(NULL).

Good idea actually! That would limit this problem to reviewing the 9p specs
and picking one reasonable max value. Because you are right, those message
types are tiny; it's probably not worth piling up new code to calculate exact
message sizes for each one of them.

Adding some safety net would make sense though, to force this value to be
reviewed as well if e.g. a new message type is added in the future, something
like:

static int max_msg_size(int msg_type) {
    switch (msg_type) {
        /* large zero copy messages */
        case Twrite:
        case Tread:
        case Treaddir:
            BUG_ON(true);

        /* small messages */
        case Tversion:
        ....
            return 8 * 1024; /* to be replaced with appropriate max value */
    }
}

That way the compiler would bark on future additions. Then again, a simple
comment on the msg type enum might do just as well.

> > virtio-blk/scsi don't memcpy data into a new buffer, they
> > directly access page cache or O_DIRECT pinned pages.
> > 
> > Stefan
> 
> Cheers,
> 
> --
> Greg
Christian Schoenebeck Oct. 8, 2021, 4:08 p.m. UTC | #10
On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > On Thu, 7 Oct 2021 16:42:49 +0100
> > 
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck 
wrote:
> > > > > > At the moment the maximum transfer size with virtio is limited to
> > > > > > 4M
> > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > to
> > > > > > the
> > > > > > virtio specs:
> > > > > > 
> > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs
> > > > > > 01
> > > > > > .html#
> > > > > > x1-240006
> > > > > 
> > > > > Hi Christian,
> > 
> > > > > I took a quick look at the code:
> > Hi,
> > 
> > Thanks Stefan for sharing virtio expertise and helping Christian !
> > 
> > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > 
> > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > 
> > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > kernel
> > > > patches:
> > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyt
> > > > e.
> > > > com/>
> > > 
> > > I haven't read the patches yet but I'm concerned that today the driver
> > > is pretty well-behaved and this new patch series introduces a spec
> > > violation. Not fixing existing spec violations is okay, but adding new
> > > ones is a red flag. I think we need to figure out a clean solution.
> 
> Nobody has reviewed the kernel patches yet. My main concern therefore
> actually is that the kernel patches are already too complex, because the
> current situation is that only Dominique is handling 9p patches on kernel
> side, and he barely has time for 9p anymore.
> 
> Another reason for me to catch up on reading current kernel code and
> stepping in as reviewer of 9p on kernel side ASAP, independent of this
> issue.
> 
> As for current kernel patches' complexity: I can certainly drop patch 7
> entirely as it is probably just overkill. Patch 4 is then the biggest chunk,
> I have to see if I can simplify it, and whether it would make sense to
> squash with patch 3.
> 
> > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > fail
> > > > > 
> > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > 
> > > > Hmm, which makes me wonder why I never encountered this error during
> > > > testing.
> > > > 
> > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > practice,
> > > > so
> > > > that v9fs_read() call would translate for most people to this
> > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > 
> > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > 
> > > >                             const struct iovec *iov,
> > > >                             int iovcnt, off_t offset)
> > > > 
> > > > {
> > > > #ifdef CONFIG_PREADV
> > > > 
> > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > 
> > > > #else
> > > > 
> > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > >     if (err == -1) {
> > > >     
> > > >         return err;
> > > >     
> > > >     } else {
> > > >     
> > > >         return readv(fs->fd, iov, iovcnt);
> > > >     
> > > >     }
> > > > 
> > > > #endif
> > > > }
> > > > 
> > > > > Unless I misunderstood the code, neither side can take advantage of
> > > > > the
> > > > > new 32k descriptor chain limit?
> > > > > 
> > > > > Thanks,
> > > > > Stefan
> > > > 
> > > > I need to check that when I have some more time. One possible
> > > > explanation
> > > > might be that preadv() already has this wrapped into a loop in its
> > > > implementation to circumvent a limit like IOV_MAX. It might be another
> > > > "it
> > > > works, but not portable" issue, but not sure.
> > > > 
> > > > There are still a bunch of other issues I have to resolve. If you look
> > > > at
> > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > this ATM> >
> > > > 
> > > >     kmalloc(msize);
> > 
> > Note that this is done twice : once for the T message (client request) and
> > once for the R message (server answer). The 9p driver could adjust the
> > size
> > of the T message to what's really needed instead of allocating the full
> > msize. R message size is not known though.
> 
> Would it make sense adding a second virtio ring, dedicated to server
> responses to solve this? IIRC 9p server already calculates appropriate
> exact sizes for each response type. So server could just push space that's
> really needed for its responses.
> 
> > > > for every 9p request. So not only does it allocate much more memory
> > > > for
> > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > obviously may fail at any time.>
> > > 
> > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> 
> Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper as
> a quick & dirty test, but it crashed in the same way as kmalloc() with
> large msize values immediately on mounting:
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index a75034fa249b..cfe300a4b6ca 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> *clnt)
>  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
>                          int alloc_msize)
>  {
> -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> +       if (false) {
>                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
>                 fc->cache = c->fcall_cache;
>         } else {
> -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);

Ok, GFP_NOFS -> GFP_KERNEL did the trick.

Now I get:

   virtio: bogus descriptor or out of resources

So, still some work ahead on both ends.

>                 fc->cache = NULL;
>         }
> -       if (!fc->sdata)
> +       if (!fc->sdata) {
> +               pr_info("%s !fc->sdata", __func__);
>                 return -ENOMEM;
> +       }
>         fc->capacity = alloc_msize;
>         return 0;
>  }
> 
> I try to look at this at the weekend, I would have expected this hack to
> bypass this issue.
> 
> > > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > > it's used. Maybe that should be used for large requests (file
> > > reads/writes)?
> > 
> > This is the case already : zero-copy is only used for reads/writes/readdir
> > if the requested size is 1k or more.
> > 
> > Also you'll note that in this case, the 9p driver doesn't allocate msize
> > for the T/R messages but only 4k, which is largely enough to hold the
> > header.
> > 
> > 	/*
> > 	
> > 	 * We allocate a inline protocol data of only 4k bytes.
> > 	 * The actual content is passed in zero-copy fashion.
> > 	 */
> > 	
> > 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> > 
> > and
> > 
> > /* size of header for zero copy read/write */
> > #define P9_ZC_HDR_SZ 4096
> > 
> > A huge msize only makes sense for Twrite, Rread and Rreaddir because
> > of the amount of data they convey. All other messages certainly fit
> > in a couple of kilobytes only (sorry, don't remember the numbers).
> > 
> > A first change should be to allocate MIN(XXX, msize) for the
> > regular non-zc case, where XXX could be a reasonable fixed
> > value (8k?). In the case of T messages, it is even possible
> > to adjust the size to what's exactly needed, ala snprintf(NULL).
> 
> Good idea actually! That would limit this problem to reviewing the 9p specs
> and picking one reasonable max value. Because you are right, those message
> types are tiny. Probably not worth to pile up new code to calculate exact
> message sizes for each one of them.
> 
> Adding some safety net would make sense though, to force e.g. if a new
> message type is added in future, that this value would be reviewed as well,
> something like:
> 
> static int max_msg_size(int msg_type) {
>     switch (msg_type) {
>         /* large zero copy messages */
>         case Twrite:
>         case Tread:
>         case Treaddir:
>             BUG_ON(true);
> 
>         /* small messages */
>         case Tversion:
>         ....
>             return 8k; /* to be replaced with appropriate max value */
>     }
> }
> 
> That way the compiler would bark on future additions. But on doubt, a simple
> comment on msg type enum might do as well though.
> 
> > > virtio-blk/scsi don't memcpy data into a new buffer, they
> > > directly access page cache or O_DIRECT pinned pages.
> > > 
> > > Stefan
> > 
> > Cheers,
> > 
> > --
> > Greg
Christian Schoenebeck Oct. 21, 2021, 3:39 p.m. UTC | #11
On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > 
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck
> 
> wrote:
> > > > > > > At the moment the maximum transfer size with virtio is limited
> > > > > > > to
> > > > > > > 4M
> > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > > to
> > > > > > > the
> > > > > > > virtio specs:
> > > > > > > 
> > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-
> > > > > > > cs
> > > > > > > 01
> > > > > > > .html#
> > > > > > > x1-240006
> > > > > > 
> > > > > > Hi Christian,
> > > 
> > > > > > I took a quick look at the code:
> > > Hi,
> > > 
> > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > 
> > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > 
> > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > 
> > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > kernel
> > > > > patches:
> > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudeb
> > > > > yt
> > > > > e.
> > > > > com/>
> > > > 
> > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > is pretty well-behaved and this new patch series introduces a spec
> > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > ones is a red flag. I think we need to figure out a clean solution.
> > 
> > Nobody has reviewed the kernel patches yet. My main concern therefore
> > actually is that the kernel patches are already too complex, because the
> > current situation is that only Dominique is handling 9p patches on kernel
> > side, and he barely has time for 9p anymore.
> > 
> > Another reason for me to catch up on reading current kernel code and
> > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > issue.
> > 
> > As for current kernel patches' complexity: I can certainly drop patch 7
> > entirely as it is probably just overkill. Patch 4 is then the biggest
> > chunk, I have to see if I can simplify it, and whether it would make
> > sense to squash with patch 3.
> > 
> > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > > fail
> > > > > > 
> > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > 
> > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > testing.
> > > > > 
> > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > practice,
> > > > > so
> > > > > that v9fs_read() call would translate for most people to this
> > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > 
> > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > 
> > > > >                             const struct iovec *iov,
> > > > >                             int iovcnt, off_t offset)
> > > > > 
> > > > > {
> > > > > #ifdef CONFIG_PREADV
> > > > > 
> > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > 
> > > > > #else
> > > > > 
> > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > >     if (err == -1) {
> > > > >     
> > > > >         return err;
> > > > >     
> > > > >     } else {
> > > > >     
> > > > >         return readv(fs->fd, iov, iovcnt);
> > > > >     
> > > > >     }
> > > > > 
> > > > > #endif
> > > > > }
> > > > > 
> > > > > > Unless I misunderstood the code, neither side can take advantage
> > > > > > of
> > > > > > the
> > > > > > new 32k descriptor chain limit?
> > > > > > 
> > > > > > Thanks,
> > > > > > Stefan
> > > > > 
> > > > > I need to check that when I have some more time. One possible
> > > > > explanation
> > > > > might be that preadv() already has this wrapped into a loop in its
> > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > another
> > > > > "it
> > > > > works, but not portable" issue, but not sure.
> > > > > 
> > > > > There are still a bunch of other issues I have to resolve. If you
> > > > > look
> > > > > at
> > > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > this ATM> >
> > > > > 
> > > > >     kmalloc(msize);
> > > 
> > > Note that this is done twice : once for the T message (client request)
> > > and
> > > once for the R message (server answer). The 9p driver could adjust the
> > > size
> > > of the T message to what's really needed instead of allocating the full
> > > msize. R message size is not known though.
> > 
> > Would it make sense adding a second virtio ring, dedicated to server
> > responses to solve this? IIRC 9p server already calculates appropriate
> > exact sizes for each response type. So server could just push space that's
> > really needed for its responses.
> > 
> > > > > for every 9p request. So not only does it allocate much more memory
> > > > > for
> > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > obviously may fail at any time.>
> > > > 
> > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > situation.
> > 
> > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > large msize values immediately on mounting:
> > 
> > diff --git a/net/9p/client.c b/net/9p/client.c
> > index a75034fa249b..cfe300a4b6ca 100644
> > --- a/net/9p/client.c
> > +++ b/net/9p/client.c
> > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> > *clnt)
> > 
> >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> >  
> >                          int alloc_msize)
> >  
> >  {
> > 
> > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > +       if (false) {
> > 
> >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> >                 fc->cache = c->fcall_cache;
> >         
> >         } else {
> > 
> > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> 
> Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> 
> Now I get:
> 
>    virtio: bogus descriptor or out of resources
> 
> So, still some work ahead on both ends.

A few hacks later (only changes on the 9p client side) I got this running
stably now. The reason for the virtio error above was that kvmalloc() returns
a non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
inaccessible from the host side, hence that "bogus descriptor" message from
QEMU. So I had to split those linear 9p client buffers into sparse ones (sets
of individual pages).

I tested this for some days with various virtio transmission sizes and it
works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
write space per virtio round trip message).

I did not encounter a showstopper for large virtio transmission sizes
(4 MB ... 128 MB) at the virtio level, neither as a result of testing nor
after reviewing the existing code.

About IOV_MAX: that's apparently not an issue at the virtio level. Most of the
iovec code, both on the Linux kernel side and on the QEMU side, does not have
this limitation. It is, however, apparently indeed a limitation for userland
apps calling the Linux kernel's syscalls.

Stefan, as it stands now, I am even more convinced that the upper virtio
transmission size limit should not be squeezed into the queue size argument of
virtio_add_queue(). Not because of the previous argument that it would waste
space (~1MB), but rather because they are two different things. To outline
this, just a quick recap of what happens exactly when a bulk message is pushed
over the virtio wire (assuming virtio "split" layout here):

---------- [recap-start] ----------

For each bulk message sent guest <-> host, exactly *one* of the pre-allocated
descriptors is taken and placed (subsequently) into exactly *one* position of
the two available/used ring buffers. The actual descriptor table though,
containing all the DMA addresses of the message bulk data, is allocated just
in time for each round trip message. Say it is the first message sent; it
yields the following structure:

Ring Buffer   Descriptor Table      Bulk Data Pages

   +-+              +-+           +-----------------+
   |D|------------->|d|---------->| Bulk data block |
   +-+              |d|--------+  +-----------------+
   | |              |d|------+ |
   +-+               .       | |  +-----------------+
   | |               .       | +->| Bulk data block |
    .                .       |    +-----------------+
    .               |d|-+    |
    .               +-+ |    |    +-----------------+
   | |                  |    +--->| Bulk data block |
   +-+                  |         +-----------------+
   | |                  |                 .
   +-+                  |                 .
                        |                 .
                        |         +-----------------+
                        +-------->| Bulk data block |
                                  +-----------------+
Legend:
D: pre-allocated descriptor
d: just in time allocated descriptor
-->: memory pointer (DMA)

The bulk data blocks are allocated by the respective device driver above
virtio subsystem level (guest side).

There are exactly as many descriptors pre-allocated (D) as the size of a ring
buffer.

A "descriptor" is more or less just a chainable DMA memory pointer; defined
as:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
	/* Address (guest-physical). */
	__virtio64 addr;
	/* Length. */
	__virtio32 len;
	/* The flags as indicated above. */
	__virtio16 flags;
	/* We chain unused descriptors via this, too */
	__virtio16 next;
};

There are 2 ring buffers: the "available" ring buffer is for sending a message
guest->host (transmitting the DMA addresses of guest-allocated bulk data
blocks that carry the data sent to the device, plus separate guest-allocated
bulk data blocks that the host side will use to place its response bulk data),
and the "used" ring buffer is for sending host->guest, to let the guest know
about the host's response so that it can safely consume and then deallocate
the bulk data blocks.
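
For reference, these two rings are the standard "split" virtqueue structures;
I'm quoting include/uapi/linux/virtio_ring.h from memory here, so double check
against the actual header:

struct vring_avail {
	__virtio16 flags;
	__virtio16 idx;
	__virtio16 ring[];
};

/* id: index of the start of the used descriptor chain,
 * len: total number of bytes written to that chain's buffers
 */
struct vring_used_elem {
	__virtio32 id;
	__virtio32 len;
};

struct vring_used {
	__virtio16 flags;
	__virtio16 idx;
	struct vring_used_elem ring[];
};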

---------- [recap-end] ----------

So the "queue size" actually defines the ringbuffer size. It does not define
the maximum amount of descriptors. The "queue size" rather defines how many
pending messages can be pushed into either one ringbuffer before the other
side would need to wait until the counter side would step up (i.e. ring buffer
full).

The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is) OTOH
defines the max. bulk data size that could be transmitted with each virtio
round trip message.

And in fact, 9p currently treats the virtio "queue size" as directly
associated with the maximum number of active 9p requests the server can
handle simultaneously:

  hw/9pfs/9p.h:#define MAX_REQ         128
  hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
  hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
                                 handle_9p_output);

So if I changed it like this, just for the purpose of increasing the maximum
virtio transmission size:

--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
     v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
                 VIRTQUEUE_MAX_SIZE);
-    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
+    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
 }
 
Then it would require additional synchronization code on both ends and
therefore unnecessary complexity, because it would now be possible to push
more requests into the ring buffer than the server could handle.

There is one potential issue though that probably justified the "don't exceed
the queue size" rule:

ATM the descriptor table is allocated (just in time) as *one* contiguous
buffer via kmalloc_array():
https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
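
The allocation in question is essentially this (quoted from memory, see the
link above for the exact code):

  desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);

with each struct vring_desc being 16 bytes, as shown in the recap above.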

So assuming a transmission size of 2 * 128 MB, that kmalloc_array() call would
amount to a kmalloc(1M), and the latter might fail if the guest has highly
fragmented physical memory. For that kind of error case there is currently a
fallback path in virtqueue_add_split() that would then use the required number
of pre-allocated descriptors instead:

That fallback recovery path would no longer be viable if the queue size were
exceeded. There would be alternatives though, e.g. allowing indirect
descriptor tables to be chained (currently prohibited by the virtio specs).

Best regards,
Christian Schoenebeck

> 
> >                 fc->cache = NULL;
> >         
> >         }
> > 
> > -       if (!fc->sdata)
> > +       if (!fc->sdata) {
> > +               pr_info("%s !fc->sdata", __func__);
> > 
> >                 return -ENOMEM;
> > 
> > +       }
> > 
> >         fc->capacity = alloc_msize;
> >         return 0;
> >  
> >  }
> > 
> > I try to look at this at the weekend, I would have expected this hack to
> > bypass this issue.
> > 
> > > > I saw zerocopy code in the 9p guest driver but didn't investigate when
> > > > it's used. Maybe that should be used for large requests (file
> > > > reads/writes)?
> > > 
> > > This is the case already : zero-copy is only used for
> > > reads/writes/readdir
> > > if the requested size is 1k or more.
> > > 
> > > Also you'll note that in this case, the 9p driver doesn't allocate msize
> > > for the T/R messages but only 4k, which is largely enough to hold the
> > > header.
> > > 
> > > 	/*
> > > 	
> > > 	 * We allocate a inline protocol data of only 4k bytes.
> > > 	 * The actual content is passed in zero-copy fashion.
> > > 	 */
> > > 	
> > > 	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
> > > 
> > > and
> > > 
> > > /* size of header for zero copy read/write */
> > > #define P9_ZC_HDR_SZ 4096
> > > 
> > > A huge msize only makes sense for Twrite, Rread and Rreaddir because
> > > of the amount of data they convey. All other messages certainly fit
> > > in a couple of kilobytes only (sorry, don't remember the numbers).
> > > 
> > > A first change should be to allocate MIN(XXX, msize) for the
> > > regular non-zc case, where XXX could be a reasonable fixed
> > > value (8k?). In the case of T messages, it is even possible
> > > to adjust the size to what's exactly needed, ala snprintf(NULL).
> > 
> > Good idea actually! That would limit this problem to reviewing the 9p
> > specs
> > and picking one reasonable max value. Because you are right, those message
> > types are tiny. Probably not worth to pile up new code to calculate exact
> > message sizes for each one of them.
> > 
> > Adding some safety net would make sense though, to force e.g. if a new
> > message type is added in future, that this value would be reviewed as
> > well,
> > something like:
> > 
> > static int max_msg_size(int msg_type) {
> > 
> >     switch (msg_type) {
> >     
> >         /* large zero copy messages */
> >         case Twrite:
> >         case Tread:
> >         
> >         case Treaddir:
> >             BUG_ON(true);
> >         
> >         /* small messages */
> >         case Tversion:
> >         ....
> >         
> >             return 8k; /* to be replaced with appropriate max value */
> >     
> >     }
> > 
> > }
> > 
> > That way the compiler would bark on future additions. But on doubt, a
> > simple comment on msg type enum might do as well though.
> > 
> > > > virtio-blk/scsi don't memcpy data into a new buffer, they
> > > > directly access page cache or O_DIRECT pinned pages.
> > > > 
> > > > Stefan
> > > 
> > > Cheers,
> > > 
> > > --
> > > Greg
Stefan Hajnoczi Oct. 25, 2021, 10:30 a.m. UTC | #12
On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > 
> > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > At the moment the maximum transfer size with virtio is limited
> > > > > > > > to
> > > > > > > > 4M
> > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > > theoretical possible transfer size of 128M (32k pages) according
> > > > > > > > to
> > > > > > > > the
> > > > > > > > virtio specs:
> > > > > > > > 
> > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-
> > > > > > > > cs
> > > > > > > > 01
> > > > > > > > .html#
> > > > > > > > x1-240006
> > > > > > > 
> > > > > > > Hi Christian,
> > > > 
> > > > > > > I took a quick look at the code:
> > > > Hi,
> > > > 
> > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > 
> > > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > > 
> > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > 
> > > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > > kernel
> > > > > > patches:
> > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudeb
> > > > > > yt
> > > > > > e.
> > > > > > com/>
> > > > > 
> > > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > > ones is a red flag. I think we need to figure out a clean solution.
> > > 
> > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > actually is that the kernel patches are already too complex, because the
> > > current situation is that only Dominique is handling 9p patches on kernel
> > > side, and he barely has time for 9p anymore.
> > > 
> > > Another reason for me to catch up on reading current kernel code and
> > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > issue.
> > > 
> > > As for current kernel patches' complexity: I can certainly drop patch 7
> > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > chunk, I have to see if I can simplify it, and whether it would make
> > > sense to squash with patch 3.
> > > 
> > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will
> > > > > > > fail
> > > > > > > 
> > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > 
> > > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > > testing.
> > > > > > 
> > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > practice,
> > > > > > so
> > > > > > that v9fs_read() call would translate for most people to this
> > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > 
> > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > > 
> > > > > >                             const struct iovec *iov,
> > > > > >                             int iovcnt, off_t offset)
> > > > > > 
> > > > > > {
> > > > > > #ifdef CONFIG_PREADV
> > > > > > 
> > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > 
> > > > > > #else
> > > > > > 
> > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > >     if (err == -1) {
> > > > > >     
> > > > > >         return err;
> > > > > >     
> > > > > >     } else {
> > > > > >     
> > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > >     
> > > > > >     }
> > > > > > 
> > > > > > #endif
> > > > > > }
> > > > > > 
> > > > > > > Unless I misunderstood the code, neither side can take advantage
> > > > > > > of
> > > > > > > the
> > > > > > > new 32k descriptor chain limit?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Stefan
> > > > > > 
> > > > > > I need to check that when I have some more time. One possible
> > > > > > explanation
> > > > > > might be that preadv() already has this wrapped into a loop in its
> > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > another
> > > > > > "it
> > > > > > works, but not portable" issue, but not sure.
> > > > > > 
> > > > > > There are still a bunch of other issues I have to resolve. If you
> > > > > > look
> > > > > > at
> > > > > > net/9p/client.c on kernel side, you'll notice that it basically does
> > > > > > this ATM> >
> > > > > > 
> > > > > >     kmalloc(msize);
> > > > 
> > > > Note that this is done twice : once for the T message (client request)
> > > > and
> > > > once for the R message (server answer). The 9p driver could adjust the
> > > > size
> > > > of the T message to what's really needed instead of allocating the full
> > > > msize. R message size is not known though.
> > > 
> > > Would it make sense adding a second virtio ring, dedicated to server
> > > responses to solve this? IIRC 9p server already calculates appropriate
> > > exact sizes for each response type. So server could just push space that's
> > > really needed for its responses.
> > > 
> > > > > > for every 9p request. So not only does it allocate much more memory
> > > > > > for
> > > > > > every request than actually required (i.e. say 9pfs was mounted with
> > > > > > msize=8M, then a 9p request that actually would just need 1k would
> > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE, which
> > > > > > obviously may fail at any time.>
> > > > > 
> > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > situation.
> > > 
> > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc() wrapper
> > > as a quick & dirty test, but it crashed in the same way as kmalloc() with
> > > large msize values immediately on mounting:
> > > 
> > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > index a75034fa249b..cfe300a4b6ca 100644
> > > --- a/net/9p/client.c
> > > +++ b/net/9p/client.c
> > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client
> > > *clnt)
> > > 
> > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > >  
> > >                          int alloc_msize)
> > >  
> > >  {
> > > 
> > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > +       if (false) {
> > > 
> > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > >                 fc->cache = c->fcall_cache;
> > >         
> > >         } else {
> > > 
> > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > 
> > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > 
> > Now I get:
> > 
> >    virtio: bogus descriptor or out of resources
> > 
> > So, still some work ahead on both ends.
> 
> Few hacks later (only changes on 9p client side) I got this running stable
> now. The reason for the virtio error above was that kvmalloc() returns a
> non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> inaccessible from host side, hence that "bogus descriptor" message by QEMU.
> So I had to split those linear 9p client buffers into sparse ones (set of
> individual pages).
> 
> I tested this for some days with various virtio transmission sizes and it
> works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> write space per virtio round trip message).
> 
> I did not encounter a show stopper for large virtio transmission sizes
> (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor after
> reviewing the existing code.
> 
> About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> iovec code, both on Linux kernel side and on QEMU side do not have this
> limitation. It is apparently however indeed a limitation for userland apps
> calling the Linux kernel's syscalls yet.
> 
> Stefan, as it stands now, I am even more convinced that the upper virtio
> transmission size limit should not be squeezed into the queue size argument of
> virtio_add_queue(). Not because of the previous argument that it would waste
> space (~1MB), but rather because they are two different things. To outline
> this, just a quick recap of what happens exactly when a bulk message is pushed
> over the virtio wire (assuming virtio "split" layout here):
> 
> ---------- [recap-start] ----------
> 
> For each bulk message sent guest <-> host, exactly *one* of the pre-allocated
> descriptors is taken and placed (subsequently) into exactly *one* position of
> the two available/used ring buffers. The actual descriptor table though,
> containing all the DMA addresses of the message bulk data, is allocated just
> in time for each round trip message. Say, it is the first message sent, it
> yields in the following structure:
> 
> Ring Buffer   Descriptor Table      Bulk Data Pages
> 
>    +-+              +-+           +-----------------+
>    |D|------------->|d|---------->| Bulk data block |
>    +-+              |d|--------+  +-----------------+
>    | |              |d|------+ |
>    +-+               .       | |  +-----------------+
>    | |               .       | +->| Bulk data block |
>     .                .       |    +-----------------+
>     .               |d|-+    |
>     .               +-+ |    |    +-----------------+
>    | |                  |    +--->| Bulk data block |
>    +-+                  |         +-----------------+
>    | |                  |                 .
>    +-+                  |                 .
>                         |                 .
>                         |         +-----------------+
>                         +-------->| Bulk data block |
>                                   +-----------------+
> Legend:
> D: pre-allocated descriptor
> d: just in time allocated descriptor
> -->: memory pointer (DMA)
> 
> The bulk data blocks are allocated by the respective device driver above
> virtio subsystem level (guest side).
> 
> There are exactly as many descriptors pre-allocated (D) as the size of a ring
> buffer.
> 
> A "descriptor" is more or less just a chainable DMA memory pointer; defined
> as:
> 
> /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> struct vring_desc {
> 	/* Address (guest-physical). */
> 	__virtio64 addr;
> 	/* Length. */
> 	__virtio32 len;
> 	/* The flags as indicated above. */
> 	__virtio16 flags;
> 	/* We chain unused descriptors via this, too */
> 	__virtio16 next;
> };
> 
> There are 2 ring buffers; the "available" ring buffer is for sending a message
> guest->host (which will transmit DMA addresses of guest allocated bulk data
> blocks that are used for data sent to device, and separate guest allocated
> bulk data blocks that will be used by host side to place its response bulk
> data), and the "used" ring buffer is for sending host->guest to let guest know
> about host's response and that it could now safely consume and then deallocate
> the bulk data blocks subsequently.
> 
> ---------- [recap-end] ----------
> 
> So the "queue size" actually defines the ringbuffer size. It does not define
> the maximum amount of descriptors. The "queue size" rather defines how many
> pending messages can be pushed into either one ringbuffer before the other
> side would need to wait until the counter side would step up (i.e. ring buffer
> full).
> 
> The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is) OTOH
> defines the max. bulk data size that could be transmitted with each virtio
> round trip message.
> 
> And in fact, 9p currently handles the virtio "queue size" as directly
> associative with its maximum amount of active 9p requests the server could
> handle simultaniously:
> 
>   hw/9pfs/9p.h:#define MAX_REQ         128
>   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
>   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
>                                  handle_9p_output);
> 
> So if I would change it like this, just for the purpose to increase the max.
> virtio transmission size:
> 
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
>                  VIRTQUEUE_MAX_SIZE);
> -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
>  }
>  
> Then it would require additional synchronization code on both ends and
> therefore unnecessary complexity, because it would now be possible that more
> requests are pushed into the ringbuffer than server could handle.
> 
> There is one potential issue though that probably did justify the "don't
> exceed the queue size" rule:
> 
> ATM the descriptor table is allocated (just in time) as *one* continuous
> buffer via kmalloc_array():
> https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> 
> So assuming transmission size of 2 * 128 MB that kmalloc_array() call would
> yield in kmalloc(1M) and the latter might fail if guest had highly fragmented
> physical memory. For such kind of error case there is currently a fallback
> path in virtqueue_add_split() that would then use the required amount of
> pre-allocated descriptors instead:
> https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> 
> That fallback recovery path would no longer be viable if the queue size was
> exceeded. There would be alternatives though, e.g. by allowing to chain
> indirect descriptor tables (currently prohibited by the virtio specs).

Making the maximum number of descriptors independent of the queue size
requires a change to the VIRTIO spec since the two values are currently
explicitly tied together by the spec.

Before doing that, are there benchmark results showing that 1 MB vs 128
MB produces a performance improvement? I'm asking because if performance
with 1 MB is good then you can probably do that without having to change
VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
good performance when it's ultimately implemented on top of disk and
network I/O that have lower size limits.

Stefan
Christian Schoenebeck Oct. 25, 2021, 3:03 p.m. UTC | #13
On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > 
> > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > Schoenebeck
> > > 
> > > wrote:
> > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > limited
> > > > > > > > > to
> > > > > > > > > 4M
> > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > maximum
> > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > according
> > > > > > > > > to
> > > > > > > > > the
> > > > > > > > > virtio specs:
> > > > > > > > > 
> > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v
> > > > > > > > > 1.1-
> > > > > > > > > cs
> > > > > > > > > 01
> > > > > > > > > .html#
> > > > > > > > > x1-240006
> > > > > > > > 
> > > > > > > > Hi Christian,
> > > > > 
> > > > > > > > I took a quick look at the code:
> > > > > Hi,
> > > > > 
> > > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > > 
> > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > elements
> > > > > > > > 
> > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > 
> > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > current
> > > > > > > kernel
> > > > > > > patches:
> > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@cr
> > > > > > > udeb
> > > > > > > yt
> > > > > > > e.
> > > > > > > com/>
> > > > > > 
> > > > > > I haven't read the patches yet but I'm concerned that today the
> > > > > > driver
> > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > violation. Not fixing existing spec violations is okay, but adding
> > > > > > new
> > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > solution.
> > > > 
> > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > actually is that the kernel patches are already too complex, because
> > > > the
> > > > current situation is that only Dominique is handling 9p patches on
> > > > kernel
> > > > side, and he barely has time for 9p anymore.
> > > > 
> > > > Another reason for me to catch up on reading current kernel code and
> > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > issue.
> > > > 
> > > > As for current kernel patches' complexity: I can certainly drop patch
> > > > 7
> > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > sense to squash with patch 3.
> > > > 
> > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and
> > > > > > > > will
> > > > > > > > fail
> > > > > > > > 
> > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > 
> > > > > > > Hmm, which makes me wonder why I never encountered this error
> > > > > > > during
> > > > > > > testing.
> > > > > > > 
> > > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > > practice,
> > > > > > > so
> > > > > > > that v9fs_read() call would translate for most people to this
> > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > 
> > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > *fs,
> > > > > > > 
> > > > > > >                             const struct iovec *iov,
> > > > > > >                             int iovcnt, off_t offset)
> > > > > > > 
> > > > > > > {
> > > > > > > #ifdef CONFIG_PREADV
> > > > > > > 
> > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > 
> > > > > > > #else
> > > > > > > 
> > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > >     if (err == -1) {
> > > > > > >     
> > > > > > >         return err;
> > > > > > >     
> > > > > > >     } else {
> > > > > > >     
> > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > >     
> > > > > > >     }
> > > > > > > 
> > > > > > > #endif
> > > > > > > }
> > > > > > > 
> > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > advantage
> > > > > > > > of
> > > > > > > > the
> > > > > > > > new 32k descriptor chain limit?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Stefan
> > > > > > > 
> > > > > > > I need to check that when I have some more time. One possible
> > > > > > > explanation
> > > > > > > might be that preadv() already has this wrapped into a loop in
> > > > > > > its
> > > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > another
> > > > > > > "it
> > > > > > > works, but not portable" issue, but not sure.
> > > > > > > 
> > > > > > > There are still a bunch of other issues I have to resolve. If
> > > > > > > you
> > > > > > > look
> > > > > > > at
> > > > > > > net/9p/client.c on kernel side, you'll notice that it basically
> > > > > > > does
> > > > > > > this ATM> >
> > > > > > > 
> > > > > > >     kmalloc(msize);
> > > > > 
> > > > > Note that this is done twice : once for the T message (client
> > > > > request)
> > > > > and
> > > > > once for the R message (server answer). The 9p driver could adjust
> > > > > the
> > > > > size
> > > > > of the T message to what's really needed instead of allocating the
> > > > > full
> > > > > msize. R message size is not known though.
> > > > 
> > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > responses to solve this? IIRC 9p server already calculates appropriate
> > > > exact sizes for each response type. So server could just push space
> > > > that's
> > > > really needed for its responses.
> > > > 
> > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > memory
> > > > > > > for
> > > > > > > every request than actually required (i.e. say 9pfs was mounted
> > > > > > > with
> > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > would
> > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > which
> > > > > > > obviously may fail at any time.>
> > > > > > 
> > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > situation.
> > > > 
> > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > wrapper
> > > > as a quick & dirty test, but it crashed in the same way as kmalloc()
> > > > with
> > > > large msize values immediately on mounting:
> > > > 
> > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > --- a/net/9p/client.c
> > > > +++ b/net/9p/client.c
> > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > p9_client
> > > > *clnt)
> > > > 
> > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > >  
> > > >                          int alloc_msize)
> > > >  
> > > >  {
> > > > 
> > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > +       if (false) {
> > > > 
> > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > >                 GFP_NOFS);
> > > >                 fc->cache = c->fcall_cache;
> > > >         
> > > >         } else {
> > > > 
> > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > 
> > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > 
> > > Now I get:
> > >    virtio: bogus descriptor or out of resources
> > > 
> > > So, still some work ahead on both ends.
> > 
> > Few hacks later (only changes on 9p client side) I got this running stable
> > now. The reason for the virtio error above was that kvmalloc() returns a
> > non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> > inaccessible from host side, hence that "bogus descriptor" message by
> > QEMU.
> > So I had to split those linear 9p client buffers into sparse ones (set of
> > individual pages).
> > 
> > I tested this for some days with various virtio transmission sizes and it
> > works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> > write space per virtio round trip message).
> > 
> > I did not encounter a show stopper for large virtio transmission sizes
> > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > after reviewing the existing code.
> > 
> > About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> > iovec code, both on Linux kernel side and on QEMU side do not have this
> > limitation. It is apparently however indeed a limitation for userland apps
> > calling the Linux kernel's syscalls yet.
> > 
> > Stefan, as it stands now, I am even more convinced that the upper virtio
> > transmission size limit should not be squeezed into the queue size
> > argument of virtio_add_queue(). Not because of the previous argument that
> > it would waste space (~1MB), but rather because they are two different
> > things. To outline this, just a quick recap of what happens exactly when
> > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > layout here):
> > 
> > ---------- [recap-start] ----------
> > 
> > For each bulk message sent guest <-> host, exactly *one* of the
> > pre-allocated descriptors is taken and placed (subsequently) into exactly
> > *one* position of the two available/used ring buffers. The actual
> > descriptor table though, containing all the DMA addresses of the message
> > bulk data, is allocated just in time for each round trip message. Say, it
> > is the first message sent, it yields in the following structure:
> > 
> > Ring Buffer   Descriptor Table      Bulk Data Pages
> > 
> >    +-+              +-+           +-----------------+
> >    
> >    |D|------------->|d|---------->| Bulk data block |
> >    
> >    +-+              |d|--------+  +-----------------+
> >    
> >    | |              |d|------+ |
> >    
> >    +-+               .       | |  +-----------------+
> >    
> >    | |               .       | +->| Bulk data block |
> >     
> >     .                .       |    +-----------------+
> >     .               |d|-+    |
> >     .               +-+ |    |    +-----------------+
> >     
> >    | |                  |    +--->| Bulk data block |
> >    
> >    +-+                  |         +-----------------+
> >    
> >    | |                  |                 .
> >    
> >    +-+                  |                 .
> >    
> >                         |                 .
> >                         |         
> >                         |         +-----------------+
> >                         
> >                         +-------->| Bulk data block |
> >                         
> >                                   +-----------------+
> > 
> > Legend:
> > D: pre-allocated descriptor
> > d: just in time allocated descriptor
> > -->: memory pointer (DMA)
> > 
> > The bulk data blocks are allocated by the respective device driver above
> > virtio subsystem level (guest side).
> > 
> > There are exactly as many descriptors pre-allocated (D) as the size of a
> > ring buffer.
> > 
> > A "descriptor" is more or less just a chainable DMA memory pointer;
> > defined
> > as:
> > 
> > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > "next". */ struct vring_desc {
> > 
> > 	/* Address (guest-physical). */
> > 	__virtio64 addr;
> > 	/* Length. */
> > 	__virtio32 len;
> > 	/* The flags as indicated above. */
> > 	__virtio16 flags;
> > 	/* We chain unused descriptors via this, too */
> > 	__virtio16 next;
> > 
> > };
> > 
> > There are 2 ring buffers; the "available" ring buffer is for sending a
> > message guest->host (which will transmit DMA addresses of guest allocated
> > bulk data blocks that are used for data sent to device, and separate
> > guest allocated bulk data blocks that will be used by host side to place
> > its response bulk data), and the "used" ring buffer is for sending
> > host->guest to let guest know about host's response and that it could now
> > safely consume and then deallocate the bulk data blocks subsequently.
> > 
> > ---------- [recap-end] ----------
> > 
> > So the "queue size" actually defines the ringbuffer size. It does not
> > define the maximum amount of descriptors. The "queue size" rather defines
> > how many pending messages can be pushed into either one ringbuffer before
> > the other side would need to wait until the counter side would step up
> > (i.e. ring buffer full).
> > 
> > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > OTOH defines the max. bulk data size that could be transmitted with each
> > virtio round trip message.
> > 
> > And in fact, 9p currently handles the virtio "queue size" as directly
> > associative with its maximum amount of active 9p requests the server could
> > 
> > handle simultaniously:
> >   hw/9pfs/9p.h:#define MAX_REQ         128
> >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> >   
> >                                  handle_9p_output);
> > 
> > So if I would change it like this, just for the purpose to increase the
> > max. virtio transmission size:
> > 
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >      v->config_size = sizeof(struct virtio_9p_config) +
> >      strlen(s->fsconf.tag);
> >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> >      
> >                  VIRTQUEUE_MAX_SIZE);
> > 
> > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > 
> >  }
> > 
> > Then it would require additional synchronization code on both ends and
> > therefore unnecessary complexity, because it would now be possible that
> > more requests are pushed into the ringbuffer than server could handle.
> > 
> > There is one potential issue though that probably did justify the "don't
> > exceed the queue size" rule:
> > 
> > ATM the descriptor table is allocated (just in time) as *one* continuous
> > buffer via kmalloc_array():
> > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > d33a4/drivers/virtio/virtio_ring.c#L440
> > 
> > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > would
> > yield in kmalloc(1M) and the latter might fail if guest had highly
> > fragmented physical memory. For such kind of error case there is
> > currently a fallback path in virtqueue_add_split() that would then use
> > the required amount of pre-allocated descriptors instead:
> > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > d33a4/drivers/virtio/virtio_ring.c#L525
> > 
> > That fallback recovery path would no longer be viable if the queue size
> > was
> > exceeded. There would be alternatives though, e.g. by allowing to chain
> > indirect descriptor tables (currently prohibited by the virtio specs).
> 
> Making the maximum number of descriptors independent of the queue size
> requires a change to the VIRTIO spec since the two values are currently
> explicitly tied together by the spec.

Yes, that's what the virtio specs say. But they don't say why, nor did I hear
a reason in this discussion.

That's why I invested time reviewing the current virtio implementation and
specs, as well as actually testing what happens when that limit is exceeded.
And as I outlined in detail in my previous email, I only found one theoretical
issue, and that one could be addressed.

> Before doing that, are there benchmark results showing that 1 MB vs 128
> MB produces a performance improvement? I'm asking because if performance
> with 1 MB is good then you can probably do that without having to change
> VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> good performance when it's ultimately implemented on top of disk and
> network I/O that have lower size limits.

First some numbers, linear reading a 12 GB file:

msize    average      notes

8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
1 MB     2551 MB/s    this msize would already violate virtio specs
2 MB     2521 MB/s    this msize would already violate virtio specs
4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]

Note that the current 9p Linux client implementation used here has a bunch of
known code simplifications that cost more the larger msize gets, which also
explains the bump at 1 MB vs. 2 MB here. I will address these issues with my
kernel patches soon. The current numbers already suggest, though, that you
would see performance continue to grow above 4 MB msize as well.

I have not even bothered benchmarking my current, heavily hacked kernel for
the 4 MB .. 128 MB range, because I'm using ridiculously expensive hacks that
copy huge buffers between 9p client level (linear buffers, non-logical address
space) and virtio level (sparse buffers, logical address space for DMA)
several times back and forth.

The point of my current hacks was just to find out whether it is feasible
and sane to exceed the current virtio limit, and I think it is.

But again, this is not just about performance. My conclusion as described in
my previous email is that virtio currently squeezes

	"max. simultanious amount of bulk messages"

vs.

	"max. bulk data transmission size per bulk messaage"

into the same configuration parameter, which is IMO inappropriate and hence
splitting them into 2 separate parameters when creating a queue makes sense,
independent of the performance benchmarks.
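
Purely as a hypothetical illustration of that split (virtio_add_queue_ext()
does not exist in QEMU; the name and the extra parameter are made up for this
sketch):

/*
 * Hypothetical sketch only -- not an existing QEMU API. It merely shows the
 * two concerns as separate parameters when creating a virtqueue.
 */
VirtQueue *virtio_add_queue_ext(VirtIODevice *vdev,
                                unsigned int ring_size,        /* max. parallel bulk messages  */
                                unsigned int max_chain_length, /* max. descriptors per message */
                                VirtIOHandleOutput handle_output);

/* 9p could then keep its small request cap while allowing large transfers: */
v->vq = virtio_add_queue_ext(vdev, MAX_REQ, VIRTQUEUE_MAX_SIZE, handle_9p_output);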

[1] https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Oct. 28, 2021, 9 a.m. UTC | #14
On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > 
> > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > limited
> > > > > > > > > > to
> > > > > > > > > > 4M
> > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > maximum
> > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > according
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > virtio specs:
> > > > > > > > > > 
> > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v
> > > > > > > > > > 1.1-
> > > > > > > > > > cs
> > > > > > > > > > 01
> > > > > > > > > > .html#
> > > > > > > > > > x1-240006
> > > > > > > > > 
> > > > > > > > > Hi Christian,
> > > > > > 
> > > > > > > > > I took a quick look at the code:
> > > > > > Hi,
> > > > > > 
> > > > > > Thanks Stefan for sharing virtio expertise and helping Christian !
> > > > > > 
> > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > elements
> > > > > > > > > 
> > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > 
> > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > current
> > > > > > > > kernel
> > > > > > > > patches:
> > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@cr
> > > > > > > > udeb
> > > > > > > > yt
> > > > > > > > e.
> > > > > > > > com/>
> > > > > > > 
> > > > > > > I haven't read the patches yet but I'm concerned that today the
> > > > > > > driver
> > > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > > violation. Not fixing existing spec violations is okay, but adding
> > > > > > > new
> > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > solution.
> > > > > 
> > > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > > actually is that the kernel patches are already too complex, because
> > > > > the
> > > > > current situation is that only Dominique is handling 9p patches on
> > > > > kernel
> > > > > side, and he barely has time for 9p anymore.
> > > > > 
> > > > > Another reason for me to catch up on reading current kernel code and
> > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > > issue.
> > > > > 
> > > > > As for current kernel patches' complexity: I can certainly drop patch
> > > > > 7
> > > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > > sense to squash with patch 3.
> > > > > 
> > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and
> > > > > > > > > will
> > > > > > > > > fail
> > > > > > > > > 
> > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > 
> > > > > > > > Hmm, which makes me wonder why I never encountered this error
> > > > > > > > during
> > > > > > > > testing.
> > > > > > > > 
> > > > > > > > Most people will use the 9p qemu 'local' fs driver backend in
> > > > > > > > practice,
> > > > > > > > so
> > > > > > > > that v9fs_read() call would translate for most people to this
> > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > 
> > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > *fs,
> > > > > > > > 
> > > > > > > >                             const struct iovec *iov,
> > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > 
> > > > > > > > {
> > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > 
> > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > 
> > > > > > > > #else
> > > > > > > > 
> > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > >     if (err == -1) {
> > > > > > > >     
> > > > > > > >         return err;
> > > > > > > >     
> > > > > > > >     } else {
> > > > > > > >     
> > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > >     
> > > > > > > >     }
> > > > > > > > 
> > > > > > > > #endif
> > > > > > > > }
> > > > > > > > 
> > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > advantage
> > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Stefan
> > > > > > > > 
> > > > > > > > I need to check that when I have some more time. One possible
> > > > > > > > explanation
> > > > > > > > might be that preadv() already has this wrapped into a loop in
> > > > > > > > its
> > > > > > > > implementation to circumvent a limit like IOV_MAX. It might be
> > > > > > > > another
> > > > > > > > "it
> > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > 
> > > > > > > > There are still a bunch of other issues I have to resolve. If
> > > > > > > > you
> > > > > > > > look
> > > > > > > > at
> > > > > > > > net/9p/client.c on kernel side, you'll notice that it basically
> > > > > > > > does
> > > > > > > > this ATM> >
> > > > > > > > 
> > > > > > > >     kmalloc(msize);
> > > > > > 
> > > > > > Note that this is done twice : once for the T message (client
> > > > > > request)
> > > > > > and
> > > > > > once for the R message (server answer). The 9p driver could adjust
> > > > > > the
> > > > > > size
> > > > > > of the T message to what's really needed instead of allocating the
> > > > > > full
> > > > > > msize. R message size is not known though.
> > > > > 
> > > > > Would it make sense adding a second virtio ring, dedicated to server
> > > > > responses to solve this? IIRC 9p server already calculates appropriate
> > > > > exact sizes for each response type. So server could just push space
> > > > > that's
> > > > > really needed for its responses.
> > > > > 
> > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > memory
> > > > > > > > for
> > > > > > > > every request than actually required (i.e. say 9pfs was mounted
> > > > > > > > with
> > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > would
> > > > > > > > nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > > which
> > > > > > > > obviously may fail at any time.>
> > > > > > > 
> > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > situation.
> > > > > 
> > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > wrapper
> > > > > as a quick & dirty test, but it crashed in the same way as kmalloc()
> > > > > with
> > > > > large msize values immediately on mounting:
> > > > > 
> > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > --- a/net/9p/client.c
> > > > > +++ b/net/9p/client.c
> > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > p9_client
> > > > > *clnt)
> > > > > 
> > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > > >  
> > > > >                          int alloc_msize)
> > > > >  
> > > > >  {
> > > > > 
> > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > +       if (false) {
> > > > > 
> > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > >                 GFP_NOFS);
> > > > >                 fc->cache = c->fcall_cache;
> > > > >         
> > > > >         } else {
> > > > > 
> > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > 
> > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > 
> > > > Now I get:
> > > >    virtio: bogus descriptor or out of resources
> > > > 
> > > > So, still some work ahead on both ends.
> > > 
> > > Few hacks later (only changes on 9p client side) I got this running stable
> > > now. The reason for the virtio error above was that kvmalloc() returns a
> > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that is
> > > inaccessible from host side, hence that "bogus descriptor" message by
> > > QEMU.
> > > So I had to split those linear 9p client buffers into sparse ones (set of
> > > individual pages).
> > > 
> > > I tested this for some days with various virtio transmission sizes and it
> > > works as expected up to 128 MB (more precisely: 128 MB read space + 128 MB
> > > write space per virtio round trip message).
> > > 
> > > I did not encounter a show stopper for large virtio transmission sizes
> > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > after reviewing the existing code.
> > > 
> > > About IOV_MAX: that's apparently not an issue on virtio level. Most of the
> > > iovec code, both on Linux kernel side and on QEMU side do not have this
> > > limitation. It is apparently however indeed a limitation for userland apps
> > > calling the Linux kernel's syscalls yet.
> > > 
> > > Stefan, as it stands now, I am even more convinced that the upper virtio
> > > transmission size limit should not be squeezed into the queue size
> > > argument of virtio_add_queue(). Not because of the previous argument that
> > > it would waste space (~1MB), but rather because they are two different
> > > things. To outline this, just a quick recap of what happens exactly when
> > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > layout here):
> > > 
> > > ---------- [recap-start] ----------
> > > 
> > > For each bulk message sent guest <-> host, exactly *one* of the
> > > pre-allocated descriptors is taken and placed (subsequently) into exactly
> > > *one* position of the two available/used ring buffers. The actual
> > > descriptor table though, containing all the DMA addresses of the message
> > > bulk data, is allocated just in time for each round trip message. Say, it
> > > is the first message sent, it yields in the following structure:
> > > 
> > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > 
> > >    +-+              +-+           +-----------------+
> > >    
> > >    |D|------------->|d|---------->| Bulk data block |
> > >    
> > >    +-+              |d|--------+  +-----------------+
> > >    
> > >    | |              |d|------+ |
> > >    
> > >    +-+               .       | |  +-----------------+
> > >    
> > >    | |               .       | +->| Bulk data block |
> > >     
> > >     .                .       |    +-----------------+
> > >     .               |d|-+    |
> > >     .               +-+ |    |    +-----------------+
> > >     
> > >    | |                  |    +--->| Bulk data block |
> > >    
> > >    +-+                  |         +-----------------+
> > >    
> > >    | |                  |                 .
> > >    
> > >    +-+                  |                 .
> > >    
> > >                         |                 .
> > >                         |         
> > >                         |         +-----------------+
> > >                         
> > >                         +-------->| Bulk data block |
> > >                         
> > >                                   +-----------------+
> > > 
> > > Legend:
> > > D: pre-allocated descriptor
> > > d: just in time allocated descriptor
> > > -->: memory pointer (DMA)
> > > 
> > > The bulk data blocks are allocated by the respective device driver above
> > > virtio subsystem level (guest side).
> > > 
> > > There are exactly as many descriptors pre-allocated (D) as the size of a
> > > ring buffer.
> > > 
> > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > defined
> > > as:
> > > 
> > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > "next". */ struct vring_desc {
> > > 
> > > 	/* Address (guest-physical). */
> > > 	__virtio64 addr;
> > > 	/* Length. */
> > > 	__virtio32 len;
> > > 	/* The flags as indicated above. */
> > > 	__virtio16 flags;
> > > 	/* We chain unused descriptors via this, too */
> > > 	__virtio16 next;
> > > 
> > > };
> > > 
> > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > message guest->host (which will transmit DMA addresses of guest allocated
> > > bulk data blocks that are used for data sent to device, and separate
> > > guest allocated bulk data blocks that will be used by host side to place
> > > its response bulk data), and the "used" ring buffer is for sending
> > > host->guest to let guest know about host's response and that it could now
> > > safely consume and then deallocate the bulk data blocks subsequently.
> > > 
> > > ---------- [recap-end] ----------
> > > 
> > > So the "queue size" actually defines the ringbuffer size. It does not
> > > define the maximum amount of descriptors. The "queue size" rather defines
> > > how many pending messages can be pushed into either one ringbuffer before
> > > the other side would need to wait until the counter side would step up
> > > (i.e. ring buffer full).
> > > 
> > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > > OTOH defines the max. bulk data size that could be transmitted with each
> > > virtio round trip message.
> > > 
> > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > associative with its maximum amount of active 9p requests the server could
> > > 
> > > handle simultaniously:
> > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > >   
> > >                                  handle_9p_output);
> > > 
> > > So if I would change it like this, just for the purpose to increase the
> > > max. virtio transmission size:
> > > 
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >      v->config_size = sizeof(struct virtio_9p_config) +
> > >      strlen(s->fsconf.tag);
> > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > >      
> > >                  VIRTQUEUE_MAX_SIZE);
> > > 
> > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > 
> > >  }
> > > 
> > > Then it would require additional synchronization code on both ends and
> > > therefore unnecessary complexity, because it would now be possible that
> > > more requests are pushed into the ringbuffer than server could handle.
> > > 
> > > There is one potential issue though that probably did justify the "don't
> > > exceed the queue size" rule:
> > > 
> > > ATM the descriptor table is allocated (just in time) as *one* continuous
> > > buffer via kmalloc_array():
> > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > 
> > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > would
> > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > fragmented physical memory. For such kind of error case there is
> > > currently a fallback path in virtqueue_add_split() that would then use
> > > the required amount of pre-allocated descriptors instead:
> > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7
> > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > 
> > > That fallback recovery path would no longer be viable if the queue size
> > > was
> > > exceeded. There would be alternatives though, e.g. by allowing to chain
> > > indirect descriptor tables (currently prohibited by the virtio specs).
> > 
> > Making the maximum number of descriptors independent of the queue size
> > requires a change to the VIRTIO spec since the two values are currently
> > explicitly tied together by the spec.
> 
> Yes, that's what the virtio specs say. But they don't say why, nor did I hear
> a reason in this dicussion.
> 
> That's why I invested time reviewing current virtio implementation and specs,
> as well as actually testing exceeding that limit. And as I outlined in detail
> in my previous email, I only found one theoretical issue that could be
> addressed though.

I agree that there is a limitation in the VIRTIO spec, but violating the
spec isn't an acceptable solution:

1. QEMU and Linux aren't the only components that implement VIRTIO. You
   cannot make assumptions about their implementations because it may
   break spec-compliant implementations that you haven't looked at.

   Your patches weren't able to increase Queue Size because some device
   implementations break when descriptor chains are too long. This shows
   there is a practical issue even in QEMU.

2. The specific spec violation that we discussed creates the problem
   that drivers can no longer determine the maximum descriptor chain
   length. This in turn will lead to more implementation-specific
   assumptions being baked into drivers and cause problems with
   interoperability and future changes.

The spec needs to be extended instead. I included an idea for how to do
that below.

> > Before doing that, are there benchmark results showing that 1 MB vs 128
> > MB produces a performance improvement? I'm asking because if performance
> > with 1 MB is good then you can probably do that without having to change
> > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > good performance when it's ultimately implemented on top of disk and
> > network I/O that have lower size limits.
> 
> First some numbers, linear reading a 12 GB file:
> 
> msize    average      notes
> 
> 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> 1 MB     2551 MB/s    this msize would already violate virtio specs
> 2 MB     2521 MB/s    this msize would already violate virtio specs
> 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]

How many descriptors are used? 4 MB can be covered by a single
descriptor if the data is physically contiguous in memory, so this data
doesn't demonstrate a need for more descriptors.
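
(To put a rough number on that, assuming 4 KiB pages: a physically contiguous
4 MB buffer needs a single descriptor, whereas the same 4 MB scattered
page-wise needs 4 MB / 4 KiB = 1024 descriptors, i.e. exactly the current 1k
limit.)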

> But again, this is not just about performance. My conclusion as described in
> my previous email is that virtio currently squeezes
> 
> 	"max. simultanious amount of bulk messages"
> 
> vs.
> 
> 	"max. bulk data transmission size per bulk messaage"
> 
> into the same configuration parameter, which is IMO inappropriate and hence
> splitting them into 2 separate parameters when creating a queue makes sense,
> independent of the performance benchmarks.
> 
> [1] https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Some devices effectively already have this because the device advertises
a maximum number of descriptors via device-specific mechanisms like the
struct virtio_blk_config seg_max field. But today these fields can only
reduce the maximum descriptor chain length because the spec still limits
the length to Queue Size.
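
For reference, an abridged sketch of that existing mechanism (only the leading
config fields are shown; the driver-side read is roughly what virtio-blk does):

/* Abridged virtio-blk device configuration layout. seg_max advertises the
 * maximum number of segments (descriptors) the device accepts per request. */
struct virtio_blk_config {
    __le64 capacity;   /* device size in 512-byte sectors */
    __le32 size_max;   /* maximum size of any single segment */
    __le32 seg_max;    /* maximum number of segments per request */
    /* ... further fields omitted ... */
};

/* Guest driver side (sketch): read the limit from config space and cap the
 * per-request segment count accordingly. */
u32 sg_elems;
virtio_cread(vdev, struct virtio_blk_config, seg_max, &sg_elems);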

We can build on this approach to raise the length above Queue Size. This
approach has the advantage that the maximum number of segments isn't per
device or per virtqueue, it's fine-grained. If the device supports two
request types then different max descriptor chain limits could be given
for them by introducing two separate configuration space fields.

Here are the corresponding spec changes:

1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
   to indicate that indirect descriptor table size and maximum
   descriptor chain length are not limited by the Queue Size value. (Maybe
   there still needs to be a limit like 2^15?)

2. "2.6.5.3.1 Driver Requirements: Indirect Descriptors" is updated to
   say that VIRTIO_RING_F_LARGE_INDIRECT_DESC overrides the maximum
   descriptor chain length.

3. A new configuration space field is added for 9p indicating the
   maximum descriptor chain length.

One thing that's messy is that we've been discussing the maximum
descriptor chain length but 9p has the "msize" concept, which isn't
aware of contiguous memory. It may be necessary to extend the 9p driver
code to size requests not just according to their length in bytes but
also according to the descriptor chain length. That's how the Linux
block layer deals with queue limits (struct queue_limits max_segments vs
max_hw_sectors).
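
To make the proposed spec changes a bit more concrete, a hypothetical sketch
(neither the feature bit nor the config field exist today; the names, the bit
number and the field layout are placeholders only):

/* Hypothetical placeholders -- nothing below is part of the current VIRTIO
 * spec or of the Linux/QEMU headers. */
#define VIRTIO_RING_F_LARGE_INDIRECT_DESC  42   /* placeholder, not a reserved bit */

/* Possible 9p config space extension announcing the per-request limit: */
struct virtio_9p_config_ext {
    __le16 tag_len;
    __le32 max_desc_chain_len;   /* valid only if the feature was negotiated */
    __u8   tag[];
};

/* Driver side (sketch): honour the larger limit only when offered. */
u32 chain_limit = queue_size;    /* spec default: limited by Queue Size */
if (virtio_has_feature(vdev, VIRTIO_RING_F_LARGE_INDIRECT_DESC))
    virtio_cread(vdev, struct virtio_9p_config_ext, max_desc_chain_len,
                 &chain_limit);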

Stefan
Christian Schoenebeck Nov. 1, 2021, 8:29 p.m. UTC | #15
On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > 
> > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > Schoenebeck
> > > > > 
> > > > > wrote:
> > > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > > limited
> > > > > > > > > > > to
> > > > > > > > > > > 4M
> > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > > maximum
> > > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > > according
> > > > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > virtio specs:
> > > > > > > > > > > 
> > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virt
> > > > > > > > > > > io-v
> > > > > > > > > > > 1.1-
> > > > > > > > > > > cs
> > > > > > > > > > > 01
> > > > > > > > > > > .html#
> > > > > > > > > > > x1-240006
> > > > > > > > > > 
> > > > > > > > > > Hi Christian,
> > > > > > > 
> > > > > > > > > > I took a quick look at the code:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian
> > > > > > > !
> > > > > > > 
> > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > > elements
> > > > > > > > > > 
> > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > 
> > > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > > current
> > > > > > > > > kernel
> > > > > > > > > patches:
> > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_os
> > > > > > > > > s@cr
> > > > > > > > > udeb
> > > > > > > > > yt
> > > > > > > > > e.
> > > > > > > > > com/>
> > > > > > > > 
> > > > > > > > I haven't read the patches yet but I'm concerned that today
> > > > > > > > the
> > > > > > > > driver
> > > > > > > > is pretty well-behaved and this new patch series introduces a
> > > > > > > > spec
> > > > > > > > violation. Not fixing existing spec violations is okay, but
> > > > > > > > adding
> > > > > > > > new
> > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > solution.
> > > > > > 
> > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > therefore
> > > > > > actually is that the kernel patches are already too complex,
> > > > > > because
> > > > > > the
> > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > kernel
> > > > > > side, and he barely has time for 9p anymore.
> > > > > > 
> > > > > > Another reason for me to catch up on reading current kernel code
> > > > > > and
> > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of
> > > > > > this
> > > > > > issue.
> > > > > > 
> > > > > > As for current kernel patches' complexity: I can certainly drop
> > > > > > patch
> > > > > > 7
> > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > biggest
> > > > > > chunk, I have to see if I can simplify it, and whether it would
> > > > > > make
> > > > > > sense to squash with patch 3.
> > > > > > 
> > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2)
> > > > > > > > > > and
> > > > > > > > > > will
> > > > > > > > > > fail
> > > > > > > > > > 
> > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > 
> > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > error
> > > > > > > > > during
> > > > > > > > > testing.
> > > > > > > > > 
> > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend
> > > > > > > > > in
> > > > > > > > > practice,
> > > > > > > > > so
> > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > this
> > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > 
> > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > > *fs,
> > > > > > > > > 
> > > > > > > > >                             const struct iovec *iov,
> > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > 
> > > > > > > > > {
> > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > 
> > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > 
> > > > > > > > > #else
> > > > > > > > > 
> > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > >     if (err == -1) {
> > > > > > > > >     
> > > > > > > > >         return err;
> > > > > > > > >     
> > > > > > > > >     } else {
> > > > > > > > >     
> > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > >     
> > > > > > > > >     }
> > > > > > > > > 
> > > > > > > > > #endif
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > advantage
> > > > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Stefan
> > > > > > > > > 
> > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > possible
> > > > > > > > > explanation
> > > > > > > > > might be that preadv() already has this wrapped into a loop
> > > > > > > > > in
> > > > > > > > > its
> > > > > > > > > implementation to circumvent a limit like IOV_MAX. It might
> > > > > > > > > be
> > > > > > > > > another
> > > > > > > > > "it
> > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > 
> > > > > > > > > There are still a bunch of other issues I have to resolve.
> > > > > > > > > If
> > > > > > > > > you
> > > > > > > > > look
> > > > > > > > > at
> > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > basically
> > > > > > > > > does
> > > > > > > > > this ATM> >
> > > > > > > > > 
> > > > > > > > >     kmalloc(msize);
> > > > > > > 
> > > > > > > Note that this is done twice : once for the T message (client
> > > > > > > request)
> > > > > > > and
> > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > adjust
> > > > > > > the
> > > > > > > size
> > > > > > > of the T message to what's really needed instead of allocating
> > > > > > > the
> > > > > > > full
> > > > > > > msize. R message size is not known though.
> > > > > > 
> > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > server
> > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > appropriate
> > > > > > exact sizes for each response type. So server could just push
> > > > > > space
> > > > > > that's
> > > > > > really needed for its responses.
> > > > > > 
> > > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > > memory
> > > > > > > > > for
> > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > mounted
> > > > > > > > > with
> > > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > would
> > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > PAGE_SIZE,
> > > > > > > > > which
> > > > > > > > > obviously may fail at any time.>
> > > > > > > > 
> > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > > situation.
> > > > > > 
> > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > wrapper
> > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > kmalloc()
> > > > > > with
> > > > > > large msize values immediately on mounting:
> > > > > > 
> > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > --- a/net/9p/client.c
> > > > > > +++ b/net/9p/client.c
> > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > p9_client
> > > > > > *clnt)
> > > > > > 
> > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > >  *fc,
> > > > > >  
> > > > > >                          int alloc_msize)
> > > > > >  
> > > > > >  {
> > > > > > 
> > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > +       if (false) {
> > > > > > 
> > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > >                 GFP_NOFS);
> > > > > >                 fc->cache = c->fcall_cache;
> > > > > >         
> > > > > >         } else {
> > > > > > 
> > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > 
> > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > 
> > > > > Now I get:
> > > > >    virtio: bogus descriptor or out of resources
> > > > > 
> > > > > So, still some work ahead on both ends.
> > > > 
> > > > Few hacks later (only changes on 9p client side) I got this running
> > > > stable
> > > > now. The reason for the virtio error above was that kvmalloc() returns
> > > > a
> > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that
> > > > is
> > > > inaccessible from host side, hence that "bogus descriptor" message by
> > > > QEMU.
> > > > So I had to split those linear 9p client buffers into sparse ones (set
> > > > of
> > > > individual pages).
> > > > 
> > > > I tested this for some days with various virtio transmission sizes and
> > > > it
> > > > works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > 128 MB
> > > > write space per virtio round trip message).
> > > > 
> > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > after reviewing the existing code.
> > > > 
> > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > the
> > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > this
> > > > limitation. It is apparently however indeed a limitation for userland
> > > > apps
> > > > calling the Linux kernel's syscalls yet.
> > > > 
> > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > virtio
> > > > transmission size limit should not be squeezed into the queue size
> > > > argument of virtio_add_queue(). Not because of the previous argument
> > > > that
> > > > it would waste space (~1MB), but rather because they are two different
> > > > things. To outline this, just a quick recap of what happens exactly
> > > > when
> > > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > > layout here):
> > > > 
> > > > ---------- [recap-start] ----------
> > > > 
> > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > exactly
> > > > *one* position of the two available/used ring buffers. The actual
> > > > descriptor table though, containing all the DMA addresses of the
> > > > message
> > > > bulk data, is allocated just in time for each round trip message. Say,
> > > > it
> > > > is the first message sent, it yields in the following structure:
> > > > 
> > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > 
> > > >    +-+              +-+           +-----------------+
> > > >    
> > > >    |D|------------->|d|---------->| Bulk data block |
> > > >    
> > > >    +-+              |d|--------+  +-----------------+
> > > >    
> > > >    | |              |d|------+ |
> > > >    
> > > >    +-+               .       | |  +-----------------+
> > > >    
> > > >    | |               .       | +->| Bulk data block |
> > > >     
> > > >     .                .       |    +-----------------+
> > > >     .               |d|-+    |
> > > >     .               +-+ |    |    +-----------------+
> > > >     
> > > >    | |                  |    +--->| Bulk data block |
> > > >    
> > > >    +-+                  |         +-----------------+
> > > >    
> > > >    | |                  |                 .
> > > >    
> > > >    +-+                  |                 .
> > > >    
> > > >                         |                 .
> > > >                         |         
> > > >                         |         +-----------------+
> > > >                         
> > > >                         +-------->| Bulk data block |
> > > >                         
> > > >                                   +-----------------+
> > > > 
> > > > Legend:
> > > > D: pre-allocated descriptor
> > > > d: just in time allocated descriptor
> > > > -->: memory pointer (DMA)
> > > > 
> > > > The bulk data blocks are allocated by the respective device driver
> > > > above
> > > > virtio subsystem level (guest side).
> > > > 
> > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > a
> > > > ring buffer.
> > > > 
> > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > defined
> > > > as:
> > > > 
> > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > "next". */ struct vring_desc {
> > > > 
> > > > 	/* Address (guest-physical). */
> > > > 	__virtio64 addr;
> > > > 	/* Length. */
> > > > 	__virtio32 len;
> > > > 	/* The flags as indicated above. */
> > > > 	__virtio16 flags;
> > > > 	/* We chain unused descriptors via this, too */
> > > > 	__virtio16 next;
> > > > 
> > > > };
> > > > 
> > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > message guest->host (which will transmit DMA addresses of guest
> > > > allocated
> > > > bulk data blocks that are used for data sent to device, and separate
> > > > guest allocated bulk data blocks that will be used by host side to
> > > > place
> > > > its response bulk data), and the "used" ring buffer is for sending
> > > > host->guest to let guest know about host's response and that it could
> > > > now
> > > > safely consume and then deallocate the bulk data blocks subsequently.
> > > > 
> > > > ---------- [recap-end] ----------
> > > > 
> > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > define the maximum amount of descriptors. The "queue size" rather
> > > > defines
> > > > how many pending messages can be pushed into either one ringbuffer
> > > > before
> > > > the other side would need to wait until the counter side would step up
> > > > (i.e. ring buffer full).
> > > > 
> > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually
> > > > is)
> > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > each
> > > > virtio round trip message.
> > > > 
> > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > associative with its maximum amount of active 9p requests the server
> > > > could
> > > > 
> > > > handle simultaniously:
> > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > >   MAX_REQ,
> > > >   
> > > >                                  handle_9p_output);
> > > > 
> > > > So if I would change it like this, just for the purpose to increase
> > > > the
> > > > max. virtio transmission size:
> > > > 
> > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState
> > > > *dev,
> > > > Error **errp)>
> > > > 
> > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > >      strlen(s->fsconf.tag);
> > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > >      
> > > >                  VIRTQUEUE_MAX_SIZE);
> > > > 
> > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > 
> > > >  }
> > > > 
> > > > Then it would require additional synchronization code on both ends and
> > > > therefore unnecessary complexity, because it would now be possible
> > > > that
> > > > more requests are pushed into the ringbuffer than server could handle.
> > > > 
> > > > There is one potential issue though that probably did justify the
> > > > "don't
> > > > exceed the queue size" rule:
> > > > 
> > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > continuous
> > > > buffer via kmalloc_array():
> > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > f7c7
> > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > 
> > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > would
> > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > fragmented physical memory. For such kind of error case there is
> > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > the required amount of pre-allocated descriptors instead:
> > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > f7c7
> > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > 
> > > > That fallback recovery path would no longer be viable if the queue
> > > > size
> > > > was
> > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > chain
> > > > indirect descriptor tables (currently prohibited by the virtio specs).
> > > 
> > > Making the maximum number of descriptors independent of the queue size
> > > requires a change to the VIRTIO spec since the two values are currently
> > > explicitly tied together by the spec.
> > 
> > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > hear a reason in this dicussion.
> > 
> > That's why I invested time reviewing current virtio implementation and
> > specs, as well as actually testing exceeding that limit. And as I
> > outlined in detail in my previous email, I only found one theoretical
> > issue that could be addressed though.
> 
> I agree that there is a limitation in the VIRTIO spec, but violating the
> spec isn't an acceptable solution:
> 
> 1. QEMU and Linux aren't the only components that implement VIRTIO. You
>    cannot make assumptions about their implementations because it may
>    break spec-compliant implementations that you haven't looked at.
> 
>    Your patches weren't able to increase Queue Size because some device
>    implementations break when descriptor chains are too long. This shows
>    there is a practical issue even in QEMU.
> 
> 2. The specific spec violation that we discussed creates the problem
>    that drivers can no longer determine the maximum description chain
>    length. This in turn will lead to more implementation-specific
>    assumptions being baked into drivers and cause problems with
>    interoperability and future changes.
> 
> The spec needs to be extended instead. I included an idea for how to do
> that below.

Sure, I just wanted to see if there was a non-negligible "hard" show stopper
per se that I probably haven't seen yet. I have not questioned aiming for a
clean solution.

Thanks for the clarification!

> > > Before doing that, are there benchmark results showing that 1 MB vs 128
> > > MB produces a performance improvement? I'm asking because if performance
> > > with 1 MB is good then you can probably do that without having to change
> > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > > good performance when it's ultimately implemented on top of disk and
> > > network I/O that have lower size limits.
> > 
> > First some numbers, linear reading a 12 GB file:
> > 
> > msize    average      notes
> > 
> > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> 
> How many descriptors are used? 4 MB can be covered by a single
> descriptor if the data is physically contiguous in memory, so this data
> doesn't demonstrate a need for more descriptors.

No, in the last couple of years there was apparently no kernel version that
used just one descriptor, nor did my benchmarked version. Even though the
Linux 9p client still uses simple linear buffers (contiguous physical memory)
at the 9p client level, these are split into PAGE_SIZE chunks by the function
pack_sg_list() [1] before being fed to the virtio level:

static unsigned int rest_of_page(void *data)
{
	return PAGE_SIZE - offset_in_page(data);
}
...
static int pack_sg_list(struct scatterlist *sg, int start,
			int limit, char *data, int count)
{
	int s;
	int index = start;

	while (count) {
		s = rest_of_page(data);
		...
		sg_set_buf(&sg[index++], data, s);
		count -= s;
		data += s;
	}
	...
}

[1] https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171

So when sending 4 MB over the virtio wire, it would result in 1k descriptors ATM.

I have wondered about this before, but did not question it, because due to
virtio's cross-platform nature I couldn't say for certain whether that
splitting is needed somewhere. For virtio-PCI I know for sure that one
descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know whether that
applies to all buses and architectures.

> > But again, this is not just about performance. My conclusion as described
> > in my previous email is that virtio currently squeezes
> > 
> > 	"max. simultanious amount of bulk messages"
> > 
> > vs.
> > 
> > 	"max. bulk data transmission size per bulk messaage"
> > 
> > into the same configuration parameter, which is IMO inappropriate and
> > hence
> > splitting them into 2 separate parameters when creating a queue makes
> > sense, independent of the performance benchmarks.
> > 
> > [1]
> > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.c
> > om/
> Some devices effectively already have this because the device advertises
> a maximum number of descriptors via device-specific mechanisms like the
> struct virtio_blk_config seg_max field. But today these fields can only
> reduce the maximum descriptor chain length because the spec still limits
> the length to Queue Size.
> 
> We can build on this approach to raise the length above Queue Size. This
> approach has the advantage that the maximum number of segments isn't per
> device or per virtqueue, it's fine-grained. If the device supports two
> requests types then different max descriptor chain limits could be given
> for them by introducing two separate configuration space fields.
> 
> Here are the corresponding spec changes:
> 
> 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
>    to indicate that indirect descriptor table size and maximum
>    descriptor chain length are not limited by Queue Size value. (Maybe
>    there still needs to be a limit like 2^15?)

Sounds good to me!

AFAIK it is effectively limited to 2^16 because of vring_desc->next:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
        /* Address (guest-physical). */
        __virtio64 addr;
        /* Length. */
        __virtio32 len;
        /* The flags as indicated above. */
        __virtio16 flags;
        /* We chain unused descriptors via this, too */
        __virtio16 next;
};

At least unless either chained indirect descriptor tables or nested indirect
descriptor tables were allowed as well, both of which are prohibited by the
specs ATM. I'm not saying that this would be needed anytime soon. :)

> 2. "2.6.5.3.1 Driver Requirements: Indirect Descriptors" is updated to
>    say that VIRTIO_RING_F_LARGE_INDIRECT_DESC overrides the maximum
>    descriptor chain length.

OK

> 2. A new configuration space field is added for 9p indicating the
>    maximum descriptor chain length.

So in addition to the VIRTIO_RING_F_LARGE_INDIRECT_DESC feature bit there
would also be a numeric field that specifies the exact max. chain length.
Sure, why not.

> One thing that's messy is that we've been discussing the maximum
> descriptor chain length but 9p has the "msize" concept, which isn't
> aware of contiguous memory. It may be necessary to extend the 9p driver
> code to size requests not just according to their length in bytes but
> also according to the descriptor chain length. That's how the Linux
> block layer deals with queue limits (struct queue_limits max_segments vs
> max_hw_sectors).

Hmm, I can't follow on that one. Why would that be needed in the case of 9p?
My plan was to have the 9p client simply limit msize at session start to
whatever the max. number of virtio descriptors supported by the host is,
using PAGE_SIZE as the size per descriptor, because that's what the 9p client
actually does ATM (see above). So you think that should be changed to e.g.
just one descriptor for 4 MB, right?

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 3, 2021, 11:33 a.m. UTC | #16
On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > 
> > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > Schoenebeck
> > > > > > 
> > > > > > wrote:
> > > > > > > > > > > > At the moment the maximum transfer size with virtio is
> > > > > > > > > > > > limited
> > > > > > > > > > > > to
> > > > > > > > > > > > 4M
> > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its
> > > > > > > > > > > > maximum
> > > > > > > > > > > > theoretical possible transfer size of 128M (32k pages)
> > > > > > > > > > > > according
> > > > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virt
> > > > > > > > > > > > io-v
> > > > > > > > > > > > 1.1-
> > > > > > > > > > > > cs
> > > > > > > > > > > > 01
> > > > > > > > > > > > .html#
> > > > > > > > > > > > x1-240006
> > > > > > > > > > > 
> > > > > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > > > > I took a quick look at the code:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian
> > > > > > > > !
> > > > > > > > 
> > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128
> > > > > > > > > > > elements
> > > > > > > > > > > 
> > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > 
> > > > > > > > > > Yes, that's the limitation that I am about to remove (WIP);
> > > > > > > > > > current
> > > > > > > > > > kernel
> > > > > > > > > > patches:
> > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_os
> > > > > > > > > > s@cr
> > > > > > > > > > udeb
> > > > > > > > > > yt
> > > > > > > > > > e.
> > > > > > > > > > com/>
> > > > > > > > > 
> > > > > > > > > I haven't read the patches yet but I'm concerned that today
> > > > > > > > > the
> > > > > > > > > driver
> > > > > > > > > is pretty well-behaved and this new patch series introduces a
> > > > > > > > > spec
> > > > > > > > > violation. Not fixing existing spec violations is okay, but
> > > > > > > > > adding
> > > > > > > > > new
> > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > solution.
> > > > > > > 
> > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > therefore
> > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > because
> > > > > > > the
> > > > > > > current situation is that only Dominique is handling 9p patches on
> > > > > > > kernel
> > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > 
> > > > > > > Another reason for me to catch up on reading current kernel code
> > > > > > > and
> > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of
> > > > > > > this
> > > > > > > issue.
> > > > > > > 
> > > > > > > As for current kernel patches' complexity: I can certainly drop
> > > > > > > patch
> > > > > > > 7
> > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > biggest
> > > > > > > chunk, I have to see if I can simplify it, and whether it would
> > > > > > > make
> > > > > > > sense to squash with patch 3.
> > > > > > > 
> > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2)
> > > > > > > > > > > and
> > > > > > > > > > > will
> > > > > > > > > > > fail
> > > > > > > > > > > 
> > > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > 
> > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > error
> > > > > > > > > > during
> > > > > > > > > > testing.
> > > > > > > > > > 
> > > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend
> > > > > > > > > > in
> > > > > > > > > > practice,
> > > > > > > > > > so
> > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > this
> > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > 
> > > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState
> > > > > > > > > > *fs,
> > > > > > > > > > 
> > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > 
> > > > > > > > > > {
> > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > 
> > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > 
> > > > > > > > > > #else
> > > > > > > > > > 
> > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > >     if (err == -1) {
> > > > > > > > > >     
> > > > > > > > > >         return err;
> > > > > > > > > >     
> > > > > > > > > >     } else {
> > > > > > > > > >     
> > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > >     
> > > > > > > > > >     }
> > > > > > > > > > 
> > > > > > > > > > #endif
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > advantage
> > > > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stefan
> > > > > > > > > > 
> > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > possible
> > > > > > > > > > explanation
> > > > > > > > > > might be that preadv() already has this wrapped into a loop
> > > > > > > > > > in
> > > > > > > > > > its
> > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It might
> > > > > > > > > > be
> > > > > > > > > > another
> > > > > > > > > > "it
> > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > 
> > > > > > > > > > There are still a bunch of other issues I have to resolve.
> > > > > > > > > > If
> > > > > > > > > > you
> > > > > > > > > > look
> > > > > > > > > > at
> > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > basically
> > > > > > > > > > does
> > > > > > > > > > this ATM> >
> > > > > > > > > > 
> > > > > > > > > >     kmalloc(msize);
> > > > > > > > 
> > > > > > > > Note that this is done twice : once for the T message (client
> > > > > > > > request)
> > > > > > > > and
> > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > adjust
> > > > > > > > the
> > > > > > > > size
> > > > > > > > of the T message to what's really needed instead of allocating
> > > > > > > > the
> > > > > > > > full
> > > > > > > > msize. R message size is not known though.
> > > > > > > 
> > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > server
> > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > appropriate
> > > > > > > exact sizes for each response type. So server could just push
> > > > > > > space
> > > > > > > that's
> > > > > > > really needed for its responses.
> > > > > > > 
> > > > > > > > > > for every 9p request. So not only does it allocate much more
> > > > > > > > > > memory
> > > > > > > > > > for
> > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > mounted
> > > > > > > > > > with
> > > > > > > > > > msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > > would
> > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > which
> > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > 
> > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc()
> > > > > > > > > situation.
> > > > > > > 
> > > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > > wrapper
> > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > kmalloc()
> > > > > > > with
> > > > > > > large msize values immediately on mounting:
> > > > > > > 
> > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > --- a/net/9p/client.c
> > > > > > > +++ b/net/9p/client.c
> > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > p9_client
> > > > > > > *clnt)
> > > > > > > 
> > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > >  *fc,
> > > > > > >  
> > > > > > >                          int alloc_msize)
> > > > > > >  
> > > > > > >  {
> > > > > > > 
> > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > +       if (false) {
> > > > > > > 
> > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > >                 GFP_NOFS);
> > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > >         
> > > > > > >         } else {
> > > > > > > 
> > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > 
> > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > 
> > > > > > Now I get:
> > > > > >    virtio: bogus descriptor or out of resources
> > > > > > 
> > > > > > So, still some work ahead on both ends.
> > > > > 
> > > > > Few hacks later (only changes on 9p client side) I got this running
> > > > > stable
> > > > > now. The reason for the virtio error above was that kvmalloc() returns
> > > > > a
> > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address that
> > > > > is
> > > > > inaccessible from host side, hence that "bogus descriptor" message by
> > > > > QEMU.
> > > > > So I had to split those linear 9p client buffers into sparse ones (set
> > > > > of
> > > > > individual pages).
> > > > > 
> > > > > I tested this for some days with various virtio transmission sizes and
> > > > > it
> > > > > works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > > 128 MB
> > > > > write space per virtio round trip message).
> > > > > 
> > > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > > after reviewing the existing code.
> > > > > 
> > > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > > the
> > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > this
> > > > > limitation. It is apparently however indeed a limitation for userland
> > > > > apps
> > > > > calling the Linux kernel's syscalls yet.
> > > > > 
> > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > virtio
> > > > > transmission size limit should not be squeezed into the queue size
> > > > > argument of virtio_add_queue(). Not because of the previous argument
> > > > > that
> > > > > it would waste space (~1MB), but rather because they are two different
> > > > > things. To outline this, just a quick recap of what happens exactly
> > > > > when
> > > > > a bulk message is pushed over the virtio wire (assuming virtio "split"
> > > > > layout here):
> > > > > 
> > > > > ---------- [recap-start] ----------
> > > > > 
> > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > exactly
> > > > > *one* position of the two available/used ring buffers. The actual
> > > > > descriptor table though, containing all the DMA addresses of the
> > > > > message
> > > > > bulk data, is allocated just in time for each round trip message. Say,
> > > > > it
> > > > > is the first message sent, it yields in the following structure:
> > > > > 
> > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > 
> > > > >    +-+              +-+           +-----------------+
> > > > >    
> > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > >    
> > > > >    +-+              |d|--------+  +-----------------+
> > > > >    
> > > > >    | |              |d|------+ |
> > > > >    
> > > > >    +-+               .       | |  +-----------------+
> > > > >    
> > > > >    | |               .       | +->| Bulk data block |
> > > > >     
> > > > >     .                .       |    +-----------------+
> > > > >     .               |d|-+    |
> > > > >     .               +-+ |    |    +-----------------+
> > > > >     
> > > > >    | |                  |    +--->| Bulk data block |
> > > > >    
> > > > >    +-+                  |         +-----------------+
> > > > >    
> > > > >    | |                  |                 .
> > > > >    
> > > > >    +-+                  |                 .
> > > > >    
> > > > >                         |                 .
> > > > >                         |         
> > > > >                         |         +-----------------+
> > > > >                         
> > > > >                         +-------->| Bulk data block |
> > > > >                         
> > > > >                                   +-----------------+
> > > > > 
> > > > > Legend:
> > > > > D: pre-allocated descriptor
> > > > > d: just in time allocated descriptor
> > > > > -->: memory pointer (DMA)
> > > > > 
> > > > > The bulk data blocks are allocated by the respective device driver
> > > > > above
> > > > > virtio subsystem level (guest side).
> > > > > 
> > > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > > a
> > > > > ring buffer.
> > > > > 
> > > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > > defined
> > > > > as:
> > > > > 
> > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > > "next". */ struct vring_desc {
> > > > > 
> > > > > 	/* Address (guest-physical). */
> > > > > 	__virtio64 addr;
> > > > > 	/* Length. */
> > > > > 	__virtio32 len;
> > > > > 	/* The flags as indicated above. */
> > > > > 	__virtio16 flags;
> > > > > 	/* We chain unused descriptors via this, too */
> > > > > 	__virtio16 next;
> > > > > 
> > > > > };
> > > > > 
> > > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > allocated
> > > > > bulk data blocks that are used for data sent to device, and separate
> > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > place
> > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > host->guest to let guest know about host's response and that it could
> > > > > now
> > > > > safely consume and then deallocate the bulk data blocks subsequently.
> > > > > 
> > > > > ---------- [recap-end] ----------
> > > > > 
> > > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > defines
> > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > before
> > > > > the other side would need to wait until the counter side would step up
> > > > > (i.e. ring buffer full).
> > > > > 
> > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually
> > > > > is)
> > > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > > each
> > > > > virtio round trip message.
> > > > > 
> > > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > > associative with its maximum amount of active 9p requests the server
> > > > > could
> > > > > 
> > > > > handle simultaniously:
> > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > >   MAX_REQ,
> > > > >   
> > > > >                                  handle_9p_output);
> > > > > 
> > > > > So if I would change it like this, just for the purpose to increase
> > > > > the
> > > > > max. virtio transmission size:
> > > > > 
> > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState
> > > > > *dev,
> > > > > Error **errp)>
> > > > > 
> > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > >      strlen(s->fsconf.tag);
> > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > >      
> > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > 
> > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > 
> > > > >  }
> > > > > 
> > > > > Then it would require additional synchronization code on both ends and
> > > > > therefore unnecessary complexity, because it would now be possible
> > > > > that
> > > > > more requests are pushed into the ringbuffer than server could handle.
> > > > > 
> > > > > There is one potential issue though that probably did justify the
> > > > > "don't
> > > > > exceed the queue size" rule:
> > > > > 
> > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > continuous
> > > > > buffer via kmalloc_array():
> > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > > f7c7
> > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > 
> > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > > would
> > > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > > fragmented physical memory. For such kind of error case there is
> > > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > > the required amount of pre-allocated descriptors instead:
> > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086
> > > > > f7c7
> > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > 
> > > > > That fallback recovery path would no longer be viable if the queue
> > > > > size
> > > > > was
> > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > chain
> > > > > indirect descriptor tables (currently prohibited by the virtio specs).
> > > > 
> > > > Making the maximum number of descriptors independent of the queue size
> > > > requires a change to the VIRTIO spec since the two values are currently
> > > > explicitly tied together by the spec.
> > > 
> > > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > > hear a reason in this dicussion.
> > > 
> > > That's why I invested time reviewing current virtio implementation and
> > > specs, as well as actually testing exceeding that limit. And as I
> > > outlined in detail in my previous email, I only found one theoretical
> > > issue that could be addressed though.
> > 
> > I agree that there is a limitation in the VIRTIO spec, but violating the
> > spec isn't an acceptable solution:
> > 
> > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> >    cannot make assumptions about their implementations because it may
> >    break spec-compliant implementations that you haven't looked at.
> > 
> >    Your patches weren't able to increase Queue Size because some device
> >    implementations break when descriptor chains are too long. This shows
> >    there is a practical issue even in QEMU.
> > 
> > 2. The specific spec violation that we discussed creates the problem
> >    that drivers can no longer determine the maximum description chain
> >    length. This in turn will lead to more implementation-specific
> >    assumptions being baked into drivers and cause problems with
> >    interoperability and future changes.
> > 
> > The spec needs to be extended instead. I included an idea for how to do
> > that below.
> 
> Sure, I just wanted to see if there was a non-neglectable "hard" show stopper
> per se that I probably haven't seen yet. I have not questioned aiming a clean
> solution.
> 
> Thanks for the clarification!
> 
> > > > Before doing that, are there benchmark results showing that 1 MB vs 128
> > > > MB produces a performance improvement? I'm asking because if performance
> > > > with 1 MB is good then you can probably do that without having to change
> > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB for
> > > > good performance when it's ultimately implemented on top of disk and
> > > > network I/O that have lower size limits.
> > > 
> > > First some numbers, linear reading a 12 GB file:
> > > 
> > > msize    average      notes
> > > 
> > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> > 
> > How many descriptors are used? 4 MB can be covered by a single
> > descriptor if the data is physically contiguous in memory, so this data
> > doesn't demonstrate a need for more descriptors.
> 
> No, in the last couple years there was apparently no kernel version that used
> just one descriptor, nor did my benchmarked version. Even though the Linux 9p
> client uses (yet) simple linear buffers (contiguous physical memory) on 9p
> client level, these are however split into PAGE_SIZE chunks by function
> pack_sg_list() [1] before being fed to virtio level:
> 
> static unsigned int rest_of_page(void *data)
> {
> 	return PAGE_SIZE - offset_in_page(data);
> }
> ...
> static int pack_sg_list(struct scatterlist *sg, int start,
> 			int limit, char *data, int count)
> {
> 	int s;
> 	int index = start;
> 
> 	while (count) {
> 		s = rest_of_page(data);
> 		...
> 		sg_set_buf(&sg[index++], data, s);
> 		count -= s;
> 		data += s;
> 	}
> 	...
> }
> 
> [1] https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> 
> So when sending 4MB over virtio wire, it would yield in 1k descriptors ATM.
> 
> I have wondered about this before, but did not question it, because due to the
> cross-platform nature I couldn't say for certain whether that's probably
> needed somewhere. I mean for the case virtio-PCI I know for sure that one
> descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if that applies
> to all buses and architectures.

VIRTIO does not limit the descriptor len field to PAGE_SIZE (it is a 32-bit
byte count), so I don't think there is a limit at the VIRTIO level.

If this function coalesces adjacent pages then the descriptor chain
length issues could be reduced.

> > > But again, this is not just about performance. My conclusion as described
> > > in my previous email is that virtio currently squeezes
> > > 
> > > 	"max. simultanious amount of bulk messages"
> > > 
> > > vs.
> > > 
> > > 	"max. bulk data transmission size per bulk messaage"
> > > 
> > > into the same configuration parameter, which is IMO inappropriate and
> > > hence
> > > splitting them into 2 separate parameters when creating a queue makes
> > > sense, independent of the performance benchmarks.
> > > 
> > > [1]
> > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.c
> > > om/
> > Some devices effectively already have this because the device advertises
> > a maximum number of descriptors via device-specific mechanisms like the
> > struct virtio_blk_config seg_max field. But today these fields can only
> > reduce the maximum descriptor chain length because the spec still limits
> > the length to Queue Size.
> > 
> > We can build on this approach to raise the length above Queue Size. This
> > approach has the advantage that the maximum number of segments isn't per
> > device or per virtqueue, it's fine-grained. If the device supports two
> > requests types then different max descriptor chain limits could be given
> > for them by introducing two separate configuration space fields.
> > 
> > Here are the corresponding spec changes:
> > 
> > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> >    to indicate that indirect descriptor table size and maximum
> >    descriptor chain length are not limited by Queue Size value. (Maybe
> >    there still needs to be a limit like 2^15?)
> 
> Sounds good to me!
> 
> AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> 
> /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> struct vring_desc {
>         /* Address (guest-physical). */
>         __virtio64 addr;
>         /* Length. */
>         __virtio32 len;
>         /* The flags as indicated above. */
>         __virtio16 flags;
>         /* We chain unused descriptors via this, too */
>         __virtio16 next;
> };

Yes, Split Virtqueues have a fundamental limit on indirect table size
due to the "next" field. Packed Virtqueue descriptors don't have a
"next" field so descriptor chains could be longer in theory (currently
forbidden by the spec).
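
For reference, the packed-ring descriptor (as in Linux's
include/uapi/linux/virtio_ring.h) indeed has no "next" member; a chain is
formed by consecutive ring slots flagged with VRING_DESC_F_NEXT
(VIRTQ_DESC_F_NEXT in the spec) rather than by an explicit link:

#include <linux/types.h>

/* Packed virtqueue descriptor as defined by virtio 1.1: 16 bytes, but
 * without the "next" field of the split layout, so the 2^16 chaining
 * bound discussed above does not exist structurally here. */
struct vring_packed_desc {
        /* Buffer Address. */
        __le64 addr;
        /* Buffer Length. */
        __le32 len;
        /* Buffer ID. */
        __le16 id;
        /* The flags depending on descriptor type. */
        __le16 flags;
};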

> > One thing that's messy is that we've been discussing the maximum
> > descriptor chain length but 9p has the "msize" concept, which isn't
> > aware of contiguous memory. It may be necessary to extend the 9p driver
> > code to size requests not just according to their length in bytes but
> > also according to the descriptor chain length. That's how the Linux
> > block layer deals with queue limits (struct queue_limits max_segments vs
> > max_hw_sectors).
> 
> Hmm, can't follow on that one. For what should that be needed in case of 9p?
> My plan was to limit msize by 9p client simply at session start to whatever is
> the max. amount virtio descriptors supported by host and using PAGE_SIZE as
> size per descriptor, because that's what 9p client actually does ATM (see
> above). So you think that should be changed to e.g. just one descriptor for
> 4MB, right?

Limiting msize to the 9p transport device's maximum number of
descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
because it doesn't take advantage of contiguous memory. I suggest
leaving msize alone, adding a separate limit at which requests are split
according to the maximum descriptor chain length, and tweaking
pack_sg_list() to coalesce adjacent pages.

That way msize can be large without necessarily using lots of
descriptors (depending on the memory layout).
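
Purely as a sketch of that coalescing idea (the function names are made up
and untested, and it assumes a lowmem buffer for which virt_to_page() is
valid, as the current linear 9p buffers are), it could look roughly like:

#include <linux/mm.h>
#include <linux/scatterlist.h>

/* How many bytes starting at data are physically contiguous? */
static int contiguous_run(char *data, int count)
{
        int len = min_t(int, PAGE_SIZE - offset_in_page(data), count);

        while (len < count &&
               page_to_pfn(virt_to_page(data + len)) ==
               page_to_pfn(virt_to_page(data + len - 1)) + 1)
                len += min_t(int, count - len, PAGE_SIZE);

        return len;
}

/* Like pack_sg_list(), but emits one sg entry per physically contiguous
 * run instead of one entry per page. */
static int pack_sg_list_coalesced(struct scatterlist *sg, int start,
                                  int limit, char *data, int count)
{
        int index = start;

        while (count) {
                int len = contiguous_run(data, count);

                BUG_ON(index >= limit);
                sg_unmark_end(&sg[index]);
                sg_set_buf(&sg[index++], data, len);
                data += len;
                count -= len;
        }
        if (index - start)
                sg_mark_end(&sg[index - 1]);
        return index - start;
}

Whether the extra virt_to_page() walks pay off would of course depend on how
often the buffer really is physically contiguous.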

Stefan
Christian Schoenebeck Nov. 4, 2021, 2:41 p.m. UTC | #17
On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > 
> > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian
> > > > > > > > > > > > Schoenebeck
> > > > > > > 
> > > > > > > wrote:
> > > > > > > > > > > > > At the moment the maximum transfer size with virtio
> > > > > > > > > > > > > is
> > > > > > > > > > > > > limited
> > > > > > > > > > > > > to
> > > > > > > > > > > > > 4M
> > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to
> > > > > > > > > > > > > its
> > > > > > > > > > > > > maximum
> > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > pages)
> > > > > > > > > > > > > according
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/
> > > > > > > > > > > > > virt
> > > > > > > > > > > > > io-v
> > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > cs
> > > > > > > > > > > > > 01
> > > > > > > > > > > > > .html#
> > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Christian,
> > > > > > > > > 
> > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > Christian
> > > > > > > > > !
> > > > > > > > > 
> > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains to
> > > > > > > > > > > > 128
> > > > > > > > > > > > elements
> > > > > > > > > > > > 
> > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > 
> > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > (WIP);
> > > > > > > > > > > current
> > > > > > > > > > > kernel
> > > > > > > > > > > patches:
> > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linu
> > > > > > > > > > > x_os
> > > > > > > > > > > s@cr
> > > > > > > > > > > udeb
> > > > > > > > > > > yt
> > > > > > > > > > > e.
> > > > > > > > > > > com/>
> > > > > > > > > > 
> > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > today
> > > > > > > > > > the
> > > > > > > > > > driver
> > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > introduces a
> > > > > > > > > > spec
> > > > > > > > > > violation. Not fixing existing spec violations is okay,
> > > > > > > > > > but
> > > > > > > > > > adding
> > > > > > > > > > new
> > > > > > > > > > ones is a red flag. I think we need to figure out a clean
> > > > > > > > > > solution.
> > > > > > > > 
> > > > > > > > Nobody has reviewed the kernel patches yet. My main concern
> > > > > > > > therefore
> > > > > > > > actually is that the kernel patches are already too complex,
> > > > > > > > because
> > > > > > > > the
> > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > patches on
> > > > > > > > kernel
> > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > 
> > > > > > > > Another reason for me to catch up on reading current kernel
> > > > > > > > code
> > > > > > > > and
> > > > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent
> > > > > > > > of
> > > > > > > > this
> > > > > > > > issue.
> > > > > > > > 
> > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > drop
> > > > > > > > patch
> > > > > > > > 7
> > > > > > > > entirely as it is probably just overkill. Patch 4 is then the
> > > > > > > > biggest
> > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > would
> > > > > > > > make
> > > > > > > > sense to squash with patch 3.
> > > > > > > > 
> > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > and
> > > > > > > > > > > > will
> > > > > > > > > > > > fail
> > > > > > > > > > > > 
> > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > >   iovecs
> > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > 
> > > > > > > > > > > Hmm, which makes me wonder why I never encountered this
> > > > > > > > > > > error
> > > > > > > > > > > during
> > > > > > > > > > > testing.
> > > > > > > > > > > 
> > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > backend
> > > > > > > > > > > in
> > > > > > > > > > > practice,
> > > > > > > > > > > so
> > > > > > > > > > > that v9fs_read() call would translate for most people to
> > > > > > > > > > > this
> > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > 
> > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > *fs,
> > > > > > > > > > > 
> > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > >                             int iovcnt, off_t offset)
> > > > > > > > > > > 
> > > > > > > > > > > {
> > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > 
> > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > 
> > > > > > > > > > > #else
> > > > > > > > > > > 
> > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > >     
> > > > > > > > > > >         return err;
> > > > > > > > > > >     
> > > > > > > > > > >     } else {
> > > > > > > > > > >     
> > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > >     
> > > > > > > > > > >     }
> > > > > > > > > > > 
> > > > > > > > > > > #endif
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > > Unless I misunderstood the code, neither side can take
> > > > > > > > > > > > advantage
> > > > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Stefan
> > > > > > > > > > > 
> > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > possible
> > > > > > > > > > > explanation
> > > > > > > > > > > might be that preadv() already has this wrapped into a
> > > > > > > > > > > loop
> > > > > > > > > > > in
> > > > > > > > > > > its
> > > > > > > > > > > implementation to circumvent a limit like IOV_MAX. It
> > > > > > > > > > > might
> > > > > > > > > > > be
> > > > > > > > > > > another
> > > > > > > > > > > "it
> > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > 
> > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > resolve.
> > > > > > > > > > > If
> > > > > > > > > > > you
> > > > > > > > > > > look
> > > > > > > > > > > at
> > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that it
> > > > > > > > > > > basically
> > > > > > > > > > > does
> > > > > > > > > > > this ATM> >
> > > > > > > > > > > 
> > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > 
> > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > (client
> > > > > > > > > request)
> > > > > > > > > and
> > > > > > > > > once for the R message (server answer). The 9p driver could
> > > > > > > > > adjust
> > > > > > > > > the
> > > > > > > > > size
> > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > allocating
> > > > > > > > > the
> > > > > > > > > full
> > > > > > > > > msize. R message size is not known though.
> > > > > > > > 
> > > > > > > > Would it make sense adding a second virtio ring, dedicated to
> > > > > > > > server
> > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > appropriate
> > > > > > > > exact sizes for each response type. So server could just push
> > > > > > > > space
> > > > > > > > that's
> > > > > > > > really needed for its responses.
> > > > > > > > 
> > > > > > > > > > > for every 9p request. So not only does it allocate much
> > > > > > > > > > > more
> > > > > > > > > > > memory
> > > > > > > > > > > for
> > > > > > > > > > > every request than actually required (i.e. say 9pfs was
> > > > > > > > > > > mounted
> > > > > > > > > > > with
> > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > need 1k
> > > > > > > > > > > would
> > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > which
> > > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > > 
> > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > vmalloc()
> > > > > > > > > > situation.
> > > > > > > > 
> > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > kvmalloc()
> > > > > > > > wrapper
> > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > kmalloc()
> > > > > > > > with
> > > > > > > > large msize values immediately on mounting:
> > > > > > > > 
> > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > --- a/net/9p/client.c
> > > > > > > > +++ b/net/9p/client.c
> > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct
> > > > > > > > p9_client
> > > > > > > > *clnt)
> > > > > > > > 
> > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall
> > > > > > > >  *fc,
> > > > > > > >  
> > > > > > > >                          int alloc_msize)
> > > > > > > >  
> > > > > > > >  {
> > > > > > > > 
> > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize)
> > > > > > > > {
> > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > c->msize) {
> > > > > > > > +       if (false) {
> > > > > > > > 
> > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache,
> > > > > > > >                 GFP_NOFS);
> > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > >         
> > > > > > > >         } else {
> > > > > > > > 
> > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > 
> > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > 
> > > > > > > Now I get:
> > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > 
> > > > > > > So, still some work ahead on both ends.
> > > > > > 
> > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > running
> > > > > > stable
> > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > returns
> > > > > > a
> > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an address
> > > > > > that
> > > > > > is
> > > > > > inaccessible from host side, hence that "bogus descriptor" message
> > > > > > by
> > > > > > QEMU.
> > > > > > So I had to split those linear 9p client buffers into sparse ones
> > > > > > (set
> > > > > > of
> > > > > > individual pages).
> > > > > > 
> > > > > > I tested this for some days with various virtio transmission sizes
> > > > > > and
> > > > > > it
> > > > > > works as expected up to 128 MB (more precisely: 128 MB read space
> > > > > > +
> > > > > > 128 MB
> > > > > > write space per virtio round trip message).
> > > > > > 
> > > > > > I did not encounter a show stopper for large virtio transmission
> > > > > > sizes
> > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing,
> > > > > > nor
> > > > > > after reviewing the existing code.
> > > > > > 
> > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > Most of
> > > > > > the
> > > > > > iovec code, both on Linux kernel side and on QEMU side do not have
> > > > > > this
> > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > userland
> > > > > > apps
> > > > > > calling the Linux kernel's syscalls yet.
> > > > > > 
> > > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > > virtio
> > > > > > transmission size limit should not be squeezed into the queue size
> > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > argument
> > > > > > that
> > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > different
> > > > > > things. To outline this, just a quick recap of what happens
> > > > > > exactly
> > > > > > when
> > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > "split"
> > > > > > layout here):
> > > > > > 
> > > > > > ---------- [recap-start] ----------
> > > > > > 
> > > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > > exactly
> > > > > > *one* position of the two available/used ring buffers. The actual
> > > > > > descriptor table though, containing all the DMA addresses of the
> > > > > > message
> > > > > > bulk data, is allocated just in time for each round trip message.
> > > > > > Say,
> > > > > > it
> > > > > > is the first message sent, it yields in the following structure:
> > > > > > 
> > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > 
> > > > > >    +-+              +-+           +-----------------+
> > > > > >    
> > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > >    
> > > > > >    +-+              |d|--------+  +-----------------+
> > > > > >    
> > > > > >    | |              |d|------+ |
> > > > > >    
> > > > > >    +-+               .       | |  +-----------------+
> > > > > >    
> > > > > >    | |               .       | +->| Bulk data block |
> > > > > >     
> > > > > >     .                .       |    +-----------------+
> > > > > >     .               |d|-+    |
> > > > > >     .               +-+ |    |    +-----------------+
> > > > > >     
> > > > > >    | |                  |    +--->| Bulk data block |
> > > > > >    
> > > > > >    +-+                  |         +-----------------+
> > > > > >    
> > > > > >    | |                  |                 .
> > > > > >    
> > > > > >    +-+                  |                 .
> > > > > >    
> > > > > >                         |                 .
> > > > > >                         |         
> > > > > >                         |         +-----------------+
> > > > > >                         
> > > > > >                         +-------->| Bulk data block |
> > > > > >                         
> > > > > >                                   +-----------------+
> > > > > > 
> > > > > > Legend:
> > > > > > D: pre-allocated descriptor
> > > > > > d: just in time allocated descriptor
> > > > > > -->: memory pointer (DMA)
> > > > > > 
> > > > > > The bulk data blocks are allocated by the respective device driver
> > > > > > above
> > > > > > virtio subsystem level (guest side).
> > > > > > 
> > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > size of
> > > > > > a
> > > > > > ring buffer.
> > > > > > 
> > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > pointer;
> > > > > > defined
> > > > > > as:
> > > > > > 
> > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together
> > > > > > via
> > > > > > "next". */ struct vring_desc {
> > > > > > 
> > > > > > 	/* Address (guest-physical). */
> > > > > > 	__virtio64 addr;
> > > > > > 	/* Length. */
> > > > > > 	__virtio32 len;
> > > > > > 	/* The flags as indicated above. */
> > > > > > 	__virtio16 flags;
> > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > 	__virtio16 next;
> > > > > > 
> > > > > > };
> > > > > > 
> > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > sending a
> > > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > > allocated
> > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > separate
> > > > > > guest allocated bulk data blocks that will be used by host side to
> > > > > > place
> > > > > > its response bulk data), and the "used" ring buffer is for sending
> > > > > > host->guest to let guest know about host's response and that it
> > > > > > could
> > > > > > now
> > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > subsequently.
> > > > > > 
> > > > > > ---------- [recap-end] ----------
> > > > > > 
> > > > > > So the "queue size" actually defines the ringbuffer size. It does
> > > > > > not
> > > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > > defines
> > > > > > how many pending messages can be pushed into either one ringbuffer
> > > > > > before
> > > > > > the other side would need to wait until the counter side would
> > > > > > step up
> > > > > > (i.e. ring buffer full).
> > > > > > 
> > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > actually
> > > > > > is)
> > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > with
> > > > > > each
> > > > > > virtio round trip message.
> > > > > > 
> > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > directly
> > > > > > associative with its maximum amount of active 9p requests the
> > > > > > server
> > > > > > could
> > > > > > 
> > > > > > handle simultaniously:
> > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev,
> > > > > >   MAX_REQ,
> > > > > >   
> > > > > >                                  handle_9p_output);
> > > > > > 
> > > > > > So if I would change it like this, just for the purpose to
> > > > > > increase
> > > > > > the
> > > > > > max. virtio transmission size:
> > > > > > 
> > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > virtio_9p_device_realize(DeviceState
> > > > > > *dev,
> > > > > > Error **errp)>
> > > > > > 
> > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > >      strlen(s->fsconf.tag);
> > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > > >      
> > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > 
> > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > > 
> > > > > >  }
> > > > > > 
> > > > > > Then it would require additional synchronization code on both ends
> > > > > > and
> > > > > > therefore unnecessary complexity, because it would now be possible
> > > > > > that
> > > > > > more requests are pushed into the ringbuffer than server could
> > > > > > handle.
> > > > > > 
> > > > > > There is one potential issue though that probably did justify the
> > > > > > "don't
> > > > > > exceed the queue size" rule:
> > > > > > 
> > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > continuous
> > > > > > buffer via kmalloc_array():
> > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > a086
> > > > > > f7c7
> > > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > 
> > > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array()
> > > > > > call
> > > > > > would
> > > > > > yield in kmalloc(1M) and the latter might fail if guest had highly
> > > > > > fragmented physical memory. For such kind of error case there is
> > > > > > currently a fallback path in virtqueue_add_split() that would then
> > > > > > use
> > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798c
> > > > > > a086
> > > > > > f7c7
> > > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > 
> > > > > > That fallback recovery path would no longer be viable if the queue
> > > > > > size
> > > > > > was
> > > > > > exceeded. There would be alternatives though, e.g. by allowing to
> > > > > > chain
> > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > specs).
> > > > > 
> > > > > Making the maximum number of descriptors independent of the queue
> > > > > size
> > > > > requires a change to the VIRTIO spec since the two values are
> > > > > currently
> > > > > explicitly tied together by the spec.
> > > > 
> > > > Yes, that's what the virtio specs say. But they don't say why, nor did
> > > > I
> > > > hear a reason in this dicussion.
> > > > 
> > > > That's why I invested time reviewing current virtio implementation and
> > > > specs, as well as actually testing exceeding that limit. And as I
> > > > outlined in detail in my previous email, I only found one theoretical
> > > > issue that could be addressed though.
> > > 
> > > I agree that there is a limitation in the VIRTIO spec, but violating the
> > > spec isn't an acceptable solution:
> > > 
> > > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> > > 
> > >    cannot make assumptions about their implementations because it may
> > >    break spec-compliant implementations that you haven't looked at.
> > >    
> > >    Your patches weren't able to increase Queue Size because some device
> > >    implementations break when descriptor chains are too long. This shows
> > >    there is a practical issue even in QEMU.
> > > 
> > > 2. The specific spec violation that we discussed creates the problem
> > > 
> > >    that drivers can no longer determine the maximum description chain
> > >    length. This in turn will lead to more implementation-specific
> > >    assumptions being baked into drivers and cause problems with
> > >    interoperability and future changes.
> > > 
> > > The spec needs to be extended instead. I included an idea for how to do
> > > that below.
> > 
> > Sure, I just wanted to see if there was a non-neglectable "hard" show
> > stopper per se that I probably haven't seen yet. I have not questioned
> > aiming a clean solution.
> > 
> > Thanks for the clarification!
> > 
> > > > > Before doing that, are there benchmark results showing that 1 MB vs
> > > > > 128
> > > > > MB produces a performance improvement? I'm asking because if
> > > > > performance
> > > > > with 1 MB is good then you can probably do that without having to
> > > > > change
> > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128 MB
> > > > > for
> > > > > good performance when it's ultimately implemented on top of disk and
> > > > > network I/O that have lower size limits.
> > > > 
> > > > First some numbers, linear reading a 12 GB file:
> > > > 
> > > > msize    average      notes
> > > > 
> > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches
> > > > [1]
> > > 
> > > How many descriptors are used? 4 MB can be covered by a single
> > > descriptor if the data is physically contiguous in memory, so this data
> > > doesn't demonstrate a need for more descriptors.
> > 
> > No, in the last couple years there was apparently no kernel version that
> > used just one descriptor, nor did my benchmarked version. Even though the
> > Linux 9p client uses (yet) simple linear buffers (contiguous physical
> > memory) on 9p client level, these are however split into PAGE_SIZE chunks
> > by function pack_sg_list() [1] before being fed to virtio level:
> > 
> > static unsigned int rest_of_page(void *data)
> > {
> > 
> > 	return PAGE_SIZE - offset_in_page(data);
> > 
> > }
> > ...
> > static int pack_sg_list(struct scatterlist *sg, int start,
> > 
> > 			int limit, char *data, int count)
> > 
> > {
> > 
> > 	int s;
> > 	int index = start;
> > 	
> > 	while (count) {
> > 	
> > 		s = rest_of_page(data);
> > 		...
> > 		sg_set_buf(&sg[index++], data, s);
> > 		count -= s;
> > 		data += s;
> > 	
> > 	}
> > 	...
> > 
> > }
> > 
> > [1]
> > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d1
> > 5c476a/net/9p/trans_virtio.c#L171
> > 
> > So when sending 4MB over virtio wire, it would yield in 1k descriptors
> > ATM.
> > 
> > I have wondered about this before, but did not question it, because due to
> > the cross-platform nature I couldn't say for certain whether that's
> > probably needed somewhere. I mean for the case virtio-PCI I know for sure
> > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if
> > that applies to all buses and architectures.
> 
> VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> so I don't think there is a limit at the VIRTIO level.

So you are viewing this purely from the virtio specs' PoV: in the sense that
if it is not prohibited by the virtio specs, then it should work. Maybe.

> If this function coalesces adjacent pages then the descriptor chain
> length issues could be reduced.
> 
> > > > But again, this is not just about performance. My conclusion as
> > > > described
> > > > in my previous email is that virtio currently squeezes
> > > > 
> > > > 	"max. simultanious amount of bulk messages"
> > > > 
> > > > vs.
> > > > 
> > > > 	"max. bulk data transmission size per bulk messaage"
> > > > 
> > > > into the same configuration parameter, which is IMO inappropriate and
> > > > hence
> > > > splitting them into 2 separate parameters when creating a queue makes
> > > > sense, independent of the performance benchmarks.
> > > > 
> > > > [1]
> > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyt
> > > > e.c
> > > > om/
> > > 
> > > Some devices effectively already have this because the device advertises
> > > a maximum number of descriptors via device-specific mechanisms like the
> > > struct virtio_blk_config seg_max field. But today these fields can only
> > > reduce the maximum descriptor chain length because the spec still limits
> > > the length to Queue Size.
> > > 
> > > We can build on this approach to raise the length above Queue Size. This
> > > approach has the advantage that the maximum number of segments isn't per
> > > device or per virtqueue, it's fine-grained. If the device supports two
> > > requests types then different max descriptor chain limits could be given
> > > for them by introducing two separate configuration space fields.
> > > 
> > > Here are the corresponding spec changes:
> > > 
> > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > 
> > >    to indicate that indirect descriptor table size and maximum
> > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > >    there still needs to be a limit like 2^15?)
> > 
> > Sounds good to me!
> > 
> > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > 
> > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > "next". */ struct vring_desc {
> > 
> >         /* Address (guest-physical). */
> >         __virtio64 addr;
> >         /* Length. */
> >         __virtio32 len;
> >         /* The flags as indicated above. */
> >         __virtio16 flags;
> >         /* We chain unused descriptors via this, too */
> >         __virtio16 next;
> > 
> > };
> 
> Yes, Split Virtqueues have a fundamental limit on indirect table size
> due to the "next" field. Packed Virtqueue descriptors don't have a
> "next" field so descriptor chains could be longer in theory (currently
> forbidden by the spec).
> 
> > > One thing that's messy is that we've been discussing the maximum
> > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > aware of contiguous memory. It may be necessary to extend the 9p driver
> > > code to size requests not just according to their length in bytes but
> > > also according to the descriptor chain length. That's how the Linux
> > > block layer deals with queue limits (struct queue_limits max_segments vs
> > > max_hw_sectors).
> > 
> > Hmm, can't follow on that one. For what should that be needed in case of
> > 9p? My plan was to limit msize by 9p client simply at session start to
> > whatever is the max. amount virtio descriptors supported by host and
> > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > actually does ATM (see above). So you think that should be changed to
> > e.g. just one descriptor for 4MB, right?
> 
> Limiting msize to the 9p transport device's maximum number of
> descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> because it doesn't take advantage of contiguous memory. I suggest
> leaving msize alone, adding a separate limit at which requests are split
> according to the maximum descriptor chain length, and tweaking
> pack_sg_list() to coalesce adjacent pages.
> 
> That way msize can be large without necessarily using lots of
> descriptors (depending on the memory layout).

That was actually a tempting solution, because it would not require changes
to the virtio specs (at least for a while) and it would also work with older
QEMU versions. And for the pack_sg_list() portion of the code it would be
easy and work well, as the buffer passed to pack_sg_list() is contiguous
already.
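
Just to illustrate what I mean, a minimal sketch (not an actual patch) of
such a variant for the non-zero-copy path, assuming the buffer is physically
contiguous as with kmalloc(); the function name and the "max_seg" cap (e.g.
a device-advertised segment limit) are made up for illustration:

static int pack_sg_list_contig(struct scatterlist *sg, int start,
			       int limit, char *data, int count,
			       int max_seg)
{
	int index = start;

	while (count) {
		/* One sg entry can cover many pages of a contiguous buffer. */
		int s = min(count, max_seg);

		BUG_ON(index >= limit);
		/* Make sure we don't terminate early. */
		sg_unmark_end(&sg[index]);
		sg_set_buf(&sg[index++], data, s);
		data += s;
		count -= s;
	}
	if (index - start)
		sg_mark_end(&sg[index - 1]);
	return index - start;
}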

However, I just realized that this would be trickier for the zero-copy version
of the code. The ZC version already uses individual pages (struct page, hence
PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1] in
combination with p9_get_mapped_pages() [2]:

[1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
[2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309

So that would require much more work and code to sort and coalesce individual
pages into physically contiguous ranges just for the sake of reducing virtio
descriptors. And there is no guarantee that this is even possible: the kernel
may simply return a non-contiguous set of pages, which would still end up
exceeding the virtio descriptor limit again.
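
For illustration only, a rough sketch of what a coalescing variant of
pack_sg_list_p() might look like; the function name is made up, it assumes
the same pinned page array as returned by p9_get_mapped_pages(), and it can
merge entries only when consecutive pages happen to be physically adjacent:

static int pack_sg_list_p_coalesced(struct scatterlist *sg, int start,
				    int limit, struct page **pdata,
				    int nr_pages, size_t offs, int count)
{
	int index = start;
	size_t data_off = offs;
	int i = 0;

	while (nr_pages && count > 0) {
		struct page *first = pdata[i];
		size_t seg_off = data_off;
		size_t len = min_t(size_t, PAGE_SIZE - data_off, (size_t)count);

		count -= len;
		data_off = 0;
		i++;
		nr_pages--;

		/*
		 * Extend the entry while the next pinned page is physically
		 * adjacent; there is no guarantee that any of them are.
		 */
		while (nr_pages && count > 0 &&
		       page_to_pfn(pdata[i]) == page_to_pfn(pdata[i - 1]) + 1) {
			size_t s = min_t(size_t, (size_t)PAGE_SIZE, (size_t)count);

			len += s;
			count -= s;
			i++;
			nr_pages--;
		}

		BUG_ON(index >= limit);
		sg_unmark_end(&sg[index]);
		sg_set_page(&sg[index++], first, len, seg_off);
	}
	if (index - start)
		sg_mark_end(&sg[index - 1]);
	return index - start;
}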

So it looks like it is probably still easier and more realistic to just add a
virtio capability for now that allows exceeding the current descriptor limit.

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 9, 2021, 10:56 a.m. UTC | #18
On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > [...]
> > VIRTIO does not limit the descriptor len field to PAGE_SIZE,
> > so I don't think there is a limit at the VIRTIO level.
> 
> So you are viewing this purely from the virtio specs' PoV: in the sense that
> if it is not prohibited by the virtio specs, then it should work. Maybe.

Limitations must be specified either in the 9P protocol or the VIRTIO
specification. Drivers and devices will not be able to operate correctly
if there are limitations that aren't covered by the specs.

Do you have something in mind that isn't covered by the specs?

> > If this function coalesces adjacent pages then the descriptor chain
> > length issues could be reduced.
> > 
> > > > > But again, this is not just about performance. My conclusion as
> > > > > described
> > > > > in my previous email is that virtio currently squeezes
> > > > > 
> > > > > 	"max. simultanious amount of bulk messages"
> > > > > 
> > > > > vs.
> > > > > 
> > > > > 	"max. bulk data transmission size per bulk messaage"
> > > > > 
> > > > > into the same configuration parameter, which is IMO inappropriate and
> > > > > hence
> > > > > splitting them into 2 separate parameters when creating a queue makes
> > > > > sense, independent of the performance benchmarks.
> > > > > 
> > > > > [1]
> > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyt
> > > > > e.c
> > > > > om/
> > > > 
> > > > Some devices effectively already have this because the device advertises
> > > > a maximum number of descriptors via device-specific mechanisms like the
> > > > struct virtio_blk_config seg_max field. But today these fields can only
> > > > reduce the maximum descriptor chain length because the spec still limits
> > > > the length to Queue Size.
> > > > 
> > > > We can build on this approach to raise the length above Queue Size. This
> > > > approach has the advantage that the maximum number of segments isn't per
> > > > device or per virtqueue, it's fine-grained. If the device supports two
> > > > requests types then different max descriptor chain limits could be given
> > > > for them by introducing two separate configuration space fields.
> > > > 
> > > > Here are the corresponding spec changes:
> > > > 
> > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > > 
> > > >    to indicate that indirect descriptor table size and maximum
> > > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > > >    there still needs to be a limit like 2^15?)
> > > 
> > > Sounds good to me!
> > > 
> > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > 
> > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > "next". */ struct vring_desc {
> > > 
> > >         /* Address (guest-physical). */
> > >         __virtio64 addr;
> > >         /* Length. */
> > >         __virtio32 len;
> > >         /* The flags as indicated above. */
> > >         __virtio16 flags;
> > >         /* We chain unused descriptors via this, too */
> > >         __virtio16 next;
> > > 
> > > };
> > 
> > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > due to the "next" field. Packed Virtqueue descriptors don't have a
> > "next" field so descriptor chains could be longer in theory (currently
> > forbidden by the spec).
> > 
> > > > One thing that's messy is that we've been discussing the maximum
> > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > aware of contiguous memory. It may be necessary to extend the 9p driver
> > > > code to size requests not just according to their length in bytes but
> > > > also according to the descriptor chain length. That's how the Linux
> > > > block layer deals with queue limits (struct queue_limits max_segments vs
> > > > max_hw_sectors).
> > > 
> > > Hmm, can't follow on that one. For what should that be needed in case of
> > > 9p? My plan was to limit msize by 9p client simply at session start to
> > > whatever is the max. amount virtio descriptors supported by host and
> > > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > > actually does ATM (see above). So you think that should be changed to
> > > e.g. just one descriptor for 4MB, right?
> > 
> > Limiting msize to the 9p transport device's maximum number of
> > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > because it doesn't take advantage of contiguous memory. I suggest
> > leaving msize alone, adding a separate limit at which requests are split
> > according to the maximum descriptor chain length, and tweaking
> > pack_sg_list() to coalesce adjacent pages.
> > 
> > That way msize can be large without necessarily using lots of
> > descriptors (depending on the memory layout).
> 
> That was actually a tempting solution. Because it would neither require
> changes to the virtio specs (at least for a while) and it would also work with
> older QEMU versions. And for that pack_sg_list() portion of the code it would
> work well and easy as the buffer passed to pack_sg_list() is contiguous
> already.
> 
> However I just realized for the zero-copy version of the code that would be
> more tricky. The ZC version already uses individual pages (struct page, hence
> PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1] in
> combination with p9_get_mapped_pages() [2]
> 
> [1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
> [2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309
> 
> So that would require much more work and code trying to sort and coalesce
> individual pages to contiguous physical memory for the sake of reducing virtio
> descriptors. And there is no guarantee that this is even possible. The kernel
> may simply return a non-contiguous set of pages which would eventually end up
> exceeding the virtio descriptor limit again.

Order must be preserved so pages cannot be sorted by physical address.
How about simply coalescing when pages are adjacent?
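
A rough, untested sketch just to illustrate the idea (the helper name and
details are made up), based on the pack_sg_list() code quoted above:

static int pack_sg_list_coalesced(struct scatterlist *sg, int start,
                                  int limit, char *data, int count)
{
        int index = start;
        phys_addr_t prev_end = 0;

        while (count) {
                int s = min_t(int, count, rest_of_page(data));

                if (index > start && virt_to_phys(data) == prev_end) {
                        /* physically contiguous with the previous chunk:
                         * grow that entry instead of consuming another
                         * descriptor */
                        sg[index - 1].length += s;
                } else {
                        BUG_ON(index >= limit);
                        sg_unmark_end(&sg[index]);
                        sg_set_buf(&sg[index++], data, s);
                }
                prev_end = virt_to_phys(data) + s;
                data += s;
                count -= s;
        }
        if (index - start)
                sg_mark_end(&sg[index - 1]);
        return index - start;
}

For a kmalloc()ed linear buffer all chunks are physically contiguous anyway,
so this would collapse into a single long descriptor; the zero-copy path
would only benefit where pinned pages happen to be adjacent.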

> So looks like it was probably still easier and realistic to just add virtio
> capabilities for now for allowing to exceed current descriptor limit.

I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
fine under today's limits while virtio-9p needs a much higher limit to
achieve good performance. Maybe there is an issue in a layer above the
vring that's causing the virtio-9p performance you've observed?

Stefan
Christian Schoenebeck Nov. 9, 2021, 1:09 p.m. UTC | #19
On Dienstag, 9. November 2021 11:56:35 CET Stefan Hajnoczi wrote:
> On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck 
wrote:
> > > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian 
Schoenebeck wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian 
Schoenebeck wrote:
> > > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > > 
> > > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian 
Schoenebeck wrote:
> > > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan 
Hajnoczi wrote:
> > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200,
> > > > > > > > > > > > > > Christian
> > > > > > > > > > > > > > Schoenebeck
> > > > > > > > > 
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > At the moment the maximum transfer size with
> > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > limited
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > 4M
> > > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this
> > > > > > > > > > > > > > > limit to
> > > > > > > > > > > > > > > its
> > > > > > > > > > > > > > > maximum
> > > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > > > pages)
> > > > > > > > > > > > > > > according
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/c
> > > > > > > > > > > > > > > s01/
> > > > > > > > > > > > > > > virt
> > > > > > > > > > > > > > > io-v
> > > > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > > > cs
> > > > > > > > > > > > > > > 01
> > > > > > > > > > > > > > > .html#
> > > > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > > 
> > > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > > Christian
> > > > > > > > > > > !
> > > > > > > > > > > 
> > > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > 128
> > > > > > > > > > > > > > elements
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > > > (WIP);
> > > > > > > > > > > > > current
> > > > > > > > > > > > > kernel
> > > > > > > > > > > > > patches:
> > > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.
> > > > > > > > > > > > > linu
> > > > > > > > > > > > > x_os
> > > > > > > > > > > > > s@cr
> > > > > > > > > > > > > udeb
> > > > > > > > > > > > > yt
> > > > > > > > > > > > > e.
> > > > > > > > > > > > > com/>
> > > > > > > > > > > > 
> > > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > > today
> > > > > > > > > > > > the
> > > > > > > > > > > > driver
> > > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > > introduces a
> > > > > > > > > > > > spec
> > > > > > > > > > > > violation. Not fixing existing spec violations is
> > > > > > > > > > > > okay,
> > > > > > > > > > > > but
> > > > > > > > > > > > adding
> > > > > > > > > > > > new
> > > > > > > > > > > > ones is a red flag. I think we need to figure out a
> > > > > > > > > > > > clean
> > > > > > > > > > > > solution.
> > > > > > > > > > 
> > > > > > > > > > Nobody has reviewed the kernel patches yet. My main
> > > > > > > > > > concern
> > > > > > > > > > therefore
> > > > > > > > > > actually is that the kernel patches are already too
> > > > > > > > > > complex,
> > > > > > > > > > because
> > > > > > > > > > the
> > > > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > > > patches on
> > > > > > > > > > kernel
> > > > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > > > 
> > > > > > > > > > Another reason for me to catch up on reading current
> > > > > > > > > > kernel
> > > > > > > > > > code
> > > > > > > > > > and
> > > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP,
> > > > > > > > > > independent
> > > > > > > > > > of
> > > > > > > > > > this
> > > > > > > > > > issue.
> > > > > > > > > > 
> > > > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > > > drop
> > > > > > > > > > patch
> > > > > > > > > > 7
> > > > > > > > > > entirely as it is probably just overkill. Patch 4 is then
> > > > > > > > > > the
> > > > > > > > > > biggest
> > > > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > > > would
> > > > > > > > > > make
> > > > > > > > > > sense to squash with patch 3.
> > > > > > > > > > 
> > > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > will
> > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > > > >   iovecs
> > > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hmm, which makes me wonder why I never encountered
> > > > > > > > > > > > > this
> > > > > > > > > > > > > error
> > > > > > > > > > > > > during
> > > > > > > > > > > > > testing.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > > > backend
> > > > > > > > > > > > > in
> > > > > > > > > > > > > practice,
> > > > > > > > > > > > > so
> > > > > > > > > > > > > that v9fs_read() call would translate for most
> > > > > > > > > > > > > people to
> > > > > > > > > > > > > this
> > > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > > 
> > > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > > > *fs,
> > > > > > > > > > > > > 
> > > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > > >                             int iovcnt, off_t
> > > > > > > > > > > > >                             offset)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > {
> > > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > #else
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > > >     
> > > > > > > > > > > > >         return err;
> > > > > > > > > > > > >     
> > > > > > > > > > > > >     } else {
> > > > > > > > > > > > >     
> > > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > > >     
> > > > > > > > > > > > >     }
> > > > > > > > > > > > > 
> > > > > > > > > > > > > #endif
> > > > > > > > > > > > > }
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Unless I misunderstood the code, neither side can
> > > > > > > > > > > > > > take
> > > > > > > > > > > > > > advantage
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > > > possible
> > > > > > > > > > > > > explanation
> > > > > > > > > > > > > might be that preadv() already has this wrapped into
> > > > > > > > > > > > > a
> > > > > > > > > > > > > loop
> > > > > > > > > > > > > in
> > > > > > > > > > > > > its
> > > > > > > > > > > > > implementation to circumvent a limit like IOV_MAX.
> > > > > > > > > > > > > It
> > > > > > > > > > > > > might
> > > > > > > > > > > > > be
> > > > > > > > > > > > > another
> > > > > > > > > > > > > "it
> > > > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > > > resolve.
> > > > > > > > > > > > > If
> > > > > > > > > > > > > you
> > > > > > > > > > > > > look
> > > > > > > > > > > > > at
> > > > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that
> > > > > > > > > > > > > it
> > > > > > > > > > > > > basically
> > > > > > > > > > > > > does
> > > > > > > > > > > > > this ATM> >
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > > 
> > > > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > > > (client
> > > > > > > > > > > request)
> > > > > > > > > > > and
> > > > > > > > > > > once for the R message (server answer). The 9p driver
> > > > > > > > > > > could
> > > > > > > > > > > adjust
> > > > > > > > > > > the
> > > > > > > > > > > size
> > > > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > > > allocating
> > > > > > > > > > > the
> > > > > > > > > > > full
> > > > > > > > > > > msize. R message size is not known though.
> > > > > > > > > > 
> > > > > > > > > > Would it make sense adding a second virtio ring, dedicated
> > > > > > > > > > to
> > > > > > > > > > server
> > > > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > > > appropriate
> > > > > > > > > > exact sizes for each response type. So server could just
> > > > > > > > > > push
> > > > > > > > > > space
> > > > > > > > > > that's
> > > > > > > > > > really needed for its responses.
> > > > > > > > > > 
> > > > > > > > > > > > > for every 9p request. So not only does it allocate
> > > > > > > > > > > > > much
> > > > > > > > > > > > > more
> > > > > > > > > > > > > memory
> > > > > > > > > > > > > for
> > > > > > > > > > > > > every request than actually required (i.e. say 9pfs
> > > > > > > > > > > > > was
> > > > > > > > > > > > > mounted
> > > > > > > > > > > > > with
> > > > > > > > > > > > > msize=8M, then a 9p request that actually would just
> > > > > > > > > > > > > need 1k
> > > > > > > > > > > > > would
> > > > > > > > > > > > > nevertheless allocate 8M), but also it allocates >
> > > > > > > > > > > > > PAGE_SIZE,
> > > > > > > > > > > > > which
> > > > > > > > > > > > > obviously may fail at any time.>
> > > > > > > > > > > > 
> > > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs
> > > > > > > > > > > > vmalloc()
> > > > > > > > > > > > situation.
> > > > > > > > > > 
> > > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the
> > > > > > > > > > kvmalloc()
> > > > > > > > > > wrapper
> > > > > > > > > > as a quick & dirty test, but it crashed in the same way as
> > > > > > > > > > kmalloc()
> > > > > > > > > > with
> > > > > > > > > > large msize values immediately on mounting:
> > > > > > > > > > 
> > > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts,
> > > > > > > > > > struct
> > > > > > > > > > p9_client
> > > > > > > > > > *clnt)
> > > > > > > > > > 
> > > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct
> > > > > > > > > >  p9_fcall
> > > > > > > > > >  *fc,
> > > > > > > > > >  
> > > > > > > > > >                          int alloc_msize)
> > > > > > > > > >  
> > > > > > > > > >  {
> > > > > > > > > > 
> > > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > > c->msize)
> > > > > > > > > > {
> > > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize ==
> > > > > > > > > > c->msize) {
> > > > > > > > > > +       if (false) {
> > > > > > > > > > 
> > > > > > > > > >                 fc->sdata =
> > > > > > > > > >                 kmem_cache_alloc(c->fcall_cache,
> > > > > > > > > >                 GFP_NOFS);
> > > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > > >         
> > > > > > > > > >         } else {
> > > > > > > > > > 
> > > > > > > > > > -               fc->sdata = kmalloc(alloc_msize,
> > > > > > > > > > GFP_NOFS);
> > > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize,
> > > > > > > > > > GFP_NOFS);
> > > > > > > > > 
> > > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > > 
> > > > > > > > > Now I get:
> > > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > > 
> > > > > > > > > So, still some work ahead on both ends.
> > > > > > > > 
> > > > > > > > Few hacks later (only changes on 9p client side) I got this
> > > > > > > > running
> > > > > > > > stable
> > > > > > > > now. The reason for the virtio error above was that kvmalloc()
> > > > > > > > returns
> > > > > > > > a
> > > > > > > > non-logical kernel address for any kvmalloc(>4M), i.e. an
> > > > > > > > address
> > > > > > > > that
> > > > > > > > is
> > > > > > > > inaccessible from host side, hence that "bogus descriptor"
> > > > > > > > message
> > > > > > > > by
> > > > > > > > QEMU.
> > > > > > > > So I had to split those linear 9p client buffers into sparse
> > > > > > > > ones
> > > > > > > > (set
> > > > > > > > of
> > > > > > > > individual pages).
> > > > > > > > 
> > > > > > > > I tested this for some days with various virtio transmission
> > > > > > > > sizes
> > > > > > > > and
> > > > > > > > it
> > > > > > > > works as expected up to 128 MB (more precisely: 128 MB read
> > > > > > > > space
> > > > > > > > +
> > > > > > > > 128 MB
> > > > > > > > write space per virtio round trip message).
> > > > > > > > 
> > > > > > > > I did not encounter a show stopper for large virtio
> > > > > > > > transmission
> > > > > > > > sizes
> > > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of
> > > > > > > > testing,
> > > > > > > > nor
> > > > > > > > after reviewing the existing code.
> > > > > > > > 
> > > > > > > > About IOV_MAX: that's apparently not an issue on virtio level.
> > > > > > > > Most of
> > > > > > > > the
> > > > > > > > iovec code, both on Linux kernel side and on QEMU side do not
> > > > > > > > have
> > > > > > > > this
> > > > > > > > limitation. It is apparently however indeed a limitation for
> > > > > > > > userland
> > > > > > > > apps
> > > > > > > > calling the Linux kernel's syscalls yet.
> > > > > > > > 
> > > > > > > > Stefan, as it stands now, I am even more convinced that the
> > > > > > > > upper
> > > > > > > > virtio
> > > > > > > > transmission size limit should not be squeezed into the queue
> > > > > > > > size
> > > > > > > > argument of virtio_add_queue(). Not because of the previous
> > > > > > > > argument
> > > > > > > > that
> > > > > > > > it would waste space (~1MB), but rather because they are two
> > > > > > > > different
> > > > > > > > things. To outline this, just a quick recap of what happens
> > > > > > > > exactly
> > > > > > > > when
> > > > > > > > a bulk message is pushed over the virtio wire (assuming virtio
> > > > > > > > "split"
> > > > > > > > layout here):
> > > > > > > > 
> > > > > > > > ---------- [recap-start] ----------
> > > > > > > > 
> > > > > > > > For each bulk message sent guest <-> host, exactly *one* of
> > > > > > > > the
> > > > > > > > pre-allocated descriptors is taken and placed (subsequently)
> > > > > > > > into
> > > > > > > > exactly
> > > > > > > > *one* position of the two available/used ring buffers. The
> > > > > > > > actual
> > > > > > > > descriptor table though, containing all the DMA addresses of
> > > > > > > > the
> > > > > > > > message
> > > > > > > > bulk data, is allocated just in time for each round trip
> > > > > > > > message.
> > > > > > > > Say,
> > > > > > > > it
> > > > > > > > is the first message sent, it yields in the following
> > > > > > > > structure:
> > > > > > > > 
> > > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > > 
> > > > > > > >    +-+              +-+           +-----------------+
> > > > > > > >    
> > > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > > >    
> > > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > > >    
> > > > > > > >    | |              |d|------+ |
> > > > > > > >    
> > > > > > > >    +-+               .       | |  +-----------------+
> > > > > > > >    
> > > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > > >     
> > > > > > > >     .                .       |    +-----------------+
> > > > > > > >     .               |d|-+    |
> > > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > > >     
> > > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > > >    
> > > > > > > >    +-+                  |         +-----------------+
> > > > > > > >    
> > > > > > > >    | |                  |                 .
> > > > > > > >    
> > > > > > > >    +-+                  |                 .
> > > > > > > >    
> > > > > > > >                         |                 .
> > > > > > > >                         |         
> > > > > > > >                         |         +-----------------+
> > > > > > > >                         
> > > > > > > >                         +-------->| Bulk data block |
> > > > > > > >                         
> > > > > > > >                                   +-----------------+
> > > > > > > > 
> > > > > > > > Legend:
> > > > > > > > D: pre-allocated descriptor
> > > > > > > > d: just in time allocated descriptor
> > > > > > > > -->: memory pointer (DMA)
> > > > > > > > 
> > > > > > > > The bulk data blocks are allocated by the respective device
> > > > > > > > driver
> > > > > > > > above
> > > > > > > > virtio subsystem level (guest side).
> > > > > > > > 
> > > > > > > > There are exactly as many descriptors pre-allocated (D) as the
> > > > > > > > size of
> > > > > > > > a
> > > > > > > > ring buffer.
> > > > > > > > 
> > > > > > > > A "descriptor" is more or less just a chainable DMA memory
> > > > > > > > pointer;
> > > > > > > > defined
> > > > > > > > as:
> > > > > > > > 
> > > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain
> > > > > > > > together
> > > > > > > > via
> > > > > > > > "next". */ struct vring_desc {
> > > > > > > > 
> > > > > > > > 	/* Address (guest-physical). */
> > > > > > > > 	__virtio64 addr;
> > > > > > > > 	/* Length. */
> > > > > > > > 	__virtio32 len;
> > > > > > > > 	/* The flags as indicated above. */
> > > > > > > > 	__virtio16 flags;
> > > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > > 	__virtio16 next;
> > > > > > > > 
> > > > > > > > };
> > > > > > > > 
> > > > > > > > There are 2 ring buffers; the "available" ring buffer is for
> > > > > > > > sending a
> > > > > > > > message guest->host (which will transmit DMA addresses of
> > > > > > > > guest
> > > > > > > > allocated
> > > > > > > > bulk data blocks that are used for data sent to device, and
> > > > > > > > separate
> > > > > > > > guest allocated bulk data blocks that will be used by host
> > > > > > > > side to
> > > > > > > > place
> > > > > > > > its response bulk data), and the "used" ring buffer is for
> > > > > > > > sending
> > > > > > > > host->guest to let guest know about host's response and that
> > > > > > > > it
> > > > > > > > could
> > > > > > > > now
> > > > > > > > safely consume and then deallocate the bulk data blocks
> > > > > > > > subsequently.
> > > > > > > > 
> > > > > > > > ---------- [recap-end] ----------
> > > > > > > > 
> > > > > > > > So the "queue size" actually defines the ringbuffer size. It
> > > > > > > > does
> > > > > > > > not
> > > > > > > > define the maximum amount of descriptors. The "queue size"
> > > > > > > > rather
> > > > > > > > defines
> > > > > > > > how many pending messages can be pushed into either one
> > > > > > > > ringbuffer
> > > > > > > > before
> > > > > > > > the other side would need to wait until the counter side would
> > > > > > > > step up
> > > > > > > > (i.e. ring buffer full).
> > > > > > > > 
> > > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE
> > > > > > > > actually
> > > > > > > > is)
> > > > > > > > OTOH defines the max. bulk data size that could be transmitted
> > > > > > > > with
> > > > > > > > each
> > > > > > > > virtio round trip message.
> > > > > > > > 
> > > > > > > > And in fact, 9p currently handles the virtio "queue size" as
> > > > > > > > directly
> > > > > > > > associative with its maximum amount of active 9p requests the
> > > > > > > > server
> > > > > > > > could
> > > > > > > > 
> > > > > > > > handle simultaniously:
> > > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq =
> > > > > > > >   virtio_add_queue(vdev,
> > > > > > > >   MAX_REQ,
> > > > > > > >   
> > > > > > > >                                  handle_9p_output);
> > > > > > > > 
> > > > > > > > So if I would change it like this, just for the purpose to
> > > > > > > > increase
> > > > > > > > the
> > > > > > > > max. virtio transmission size:
> > > > > > > > 
> > > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > > @@ -218,7 +218,7 @@ static void
> > > > > > > > virtio_9p_device_realize(DeviceState
> > > > > > > > *dev,
> > > > > > > > Error **errp)>
> > > > > > > > 
> > > > > > > >      v->config_size = sizeof(struct virtio_9p_config) +
> > > > > > > >      strlen(s->fsconf.tag);
> > > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
> > > > > > > >      v->config_size,
> > > > > > > >      
> > > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > > 
> > > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > > > > > > > handle_9p_output);
> > > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024,
> > > > > > > > handle_9p_output);
> > > > > > > > 
> > > > > > > >  }
> > > > > > > > 
> > > > > > > > Then it would require additional synchronization code on both
> > > > > > > > ends
> > > > > > > > and
> > > > > > > > therefore unnecessary complexity, because it would now be
> > > > > > > > possible
> > > > > > > > that
> > > > > > > > more requests are pushed into the ringbuffer than server could
> > > > > > > > handle.
> > > > > > > > 
> > > > > > > > There is one potential issue though that probably did justify
> > > > > > > > the
> > > > > > > > "don't
> > > > > > > > exceed the queue size" rule:
> > > > > > > > 
> > > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > > continuous
> > > > > > > > buffer via kmalloc_array():
> > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53
> > > > > > > > 798c
> > > > > > > > a086
> > > > > > > > f7c7
> > > > > > > > d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > > 
> > > > > > > > So assuming transmission size of 2 * 128 MB that
> > > > > > > > kmalloc_array()
> > > > > > > > call
> > > > > > > > would
> > > > > > > > yield in kmalloc(1M) and the latter might fail if guest had
> > > > > > > > highly
> > > > > > > > fragmented physical memory. For such kind of error case there
> > > > > > > > is
> > > > > > > > currently a fallback path in virtqueue_add_split() that would
> > > > > > > > then
> > > > > > > > use
> > > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53
> > > > > > > > 798c
> > > > > > > > a086
> > > > > > > > f7c7
> > > > > > > > d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > > 
> > > > > > > > That fallback recovery path would no longer be viable if the
> > > > > > > > queue
> > > > > > > > size
> > > > > > > > was
> > > > > > > > exceeded. There would be alternatives though, e.g. by allowing
> > > > > > > > to
> > > > > > > > chain
> > > > > > > > indirect descriptor tables (currently prohibited by the virtio
> > > > > > > > specs).
> > > > > > > 
> > > > > > > Making the maximum number of descriptors independent of the
> > > > > > > queue
> > > > > > > size
> > > > > > > requires a change to the VIRTIO spec since the two values are
> > > > > > > currently
> > > > > > > explicitly tied together by the spec.
> > > > > > 
> > > > > > Yes, that's what the virtio specs say. But they don't say why, nor
> > > > > > did
> > > > > > I
> > > > > > hear a reason in this dicussion.
> > > > > > 
> > > > > > That's why I invested time reviewing current virtio implementation
> > > > > > and
> > > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > > outlined in detail in my previous email, I only found one
> > > > > > theoretical
> > > > > > issue that could be addressed though.
> > > > > 
> > > > > I agree that there is a limitation in the VIRTIO spec, but violating
> > > > > the
> > > > > spec isn't an acceptable solution:
> > > > > 
> > > > > 1. QEMU and Linux aren't the only components that implement VIRTIO.
> > > > > You
> > > > > 
> > > > >    cannot make assumptions about their implementations because it
> > > > >    may
> > > > >    break spec-compliant implementations that you haven't looked at.
> > > > >    
> > > > >    Your patches weren't able to increase Queue Size because some
> > > > >    device
> > > > >    implementations break when descriptor chains are too long. This
> > > > >    shows
> > > > >    there is a practical issue even in QEMU.
> > > > > 
> > > > > 2. The specific spec violation that we discussed creates the problem
> > > > > 
> > > > >    that drivers can no longer determine the maximum description
> > > > >    chain
> > > > >    length. This in turn will lead to more implementation-specific
> > > > >    assumptions being baked into drivers and cause problems with
> > > > >    interoperability and future changes.
> > > > > 
> > > > > The spec needs to be extended instead. I included an idea for how to
> > > > > do
> > > > > that below.
> > > > 
> > > > Sure, I just wanted to see if there was a non-neglectable "hard" show
> > > > stopper per se that I probably haven't seen yet. I have not questioned
> > > > aiming a clean solution.
> > > > 
> > > > Thanks for the clarification!
> > > > 
> > > > > > > Before doing that, are there benchmark results showing that 1 MB
> > > > > > > vs
> > > > > > > 128
> > > > > > > MB produces a performance improvement? I'm asking because if
> > > > > > > performance
> > > > > > > with 1 MB is good then you can probably do that without having
> > > > > > > to
> > > > > > > change
> > > > > > > VIRTIO and also because it's counter-intuitive that 9p needs 128
> > > > > > > MB
> > > > > > > for
> > > > > > > good performance when it's ultimately implemented on top of disk
> > > > > > > and
> > > > > > > network I/O that have lower size limits.
> > > > > > 
> > > > > > First some numbers, linear reading a 12 GB file:
> > > > > > 
> > > > > > msize    average      notes
> > > > > > 
> > > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel
> > > > > > <=v5.15
> > > > > > 1 MB     2551 MB/s    this msize would already violate virtio
> > > > > > specs
> > > > > > 2 MB     2521 MB/s    this msize would already violate virtio
> > > > > > specs
> > > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel
> > > > > > patches
> > > > > > [1]
> > > > > 
> > > > > How many descriptors are used? 4 MB can be covered by a single
> > > > > descriptor if the data is physically contiguous in memory, so this
> > > > > data
> > > > > doesn't demonstrate a need for more descriptors.
> > > > 
> > > > No, in the last couple years there was apparently no kernel version
> > > > that
> > > > used just one descriptor, nor did my benchmarked version. Even though
> > > > the
> > > > Linux 9p client uses (yet) simple linear buffers (contiguous physical
> > > > memory) on 9p client level, these are however split into PAGE_SIZE
> > > > chunks
> > > > by function pack_sg_list() [1] before being fed to virtio level:
> > > > 
> > > > static unsigned int rest_of_page(void *data)
> > > > {
> > > > 
> > > > 	return PAGE_SIZE - offset_in_page(data);
> > > > 
> > > > }
> > > > ...
> > > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > > 
> > > > 			int limit, char *data, int count)
> > > > 
> > > > {
> > > > 
> > > > 	int s;
> > > > 	int index = start;
> > > > 	
> > > > 	while (count) {
> > > > 	
> > > > 		s = rest_of_page(data);
> > > > 		...
> > > > 		sg_set_buf(&sg[index++], data, s);
> > > > 		count -= s;
> > > > 		data += s;
> > > > 	
> > > > 	}
> > > > 	...
> > > > 
> > > > }
> > > > 
> > > > [1]
> > > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c6
> > > > 3d1
> > > > 5c476a/net/9p/trans_virtio.c#L171
> > > > 
> > > > So when sending 4MB over virtio wire, it would yield in 1k descriptors
> > > > ATM.
> > > > 
> > > > I have wondered about this before, but did not question it, because
> > > > due to
> > > > the cross-platform nature I couldn't say for certain whether that's
> > > > probably needed somewhere. I mean for the case virtio-PCI I know for
> > > > sure
> > > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know
> > > > if
> > > > that applies to all buses and architectures.
> > > 
> > > VIRTIO does not limit descriptor the descriptor len field to PAGE_SIZE,
> > > so I don't think there is a limit at the VIRTIO level.
> > 
> > So you are viewing this purely from virtio specs PoV: in the sense, if it
> > is not prohibited by the virtio specs, then it should work. Maybe.
> 
> Limitations must be specified either in the 9P protocol or the VIRTIO
> specification. Drivers and devices will not be able to operate correctly
> if there are limitations that aren't covered by the specs.
> 
> Do you have something in mind that isn't covered by the specs?

Not sure whether that's something that should be specified by the virtio 
specs; probably not. I simply do not know whether there is any bus or 
architecture that has a limitation on the max. size of a memory block passed 
per DMA address.

> > > If this function coalesces adjacent pages then the descriptor chain
> > > length issues could be reduced.
> > > 
> > > > > > But again, this is not just about performance. My conclusion as
> > > > > > described
> > > > > > in my previous email is that virtio currently squeezes
> > > > > > 
> > > > > > 	"max. simultanious amount of bulk messages"
> > > > > > 
> > > > > > vs.
> > > > > > 
> > > > > > 	"max. bulk data transmission size per bulk messaage"
> > > > > > 
> > > > > > into the same configuration parameter, which is IMO inappropriate
> > > > > > and
> > > > > > hence
> > > > > > splitting them into 2 separate parameters when creating a queue
> > > > > > makes
> > > > > > sense, independent of the performance benchmarks.
> > > > > > 
> > > > > > [1]
> > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crud
> > > > > > ebyt
> > > > > > e.c
> > > > > > om/
> > > > > 
> > > > > Some devices effectively already have this because the device
> > > > > advertises
> > > > > a maximum number of descriptors via device-specific mechanisms like
> > > > > the
> > > > > struct virtio_blk_config seg_max field. But today these fields can
> > > > > only
> > > > > reduce the maximum descriptor chain length because the spec still
> > > > > limits
> > > > > the length to Queue Size.
> > > > > 
> > > > > We can build on this approach to raise the length above Queue Size.
> > > > > This
> > > > > approach has the advantage that the maximum number of segments isn't
> > > > > per
> > > > > device or per virtqueue, it's fine-grained. If the device supports
> > > > > two
> > > > > requests types then different max descriptor chain limits could be
> > > > > given
> > > > > for them by introducing two separate configuration space fields.
> > > > > 
> > > > > Here are the corresponding spec changes:
> > > > > 
> > > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is
> > > > > added
> > > > > 
> > > > >    to indicate that indirect descriptor table size and maximum
> > > > >    descriptor chain length are not limited by Queue Size value.
> > > > >    (Maybe
> > > > >    there still needs to be a limit like 2^15?)
> > > > 
> > > > Sounds good to me!
> > > > 
> > > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > > 
> > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via
> > > > "next". */ struct vring_desc {
> > > > 
> > > >         /* Address (guest-physical). */
> > > >         __virtio64 addr;
> > > >         /* Length. */
> > > >         __virtio32 len;
> > > >         /* The flags as indicated above. */
> > > >         __virtio16 flags;
> > > >         /* We chain unused descriptors via this, too */
> > > >         __virtio16 next;
> > > > 
> > > > };
> > > 
> > > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > > due to the "next" field. Packed Virtqueue descriptors don't have a
> > > "next" field so descriptor chains could be longer in theory (currently
> > > forbidden by the spec).
> > > 
> > > > > One thing that's messy is that we've been discussing the maximum
> > > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > > aware of contiguous memory. It may be necessary to extend the 9p
> > > > > driver
> > > > > code to size requests not just according to their length in bytes
> > > > > but
> > > > > also according to the descriptor chain length. That's how the Linux
> > > > > block layer deals with queue limits (struct queue_limits
> > > > > max_segments vs
> > > > > max_hw_sectors).
> > > > 
> > > > Hmm, can't follow on that one. For what should that be needed in case
> > > > of
> > > > 9p? My plan was to limit msize by 9p client simply at session start to
> > > > whatever is the max. amount virtio descriptors supported by host and
> > > > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > > > actually does ATM (see above). So you think that should be changed to
> > > > e.g. just one descriptor for 4MB, right?
> > > 
> > > Limiting msize to the 9p transport device's maximum number of
> > > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > > because it doesn't take advantage of contiguous memory. I suggest
> > > leaving msize alone, adding a separate limit at which requests are split
> > > according to the maximum descriptor chain length, and tweaking
> > > pack_sg_list() to coalesce adjacent pages.
> > > 
> > > That way msize can be large without necessarily using lots of
> > > descriptors (depending on the memory layout).
> > 
> > That was actually a tempting solution. Because it would neither require
> > changes to the virtio specs (at least for a while) and it would also work
> > with older QEMU versions. And for that pack_sg_list() portion of the code
> > it would work well and easy as the buffer passed to pack_sg_list() is
> > contiguous already.
> > 
> > However I just realized for the zero-copy version of the code that would
> > be
> > more tricky. The ZC version already uses individual pages (struct page,
> > hence PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1]
> > in combination with p9_get_mapped_pages() [2]
> > 
> > [1]
> > https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9
> > ab8389/net/9p/trans_virtio.c#L218 [2]
> > https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9
> > ab8389/net/9p/trans_virtio.c#L309
> > 
> > So that would require much more work and code trying to sort and coalesce
> > individual pages to contiguous physical memory for the sake of reducing
> > virtio descriptors. And there is no guarantee that this is even possible.
> > The kernel may simply return a non-contiguous set of pages which would
> > eventually end up exceeding the virtio descriptor limit again.
> 
> Order must be preserved so pages cannot be sorted by physical address.
> How about simply coalescing when pages are adjacent?

It would help, but not solve the issue we are talking about here: if 99% of 
the cases could successfully merge descriptors to stay below the descriptor 
count limit, but 1% of the cases could not, then this still constitutes a 
severe runtime issue that could trigger at any time.
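
To make the worst case concrete, here is a rough sketch (made-up helper name, 
not actual code) of the kind of check the client would then need after 
pinning the pages:

/* how many virtio descriptors would this pinned page array need if only
 * physically adjacent pages can be merged? */
static int nr_descs_needed(struct page **pages, int nr_pages)
{
        int descs = 0;
        int i;

        for (i = 0; i < nr_pages; i++) {
                if (i == 0 ||
                    page_to_phys(pages[i]) !=
                    page_to_phys(pages[i - 1]) + PAGE_SIZE)
                        descs++;        /* cannot be merged, new descriptor */
        }
        return descs;
}

In the worst case (no two pages adjacent) this still returns nr_pages, i.e. 
1024 descriptors for a 4 MB zero-copy transfer, so the request would have to 
be split down to whatever the ring allows (128 descriptors today) anyway.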

> > So looks like it was probably still easier and realistic to just add
> > virtio
> > capabilities for now for allowing to exceed current descriptor limit.
> 
> I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> fine under today's limits while virtio-9p needs a much higher limit to
> achieve good performance. Maybe there is an issue in a layer above the
> vring that's causing the virtio-9p performance you've observed?

Are you referring to (somewhat) recent benchmarks when saying those would all 
still perform fine today?

Vivek was running detailed benchmarks for virtiofs vs. 9p:
https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html

For the virtio aspect discussed here, only the benchmark configurations 
without cache are relevant (9p-none, vtfs-none), and in this respect the 
situation seems to be quite similar between 9p and virtio-fs. You'll also note 
that once DAX is enabled (vtfs-none-dax), virtio-fs performance apparently 
gets a significant boost, which however seems to correlate with the numbers I 
am getting when running 9p with msize > 300k. Note: Vivek was presumably 
running 9p effectively with msize=300k, as this was the kernel limitation at 
that time.

To put things into perspective: yes, there are known performance aspects in 
9p that can be improved, both on Linux kernel side and on 9p server side in 
QEMU. For instance, the 9p server uses coroutines [1] and currently dispatches 
between worker thread(s) and main thread too often per request (partly 
addressed already [2], but still WIP), which adds up to overall latency. But 
Vivek was actually using a 9p patch here which disabled coroutines entirely, 
which suggests that the virtio transmission size limit still represents a 
bottleneck.

[1] https://wiki.qemu.org/Documentation/9p#Coroutines
[2] https://wiki.qemu.org/Documentation/9p#Implementation_Plans

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 10, 2021, 10:05 a.m. UTC | #20
On Tue, Nov 09, 2021 at 02:09:59PM +0100, Christian Schoenebeck wrote:
> On Dienstag, 9. November 2021 11:56:35 CET Stefan Hajnoczi wrote:
> > On Thu, Nov 04, 2021 at 03:41:23PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 3. November 2021 12:33:33 CET Stefan Hajnoczi wrote:
> > > > On Mon, Nov 01, 2021 at 09:29:26PM +0100, Christian Schoenebeck wrote:
> > > > > On Donnerstag, 28. Oktober 2021 11:00:48 CET Stefan Hajnoczi wrote:
> > > > > > On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck 
> wrote:
> > > > > > > On Montag, 25. Oktober 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > > > > > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck 
> wrote:
> > > > > > > > > On Freitag, 8. Oktober 2021 18:08:48 CEST Christian 
> Schoenebeck wrote:
> > > > > > > > > > On Freitag, 8. Oktober 2021 16:24:42 CEST Christian 
> Schoenebeck wrote:
> > > > > > > > > > > On Freitag, 8. Oktober 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > > > > > > 
> > > > > > > > > > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian 
> Schoenebeck wrote:
> > > > > > > > > > > > > > On Donnerstag, 7. Oktober 2021 07:23:59 CEST Stefan 
> Hajnoczi wrote:
> > > > > > > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200,
> > > > > > > > > > > > > > > Christian
> > > > > > > > > > > > > > > Schoenebeck
> > > > > > > > > > 
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > At the moment the maximum transfer size with
> > > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > limited
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > 4M
> > > > > > > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this
> > > > > > > > > > > > > > > > limit to
> > > > > > > > > > > > > > > > its
> > > > > > > > > > > > > > > > maximum
> > > > > > > > > > > > > > > > theoretical possible transfer size of 128M (32k
> > > > > > > > > > > > > > > > pages)
> > > > > > > > > > > > > > > > according
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > virtio specs:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/c
> > > > > > > > > > > > > > > > s01/
> > > > > > > > > > > > > > > > virt
> > > > > > > > > > > > > > > > io-v
> > > > > > > > > > > > > > > > 1.1-
> > > > > > > > > > > > > > > > cs
> > > > > > > > > > > > > > > > 01
> > > > > > > > > > > > > > > > .html#
> > > > > > > > > > > > > > > > x1-240006
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi Christian,
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > I took a quick look at the code:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks Stefan for sharing virtio expertise and helping
> > > > > > > > > > > > Christian
> > > > > > > > > > > > !
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > - The Linux 9p driver restricts descriptor chains
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > 128
> > > > > > > > > > > > > > > elements
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Yes, that's the limitation that I am about to remove
> > > > > > > > > > > > > > (WIP);
> > > > > > > > > > > > > > current
> > > > > > > > > > > > > > kernel
> > > > > > > > > > > > > > patches:
> > > > > > > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.
> > > > > > > > > > > > > > linu
> > > > > > > > > > > > > > x_os
> > > > > > > > > > > > > > s@cr
> > > > > > > > > > > > > > udeb
> > > > > > > > > > > > > > yt
> > > > > > > > > > > > > > e.
> > > > > > > > > > > > > > com/>
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I haven't read the patches yet but I'm concerned that
> > > > > > > > > > > > > today
> > > > > > > > > > > > > the
> > > > > > > > > > > > > driver
> > > > > > > > > > > > > is pretty well-behaved and this new patch series
> > > > > > > > > > > > > introduces a
> > > > > > > > > > > > > spec
> > > > > > > > > > > > > violation. Not fixing existing spec violations is
> > > > > > > > > > > > > okay,
> > > > > > > > > > > > > but
> > > > > > > > > > > > > adding
> > > > > > > > > > > > > new
> > > > > > > > > > > > > ones is a red flag. I think we need to figure out a
> > > > > > > > > > > > > clean
> > > > > > > > > > > > > solution.
> > > > > > > > > > > 
> > > > > > > > > > > Nobody has reviewed the kernel patches yet. My main
> > > > > > > > > > > concern
> > > > > > > > > > > therefore
> > > > > > > > > > > actually is that the kernel patches are already too
> > > > > > > > > > > complex,
> > > > > > > > > > > because
> > > > > > > > > > > the
> > > > > > > > > > > current situation is that only Dominique is handling 9p
> > > > > > > > > > > patches on
> > > > > > > > > > > kernel
> > > > > > > > > > > side, and he barely has time for 9p anymore.
> > > > > > > > > > > 
> > > > > > > > > > > Another reason for me to catch up on reading current
> > > > > > > > > > > kernel
> > > > > > > > > > > code
> > > > > > > > > > > and
> > > > > > > > > > > stepping in as reviewer of 9p on kernel side ASAP,
> > > > > > > > > > > independent
> > > > > > > > > > > of
> > > > > > > > > > > this
> > > > > > > > > > > issue.
> > > > > > > > > > > 
> > > > > > > > > > > As for current kernel patches' complexity: I can certainly
> > > > > > > > > > > drop
> > > > > > > > > > > patch
> > > > > > > > > > > 7
> > > > > > > > > > > entirely as it is probably just overkill. Patch 4 is then
> > > > > > > > > > > the
> > > > > > > > > > > biggest
> > > > > > > > > > > chunk, I have to see if I can simplify it, and whether it
> > > > > > > > > > > would
> > > > > > > > > > > make
> > > > > > > > > > > sense to squash with patch 3.
> > > > > > > > > > > 
> > > > > > > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to
> > > > > > > > > > > > > > > preadv(2)
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > will
> > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   with EINVAL when called with more than IOV_MAX
> > > > > > > > > > > > > > >   iovecs
> > > > > > > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Hmm, which makes me wonder why I never encountered
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > error
> > > > > > > > > > > > > > during
> > > > > > > > > > > > > > testing.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Most people will use the 9p qemu 'local' fs driver
> > > > > > > > > > > > > > backend
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > practice,
> > > > > > > > > > > > > > so
> > > > > > > > > > > > > > that v9fs_read() call would translate for most
> > > > > > > > > > > > > > people to
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > static ssize_t local_preadv(FsContext *ctx,
> > > > > > > > > > > > > > V9fsFidOpenState
> > > > > > > > > > > > > > *fs,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >                             const struct iovec *iov,
> > > > > > > > > > > > > >                             int iovcnt, off_t
> > > > > > > > > > > > > >                             offset)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > > #ifdef CONFIG_PREADV
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     return preadv(fs->fd, iov, iovcnt, offset);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > #else
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     int err = lseek(fs->fd, offset, SEEK_SET);
> > > > > > > > > > > > > >     if (err == -1) {
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >         return err;
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >     } else {
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >         return readv(fs->fd, iov, iovcnt);
> > > > > > > > > > > > > >     
> > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > #endif
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Unless I misunderstood the code, neither side can
> > > > > > > > > > > > > > > take
> > > > > > > > > > > > > > > advantage
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > new 32k descriptor chain limit?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I need to check that when I have some more time. One
> > > > > > > > > > > > > > possible
> > > > > > > > > > > > > > explanation
> > > > > > > > > > > > > > might be that preadv() already has this wrapped into
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > loop
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > its
> > > > > > > > > > > > > > implementation to circumvent a limit like IOV_MAX.
> > > > > > > > > > > > > > It
> > > > > > > > > > > > > > might
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > another
> > > > > > > > > > > > > > "it
> > > > > > > > > > > > > > works, but not portable" issue, but not sure.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > There are still a bunch of other issues I have to
> > > > > > > > > > > > > > resolve.
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > look
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > > net/9p/client.c on kernel side, you'll notice that
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > basically
> > > > > > > > > > > > > > does
> > > > > > > > > > > > > > this ATM> >
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >     kmalloc(msize);
> > > > > > > > > > > > 
> > > > > > > > > > > > Note that this is done twice : once for the T message
> > > > > > > > > > > > (client
> > > > > > > > > > > > request)
> > > > > > > > > > > > and
> > > > > > > > > > > > once for the R message (server answer). The 9p driver
> > > > > > > > > > > > could
> > > > > > > > > > > > adjust
> > > > > > > > > > > > the
> > > > > > > > > > > > size
> > > > > > > > > > > > of the T message to what's really needed instead of
> > > > > > > > > > > > allocating
> > > > > > > > > > > > the
> > > > > > > > > > > > full
> > > > > > > > > > > > msize. R message size is not known though.
> > > > > > > > > > > 
> > > > > > > > > > > Would it make sense adding a second virtio ring, dedicated
> > > > > > > > > > > to
> > > > > > > > > > > server
> > > > > > > > > > > responses to solve this? IIRC 9p server already calculates
> > > > > > > > > > > appropriate
> > > > > > > > > > > exact sizes for each response type. So server could just
> > > > > > > > > > > push
> > > > > > > > > > > space
> > > > > > > > > > > that's
> > > > > > > > > > > really needed for its responses.
> > > > > > > > > > > 
> > > > > > > > > > > > > > for every 9p request. So not only does it allocate much more memory
> > > > > > > > > > > > > > for every request than actually required (i.e. say 9pfs was mounted
> > > > > > > > > > > > > > with msize=8M, then a 9p request that actually would just need 1k
> > > > > > > > > > > > > > would nevertheless allocate 8M), but also it allocates > PAGE_SIZE,
> > > > > > > > > > > > > > which obviously may fail at any time.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The PAGE_SIZE limitation sounds like a kmalloc() vs vmalloc() situation.
> > > > > > > > > > > 
> > > > > > > > > > > Hu, I didn't even consider vmalloc(). I just tried the kvmalloc()
> > > > > > > > > > > wrapper as a quick & dirty test, but it crashed in the same way as
> > > > > > > > > > > kmalloc() with large msize values immediately on mounting:
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/net/9p/client.c b/net/9p/client.c
> > > > > > > > > > > index a75034fa249b..cfe300a4b6ca 100644
> > > > > > > > > > > --- a/net/9p/client.c
> > > > > > > > > > > +++ b/net/9p/client.c
> > > > > > > > > > > @@ -227,15 +227,18 @@ static int parse_opts(char *opts, struct p9_client *clnt)
> > > > > > > > > > >  static int p9_fcall_init(struct p9_client *c, struct p9_fcall *fc,
> > > > > > > > > > >                           int alloc_msize)
> > > > > > > > > > >  {
> > > > > > > > > > > -       if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > > > > > +       //if (likely(c->fcall_cache) && alloc_msize == c->msize) {
> > > > > > > > > > > +       if (false) {
> > > > > > > > > > >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > > > > > > > > > >                 fc->cache = c->fcall_cache;
> > > > > > > > > > >         } else {
> > > > > > > > > > > -               fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > > > > +               fc->sdata = kvmalloc(alloc_msize, GFP_NOFS);
> > > > > > > > > > 
> > > > > > > > > > Ok, GFP_NOFS -> GFP_KERNEL did the trick.
> > > > > > > > > > 
> > > > > > > > > > Now I get:
> > > > > > > > > >    virtio: bogus descriptor or out of resources
> > > > > > > > > > 
> > > > > > > > > > So, still some work ahead on both ends.
> > > > > > > > > 
> > > > > > > > > Few hacks later (only changes on 9p client side) I got this running
> > > > > > > > > stable now. The reason for the virtio error above was that kvmalloc()
> > > > > > > > > returns a non-logical kernel address for any kvmalloc(>4M), i.e. an
> > > > > > > > > address that is inaccessible from host side, hence that "bogus
> > > > > > > > > descriptor" message by QEMU. So I had to split those linear 9p client
> > > > > > > > > buffers into sparse ones (set of individual pages).
> > > > > > > > > 
> > > > > > > > > I tested this for some days with various virtio transmission sizes and
> > > > > > > > > it works as expected up to 128 MB (more precisely: 128 MB read space +
> > > > > > > > > 128 MB write space per virtio round trip message).
> > > > > > > > > 
> > > > > > > > > I did not encounter a show stopper for large virtio transmission sizes
> > > > > > > > > (4 MB ... 128 MB) on virtio level, neither as a result of testing, nor
> > > > > > > > > after reviewing the existing code.
> > > > > > > > > 
> > > > > > > > > About IOV_MAX: that's apparently not an issue on virtio level. Most of
> > > > > > > > > the iovec code, both on Linux kernel side and on QEMU side, does not
> > > > > > > > > have this limitation. It is apparently however indeed a limitation for
> > > > > > > > > userland apps calling the Linux kernel's syscalls yet.
> > > > > > > > > 
> > > > > > > > > Stefan, as it stands now, I am even more convinced that the upper
> > > > > > > > > virtio transmission size limit should not be squeezed into the queue
> > > > > > > > > size argument of virtio_add_queue(). Not because of the previous
> > > > > > > > > argument that it would waste space (~1MB), but rather because they are
> > > > > > > > > two different things. To outline this, just a quick recap of what
> > > > > > > > > happens exactly when a bulk message is pushed over the virtio wire
> > > > > > > > > (assuming virtio "split" layout here):
> > > > > > > > > 
> > > > > > > > > ---------- [recap-start] ----------
> > > > > > > > > 
> > > > > > > > > For each bulk message sent guest <-> host, exactly *one* of the
> > > > > > > > > pre-allocated descriptors is taken and placed (subsequently) into
> > > > > > > > > exactly *one* position of the two available/used ring buffers. The
> > > > > > > > > actual descriptor table though, containing all the DMA addresses of the
> > > > > > > > > message bulk data, is allocated just in time for each round trip
> > > > > > > > > message. Say it is the first message sent, it yields the following
> > > > > > > > > structure:
> > > > > > > > > 
> > > > > > > > > Ring Buffer   Descriptor Table      Bulk Data Pages
> > > > > > > > > 
> > > > > > > > >    +-+              +-+           +-----------------+
> > > > > > > > >    
> > > > > > > > >    |D|------------->|d|---------->| Bulk data block |
> > > > > > > > >    
> > > > > > > > >    +-+              |d|--------+  +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |              |d|------+ |
> > > > > > > > >    
> > > > > > > > >    +-+               .       | |  +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |               .       | +->| Bulk data block |
> > > > > > > > >     
> > > > > > > > >     .                .       |    +-----------------+
> > > > > > > > >     .               |d|-+    |
> > > > > > > > >     .               +-+ |    |    +-----------------+
> > > > > > > > >     
> > > > > > > > >    | |                  |    +--->| Bulk data block |
> > > > > > > > >    
> > > > > > > > >    +-+                  |         +-----------------+
> > > > > > > > >    
> > > > > > > > >    | |                  |                 .
> > > > > > > > >    
> > > > > > > > >    +-+                  |                 .
> > > > > > > > >    
> > > > > > > > >                         |                 .
> > > > > > > > >                         |         
> > > > > > > > >                         |         +-----------------+
> > > > > > > > >                         
> > > > > > > > >                         +-------->| Bulk data block |
> > > > > > > > >                         
> > > > > > > > >                                   +-----------------+
> > > > > > > > > 
> > > > > > > > > Legend:
> > > > > > > > > D: pre-allocated descriptor
> > > > > > > > > d: just in time allocated descriptor
> > > > > > > > > -->: memory pointer (DMA)
> > > > > > > > > 
> > > > > > > > > The bulk data blocks are allocated by the respective device driver
> > > > > > > > > above virtio subsystem level (guest side).
> > > > > > > > > 
> > > > > > > > > There are exactly as many descriptors pre-allocated (D) as the size of
> > > > > > > > > a ring buffer.
> > > > > > > > > 
> > > > > > > > > A "descriptor" is more or less just a chainable DMA memory pointer;
> > > > > > > > > defined as:
> > > > > > > > > 
> > > > > > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> > > > > > > > > struct vring_desc {
> > > > > > > > > 	/* Address (guest-physical). */
> > > > > > > > > 	__virtio64 addr;
> > > > > > > > > 	/* Length. */
> > > > > > > > > 	__virtio32 len;
> > > > > > > > > 	/* The flags as indicated above. */
> > > > > > > > > 	__virtio16 flags;
> > > > > > > > > 	/* We chain unused descriptors via this, too */
> > > > > > > > > 	__virtio16 next;
> > > > > > > > > 
> > > > > > > > > };
> > > > > > > > > 
> > > > > > > > > There are 2 ring buffers; the "available" ring buffer is for sending a
> > > > > > > > > message guest->host (which will transmit DMA addresses of guest
> > > > > > > > > allocated bulk data blocks that are used for data sent to device, and
> > > > > > > > > separate guest allocated bulk data blocks that will be used by host
> > > > > > > > > side to place its response bulk data), and the "used" ring buffer is
> > > > > > > > > for sending host->guest to let guest know about host's response and
> > > > > > > > > that it could now safely consume and then deallocate the bulk data
> > > > > > > > > blocks subsequently.
> > > > > > > > > ---------- [recap-end] ----------
> > > > > > > > > 
> > > > > > > > > So the "queue size" actually defines the ringbuffer size. It does not
> > > > > > > > > define the maximum amount of descriptors. The "queue size" rather
> > > > > > > > > defines how many pending messages can be pushed into either one
> > > > > > > > > ringbuffer before the other side would need to wait until the counter
> > > > > > > > > side would step up (i.e. ring buffer full).
> > > > > > > > > 
> > > > > > > > > The maximum amount of descriptors (what VIRTQUEUE_MAX_SIZE actually is)
> > > > > > > > > OTOH defines the max. bulk data size that could be transmitted with
> > > > > > > > > each virtio round trip message.
> > > > > > > > > 
> > > > > > > > > And in fact, 9p currently handles the virtio "queue size" as directly
> > > > > > > > > associative with its maximum amount of active 9p requests the server
> > > > > > > > > could handle simultaneously:
> > > > > > > > > 
> > > > > > > > >   hw/9pfs/9p.h:#define MAX_REQ         128
> > > > > > > > >   hw/9pfs/9p.h:    V9fsPDU pdus[MAX_REQ];
> > > > > > > > >   hw/9pfs/virtio-9p-device.c:    v->vq = virtio_add_queue(vdev, MAX_REQ,
> > > > > > > > >                                  handle_9p_output);
> > > > > > > > > 
> > > > > > > > > So if I would change it like this, just for the purpose to increase
> > > > > > > > > the max. virtio transmission size:
> > > > > > > > > 
> > > > > > > > > --- a/hw/9pfs/virtio-9p-device.c
> > > > > > > > > +++ b/hw/9pfs/virtio-9p-device.c
> > > > > > > > > @@ -218,7 +218,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
> > > > > > > > >      v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> > > > > > > > >      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > > > > > > >                  VIRTQUEUE_MAX_SIZE);
> > > > > > > > > -    v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > > > > > > > > +    v->vq = virtio_add_queue(vdev, 32*1024, handle_9p_output);
> > > > > > > > >  }
> > > > > > > > > 
> > > > > > > > > Then it would require additional synchronization code on both ends and
> > > > > > > > > therefore unnecessary complexity, because it would now be possible that
> > > > > > > > > more requests are pushed into the ringbuffer than server could handle.
> > > > > > > > > 
> > > > > > > > > There is one potential issue though that probably did justify the
> > > > > > > > > "don't exceed the queue size" rule:
> > > > > > > > > 
> > > > > > > > > ATM the descriptor table is allocated (just in time) as *one*
> > > > > > > > > continuous buffer via kmalloc_array():
> > > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L440
> > > > > > > > > 
> > > > > > > > > So assuming transmission size of 2 * 128 MB that kmalloc_array() call
> > > > > > > > > would yield in kmalloc(1M) and the latter might fail if guest had
> > > > > > > > > highly fragmented physical memory. For such kind of error case there is
> > > > > > > > > currently a fallback path in virtqueue_add_split() that would then use
> > > > > > > > > the required amount of pre-allocated descriptors instead:
> > > > > > > > > https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/drivers/virtio/virtio_ring.c#L525
> > > > > > > > > 
> > > > > > > > > That fallback recovery path would no longer be viable if the queue size
> > > > > > > > > was exceeded. There would be alternatives though, e.g. by allowing to
> > > > > > > > > chain indirect descriptor tables (currently prohibited by the virtio
> > > > > > > > > specs).
> > > > > > > > 
> > > > > > > > Making the maximum number of descriptors independent of the queue size
> > > > > > > > requires a change to the VIRTIO spec since the two values are currently
> > > > > > > > explicitly tied together by the spec.
> > > > > > > 
> > > > > > > Yes, that's what the virtio specs say. But they don't say why, nor did I
> > > > > > > hear a reason in this discussion.
> > > > > > > 
> > > > > > > That's why I invested time reviewing current virtio implementation and
> > > > > > > specs, as well as actually testing exceeding that limit. And as I
> > > > > > > outlined in detail in my previous email, I only found one theoretical
> > > > > > > issue that could be addressed though.
> > > > > > 
> > > > > > I agree that there is a limitation in the VIRTIO spec, but violating the
> > > > > > spec isn't an acceptable solution:
> > > > > > 
> > > > > > 1. QEMU and Linux aren't the only components that implement VIRTIO. You
> > > > > >    cannot make assumptions about their implementations because it may
> > > > > >    break spec-compliant implementations that you haven't looked at.
> > > > > > 
> > > > > >    Your patches weren't able to increase Queue Size because some device
> > > > > >    implementations break when descriptor chains are too long. This shows
> > > > > >    there is a practical issue even in QEMU.
> > > > > > 
> > > > > > 2. The specific spec violation that we discussed creates the problem
> > > > > >    that drivers can no longer determine the maximum descriptor chain
> > > > > >    length. This in turn will lead to more implementation-specific
> > > > > >    assumptions being baked into drivers and cause problems with
> > > > > >    interoperability and future changes.
> > > > > > 
> > > > > > The spec needs to be extended instead. I included an idea for how to do
> > > > > > that below.
> > > > > 
> > > > > Sure, I just wanted to see if there was a non-neglectable "hard" show
> > > > > stopper per se that I probably haven't seen yet. I have not questioned
> > > > > aiming a clean solution.
> > > > > 
> > > > > Thanks for the clarification!
> > > > > 
> > > > > > > > Before doing that, are there benchmark results showing that 1 MB vs
> > > > > > > > 128 MB produces a performance improvement? I'm asking because if
> > > > > > > > performance with 1 MB is good then you can probably do that without
> > > > > > > > having to change VIRTIO and also because it's counter-intuitive that 9p
> > > > > > > > needs 128 MB for good performance when it's ultimately implemented on
> > > > > > > > top of disk and network I/O that have lower size limits.
> > > > > > > 
> > > > > > > First some numbers, linear reading a 12 GB file:
> > > > > > > 
> > > > > > > msize    average      notes
> > > > > > > 
> > > > > > > 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> > > > > > > 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> > > > > > > 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> > > > > > > 1 MB     2551 MB/s    this msize would already violate virtio specs
> > > > > > > 2 MB     2521 MB/s    this msize would already violate virtio specs
> > > > > > > 4 MB     2628 MB/s    planned max. msize of my current kernel patches [1]
> > > > > > 
> > > > > > How many descriptors are used? 4 MB can be covered by a single
> > > > > > descriptor if the data is physically contiguous in memory, so this data
> > > > > > doesn't demonstrate a need for more descriptors.
> > > > > 
> > > > > No, in the last couple years there was apparently no kernel version that
> > > > > used just one descriptor, nor did my benchmarked version. Even though the
> > > > > Linux 9p client uses (yet) simple linear buffers (contiguous physical
> > > > > memory) on 9p client level, these are however split into PAGE_SIZE chunks
> > > > > by function pack_sg_list() [1] before being fed to virtio level:
> > > > > 
> > > > > static unsigned int rest_of_page(void *data)
> > > > > {
> > > > > 
> > > > > 	return PAGE_SIZE - offset_in_page(data);
> > > > > 
> > > > > }
> > > > > ...
> > > > > static int pack_sg_list(struct scatterlist *sg, int start,
> > > > > 
> > > > > 			int limit, char *data, int count)
> > > > > 
> > > > > {
> > > > > 
> > > > > 	int s;
> > > > > 	int index = start;
> > > > > 	
> > > > > 	while (count) {
> > > > > 	
> > > > > 		s = rest_of_page(data);
> > > > > 		...
> > > > > 		sg_set_buf(&sg[index++], data, s);
> > > > > 		count -= s;
> > > > > 		data += s;
> > > > > 	
> > > > > 	}
> > > > > 	...
> > > > > 
> > > > > }
> > > > > 
> > > > > [1]
> > > > > https://github.com/torvalds/linux/blob/19901165d90fdca1e57c9baa0d5b4c63d15c476a/net/9p/trans_virtio.c#L171
> > > > > 
> > > > > So when sending 4MB over virtio wire, it would yield in 1k descriptors
> > > > > ATM.
> > > > > 
> > > > > I have wondered about this before, but did not question it, because due
> > > > > to the cross-platform nature I couldn't say for certain whether that's
> > > > > probably needed somewhere. I mean for the case virtio-PCI I know for sure
> > > > > that one descriptor (i.e. >PAGE_SIZE) would be fine, but I don't know if
> > > > > that applies to all buses and architectures.
> > > > 
> > > > VIRTIO does not limit the descriptor len field to PAGE_SIZE, so I don't
> > > > think there is a limit at the VIRTIO level.
> > > 
> > > So you are viewing this purely from virtio specs PoV: in the sense, if it
> > > is not prohibited by the virtio specs, then it should work. Maybe.
> > 
> > Limitations must be specified either in the 9P protocol or the VIRTIO
> > specification. Drivers and devices will not be able to operate correctly
> > if there are limitations that aren't covered by the specs.
> > 
> > Do you have something in mind that isn't covered by the specs?
> 
> Not sure whether that's something that should be specified by the virtio 
> specs, probably not. I simply do not know if there was any bus or architecture 
> that would have a limitation on the max. size of a memory block passed per one 
> DMA address.

Host-side limitations like that can exist. For example when a physical
storage device on the host has limits that the VIRTIO device does not
have. In this case both virtio-scsi and virtio-blk report those limits
to the guest so that the guest won't submit requests that the physical
device would reject. I guess networking MTU is kind of similar too. What
they have in common is that the limit needs to be reported to the guest,
typically using a VIRTIO Configuration Space field. It is an explicit
limit that is part of the host<->guest interface (VIRTIO spec, SCSI,
etc).
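
To give a concrete example of how such an explicit limit is exposed, here is
a simplified excerpt of virtio-blk's configuration space, trimmed to the
relevant fields (field names as in the Linux uapi header); the guest driver
clamps its requests to these values:

struct virtio_blk_config {
        /* The capacity (in 512-byte sectors). */
        __virtio64 capacity;
        /* The maximum segment size (if VIRTIO_BLK_F_SIZE_MAX) */
        __virtio32 size_max;
        /* The maximum number of segments (if VIRTIO_BLK_F_SEG_MAX) */
        __virtio32 seg_max;
        /* ... */
};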

> > > > If this function coalesces adjacent pages then the descriptor chain
> > > > length issues could be reduced.
> > > > 
> > > > > > > But again, this is not just about performance. My conclusion as
> > > > > > > described in my previous email is that virtio currently squeezes
> > > > > > > 
> > > > > > > 	"max. simultaneous amount of bulk messages"
> > > > > > > 
> > > > > > > vs.
> > > > > > > 
> > > > > > > 	"max. bulk data transmission size per bulk message"
> > > > > > > 
> > > > > > > into the same configuration parameter, which is IMO inappropriate and
> > > > > > > hence splitting them into 2 separate parameters when creating a queue
> > > > > > > makes sense, independent of the performance benchmarks.
> > > > > > > 
> > > > > > > [1]
> > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > > > > > 
> > > > > > Some devices effectively already have this because the device advertises
> > > > > > a maximum number of descriptors via device-specific mechanisms like the
> > > > > > struct virtio_blk_config seg_max field. But today these fields can only
> > > > > > reduce the maximum descriptor chain length because the spec still limits
> > > > > > the length to Queue Size.
> > > > > > 
> > > > > > We can build on this approach to raise the length above Queue Size. This
> > > > > > approach has the advantage that the maximum number of segments isn't per
> > > > > > device or per virtqueue, it's fine-grained. If the device supports two
> > > > > > request types then different max descriptor chain limits could be given
> > > > > > for them by introducing two separate configuration space fields.
> > > > > > 
> > > > > > Here are the corresponding spec changes:
> > > > > > 
> > > > > > 1. A new feature bit called VIRTIO_RING_F_LARGE_INDIRECT_DESC is added
> > > > > >    to indicate that indirect descriptor table size and maximum
> > > > > >    descriptor chain length are not limited by Queue Size value. (Maybe
> > > > > >    there still needs to be a limit like 2^15?)
> > > > > 
> > > > > Sounds good to me!
> > > > > 
> > > > > AFAIK it is effectively limited to 2^16 because of vring_desc->next:
> > > > > 
> > > > > /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
> > > > > struct vring_desc {
> > > > > 
> > > > >         /* Address (guest-physical). */
> > > > >         __virtio64 addr;
> > > > >         /* Length. */
> > > > >         __virtio32 len;
> > > > >         /* The flags as indicated above. */
> > > > >         __virtio16 flags;
> > > > >         /* We chain unused descriptors via this, too */
> > > > >         __virtio16 next;
> > > > > 
> > > > > };
> > > > 
> > > > Yes, Split Virtqueues have a fundamental limit on indirect table size
> > > > due to the "next" field. Packed Virtqueue descriptors don't have a
> > > > "next" field so descriptor chains could be longer in theory (currently
> > > > forbidden by the spec).
> > > > 
> > > > > > One thing that's messy is that we've been discussing the maximum
> > > > > > descriptor chain length but 9p has the "msize" concept, which isn't
> > > > > > aware of contiguous memory. It may be necessary to extend the 9p
> > > > > > driver
> > > > > > code to size requests not just according to their length in bytes
> > > > > > but
> > > > > > also according to the descriptor chain length. That's how the Linux
> > > > > > block layer deals with queue limits (struct queue_limits
> > > > > > max_segments vs
> > > > > > max_hw_sectors).
> > > > > 
> > > > > Hmm, can't follow on that one. For what should that be needed in case
> > > > > of
> > > > > 9p? My plan was to limit msize by 9p client simply at session start to
> > > > > whatever is the max. amount virtio descriptors supported by host and
> > > > > using PAGE_SIZE as size per descriptor, because that's what 9p client
> > > > > actually does ATM (see above). So you think that should be changed to
> > > > > e.g. just one descriptor for 4MB, right?
> > > > 
> > > > Limiting msize to the 9p transport device's maximum number of
> > > > descriptors is conservative (i.e. 128 descriptors = 512 KB msize)
> > > > because it doesn't take advantage of contiguous memory. I suggest
> > > > leaving msize alone, adding a separate limit at which requests are split
> > > > according to the maximum descriptor chain length, and tweaking
> > > > pack_sg_list() to coalesce adjacent pages.
> > > > 
> > > > That way msize can be large without necessarily using lots of
> > > > descriptors (depending on the memory layout).
> > > 
> > > That was actually a tempting solution. Because it would not require changes
> > > to the virtio specs (at least for a while) and it would also work with
> > > older QEMU versions. And for that pack_sg_list() portion of the code it
> > > would work well and easily, as the buffer passed to pack_sg_list() is
> > > contiguous already.
> > > 
> > > However I just realized for the zero-copy version of the code that would be
> > > more tricky. The ZC version already uses individual pages (struct page,
> > > hence PAGE_SIZE each) which are pinned, i.e. it uses pack_sg_list_p() [1]
> > > in combination with p9_get_mapped_pages() [2]
> > > 
> > > [1] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L218
> > > [2] https://github.com/torvalds/linux/blob/7ddb58cb0ecae8e8b6181d736a87667cc9ab8389/net/9p/trans_virtio.c#L309
> > > 
> > > So that would require much more work and code trying to sort and coalesce
> > > individual pages to contiguous physical memory for the sake of reducing
> > > virtio descriptors. And there is no guarantee that this is even possible.
> > > The kernel may simply return a non-contiguous set of pages which would
> > > eventually end up exceeding the virtio descriptor limit again.
> > 
> > Order must be preserved so pages cannot be sorted by physical address.
> > How about simply coalescing when pages are adjacent?
> 
> It would help, but not solve the issue we are talking about here: if 99% of 
> the cases could successfully merge descriptors to stay below the descriptor 
> count limit, but in 1% of the cases it could not, then this still constitutes a 
> severe runtime issue that could trigger at any time.
> 
> > > So looks like it was probably still easier and realistic to just add
> > > virtio
> > > capabilities for now for allowing to exceed current descriptor limit.
> > 
> > I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> > fine under today's limits while virtio-9p needs a much higher limit to
> > achieve good performance. Maybe there is an issue in a layer above the
> > vring that's causing the virtio-9p performance you've observed?
> 
> Are you referring to (somewhat) recent benchmarks when saying those would all 
> still perform fine today?

I'm not referring to specific benchmark results. Just that none of those
devices needed to raise the descriptor chain length, so I'm surprised
that virtio-9p needs it because it's conceptually similar to these
devices.

> Vivek was running detailed benchmarks for virtiofs vs. 9p:
> https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html
> 
> For the virtio aspect discussed here, only the benchmark configurations 
> without cache are relevant (9p-none, vtfs-none) and under this aspect the 
> situation seems to be quite similar between 9p and virtio-fs. You'll also note 
> that once DAX is enabled (vtfs-none-dax) it apparently boosts virtio-fs 
> performance significantly, which however seems to correlate to numbers when I 
> am running 9p with msize > 300k. Note: Vivek was presumably running 9p 
> effectively with msize=300k, as this was the kernel limitation at that time.

Agreed, virtio-9p and virtiofs are similar without caching.

I think we shouldn't consider DAX here since it bypasses the virtqueue.

> To bring things into relation: there are known performance aspects in 9p that 
> can be improved, yes, both on Linux kernel side and on 9p server side in QEMU. 
> For instance 9p server uses coroutines [1] and currently dispatches between 
> worker thread(s) and main thread too often per request (partly addressed 
> already [2], but still WIP), which accumulates to overall latency. But Vivek 
> was actually using a 9p patch here which disabled coroutines entirely, which 
> suggests that the virtio transmission size limit still represents a 
> bottleneck.

These results were collected with 4k block size. Neither msize nor the
descriptor chain length limits will be stressed, so I don't think these
results are relevant here.

Maybe a more relevant comparison would be virtio-9p, virtiofs, and
virtio-blk when block size is large (e.g. 1M). The Linux block layer in
the guest will split virtio-blk requests when they exceed the block
queue limits.

Stefan

> 
> [1] https://wiki.qemu.org/Documentation/9p#Coroutines
> [2] https://wiki.qemu.org/Documentation/9p#Implementation_Plans
> 
> Best regards,
> Christian Schoenebeck
> 
>
Christian Schoenebeck Nov. 10, 2021, 1:14 p.m. UTC | #21
On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > So looks like it was probably still easier and realistic to just add
> > > > virtio
> > > > capabilities for now for allowing to exceed current descriptor limit.
> > > 
> > > I'm still not sure why virtio-net, virtio-blk, virtio-fs, etc perform
> > > fine under today's limits while virtio-9p needs a much higher limit to
> > > achieve good performance. Maybe there is an issue in a layer above the
> > > vring that's causing the virtio-9p performance you've observed?
> > 
> > Are you referring to (somewhat) recent benchmarks when saying those would
> > all still perform fine today?
> 
> I'm not referring to specific benchmark results. Just that none of those
> devices needed to raise the descriptor chain length, so I'm surprised
> that virtio-9p needs it because it's conceptually similar to these
> devices.

I would not say virtio-net and virtio-blk were comparable with virtio-9p and 
virtio-fs. virtio-9p and virtio-fs are fully fledged file servers which must 
perform various controller tasks before handling the actually requested I/O 
task, which inevitably adds latency to each request, whereas virtio-net and 
virtio-blk are just very thin layers that do not have that controller task 
overhead per request. And a network device only needs to handle very small 
messages in the first place.

> > Vivek was running detailed benchmarks for virtiofs vs. 9p:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-12/msg02704.html
> > 
> > For the virtio aspect discussed here, only the benchmark configurations
> > without cache are relevant (9p-none, vtfs-none) and under this aspect the
> > situation seems to be quite similar between 9p and virtio-fs. You'll also
> > note once DAX is enabled (vtfs-none-dax) that apparently boosts virtio-fs
> > performance significantly, which however seems to corelate to numbers
> > when I am running 9p with msize > 300k. Note: Vivek was presumably
> > running 9p effecively with msize=300k, as this was the kernel limitation
> > at that time.
> Agreed, virtio-9p and virtiofs are similar without caching.
> 
> I think we shouldn't consider DAX here since it bypasses the virtqueue.

DAX bypasses virtio, sure, but the performance boost you get with DAX actually 
shows the limiting factor with virtio pretty well.

> > To bring things into relation: there are known performance aspects in 9p
> > that can be improved, yes, both on Linux kernel side and on 9p server
> > side in QEMU. For instance 9p server uses coroutines [1] and currently
> > dispatches between worker thread(s) and main thread too often per request
> > (partly addressed already [2], but still WIP), which accumulates to
> > overall latency. But Vivek was actually using a 9p patch here which
> > disabled coroutines entirely, which suggests that the virtio transmission
> > size limit still represents a bottleneck.
> 
> These results were collected with 4k block size. Neither msize nor the
> descriptor chain length limits will be stressed, so I don't think these
> results are relevant here.
> 
> Maybe a more relevant comparison would be virtio-9p, virtiofs, and
> virtio-blk when block size is large (e.g. 1M). The Linux block layer in
> the guest will split virtio-blk requests when they exceed the block
> queue limits.

I am sorry, I cannot spend time on more benchmarks like that. For really 
making fair comparisons I would need to review all their code on both ends, 
adjust configuration/sources, etc.

I do think that I performed enough benchmarks and tests to show that 
increasing the transmission size can significantly improve performance with 
9p, and that allowing requests to exceed the queue size does make sense even 
for small transmission sizes (e.g. max. active requests on 9p server side vs. 
max. transmission size per request).

The reason for the performance gain is the minimum latency involved per 
request, and like I said, that can be improved to a certain extent with 9p, 
but that will take a long time and it could not be eliminated entirely.

As you are apparently reluctant to change the virtio specs, what about 
introducing those discussed virtio capabilities either as experimental ones 
without specs changes, or even just as 9p specific device capabilities for 
now? I mean those could be revoked on both sides at any time anyway.

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 10, 2021, 3:14 p.m. UTC | #22
On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> As you are apparently reluctant for changing the virtio specs, what about 
> introducing those discussed virtio capabalities either as experimental ones 
> without specs changes, or even just as 9p specific device capabilities for 
> now. I mean those could be revoked on both sides at any time anyway.

I would like to understand the root cause before making changes.

"It's faster when I do X" is useful information but it doesn't
necessarily mean doing X is the solution. The "it's faster when I do X
because Y" part is missing in my mind. Once there is evidence that shows
Y then it will be clearer if X is a good solution, if there's a more
general solution, or if it was just a side-effect.

I'm sorry for frustrating your efforts here. We have discussed a lot of
different ideas and maybe our perspectives are not that far apart
anymore.

Keep going with what you think is best. If I am still skeptical we can
ask someone else to review the patches instead of me so you have a
second opinion.

Stefan
Christian Schoenebeck Nov. 10, 2021, 3:53 p.m. UTC | #23
On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > As you are apparently reluctant for changing the virtio specs, what about
> > introducing those discussed virtio capabalities either as experimental
> > ones
> > without specs changes, or even just as 9p specific device capabilities for
> > now. I mean those could be revoked on both sides at any time anyway.
> 
> I would like to understand the root cause before making changes.
> 
> "It's faster when I do X" is useful information but it doesn't
> necessarily mean doing X is the solution. The "it's faster when I do X
> because Y" part is missing in my mind. Once there is evidence that shows
> Y then it will be clearer if X is a good solution, if there's a more
> general solution, or if it was just a side-effect.

I think I made it clear that the root cause of the observed performance gain 
with rising transmission size is latency (and also that performance is not the 
only reason for addressing this queue size issue).

Each request roundtrip has a certain minimum latency, the virtio ring alone 
has its latency, plus latency of the controller portion of the file server 
(e.g. permissions, sandbox checks, file IDs) that is executed with *every* 
request, plus latency of dispatching the request handling between threads 
several times back and forth (also for each request).

Therefore when you split a large payload (e.g. reading a large file) into n 
smaller chunks, that individual latency per request accumulates to n times 
the individual latency, eventually leading to degraded transmission speed as 
those requests are serialized.
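
Just to illustrate the magnitude with made-up but plausible numbers: assume a
fixed per-request overhead L and a 12 GB sequential read split into
n = 12 GB / msize chunks, so the accumulated overhead alone is n * L:

    msize = 512 kB  ->  n = 24576  ->  24576 * 0.2 ms  ~  4.9 s
    msize = 4 MB    ->  n = 3072   ->   3072 * 0.2 ms  ~  0.6 s

(L = 0.2 ms is just an assumed value here; the principle is what matters.)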

> I'm sorry for frustrating your efforts here. We have discussed a lot of
> different ideas and maybe our perspectives are not that far apart
> anymore.
> 
> Keep going with what you think is best. If I am still skeptical we can
> ask someone else to review the patches instead of me so you have a
> second opinion.
> 
> Stefan

Thanks Stefan!

In the meantime I try to address your objections as far as I can. If there is 
more I can do (with reasonable effort) to resolve your doubts, just let me 
know.

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 11, 2021, 4:31 p.m. UTC | #24
On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > As you are apparently reluctant for changing the virtio specs, what about
> > > introducing those discussed virtio capabalities either as experimental
> > > ones
> > > without specs changes, or even just as 9p specific device capabilities for
> > > now. I mean those could be revoked on both sides at any time anyway.
> > 
> > I would like to understand the root cause before making changes.
> > 
> > "It's faster when I do X" is useful information but it doesn't
> > necessarily mean doing X is the solution. The "it's faster when I do X
> > because Y" part is missing in my mind. Once there is evidence that shows
> > Y then it will be clearer if X is a good solution, if there's a more
> > general solution, or if it was just a side-effect.
> 
> I think I made it clear that the root cause of the observed performance gain 
> with rising transmission size is latency (and also that performance is not the 
> only reason for addressing this queue size issue).
> 
> Each request roundtrip has a certain minimum latency, the virtio ring alone 
> has its latency, plus latency of the controller portion of the file server 
> (e.g. permissions, sandbox checks, file IDs) that is executed with *every* 
> request, plus latency of dispatching the request handling between threads 
> several times back and forth (also for each request).
> 
> Therefore when you split large payloads (e.g. reading a large file) into 
> smaller n amount of chunks, then that individual latency per request 
> accumulates to n times the individual latency, eventually leading to degraded 
> transmission speed as those requests are serialized.

It's easy to increase the blocksize in benchmarks, but real applications
offer less control over the I/O pattern. If latency in the device
implementation (QEMU) is the root cause then reduce the latency to speed
up all applications, even those that cannot send huge requests.

One idea is request merging on the QEMU side. If the application sends
10 sequential read or write requests, coalesce them together before the
main part of request processing begins in the device. Process a single
large request to spread the cost of the file server over the 10
requests. (virtio-blk has request merging to help with the cost of lots
of small qcow2 I/O requests.) The cool thing about this is that the
guest does not need to change its I/O pattern to benefit from the
optimization.

Stefan
Christian Schoenebeck Nov. 11, 2021, 5:54 p.m. UTC | #25
On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > As you are apparently reluctant for changing the virtio specs, what
> > > > about
> > > > introducing those discussed virtio capabalities either as experimental
> > > > ones
> > > > without specs changes, or even just as 9p specific device capabilities
> > > > for
> > > > now. I mean those could be revoked on both sides at any time anyway.
> > > 
> > > I would like to understand the root cause before making changes.
> > > 
> > > "It's faster when I do X" is useful information but it doesn't
> > > necessarily mean doing X is the solution. The "it's faster when I do X
> > > because Y" part is missing in my mind. Once there is evidence that shows
> > > Y then it will be clearer if X is a good solution, if there's a more
> > > general solution, or if it was just a side-effect.
> > 
> > I think I made it clear that the root cause of the observed performance
> > gain with rising transmission size is latency (and also that performance
> > is not the only reason for addressing this queue size issue).
> > 
> > Each request roundtrip has a certain minimum latency, the virtio ring
> > alone
> > has its latency, plus latency of the controller portion of the file server
> > (e.g. permissions, sandbox checks, file IDs) that is executed with *every*
> > request, plus latency of dispatching the request handling between threads
> > several times back and forth (also for each request).
> > 
> > Therefore when you split large payloads (e.g. reading a large file) into
> > smaller n amount of chunks, then that individual latency per request
> > accumulates to n times the individual latency, eventually leading to
> > degraded transmission speed as those requests are serialized.
> 
> It's easy to increase the blocksize in benchmarks, but real applications
> offer less control over the I/O pattern. If latency in the device
> implementation (QEMU) is the root cause then reduce the latency to speed
> up all applications, even those that cannot send huge requests.

Which I did, still do, and also mentioned before, e.g.:

8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency optimization

Reducing overall latency is a process that is ongoing and will still take a 
very long development time. Not because of me, but because of lack of 
reviewers. And even then, it does not make the effort to support higher 
transmission sizes obsolete.

> One idea is request merging on the QEMU side. If the application sends
> 10 sequential read or write requests, coalesce them together before the
> main part of request processing begins in the device. Process a single
> large request to spread the cost of the file server over the 10
> requests. (virtio-blk has request merging to help with the cost of lots
> of small qcow2 I/O requests.) The cool thing about this is that the
> guest does not need to change its I/O pattern to benefit from the
> optimization.
> 
> Stefan

Ok, don't get me wrong: I appreciate that you are suggesting approaches that 
could improve things. But I could already hand you over a huge list of mine. 
The limiting factor here is not the lack of ideas of what could be improved, 
but rather the lack of people helping out actively on 9p side:
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html

The situation on kernel side is the same. I already have a huge list of what 
could & should be improved. But there is basically no reviewer for 9p patches 
on Linux kernel side either.

As much as I appreciate suggestions of what could be improved, I would 
appreciate it even more if there was *anybody* actively assisting as well. In 
the meantime I have to work the list down in small patch chunks, priority based.

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 15, 2021, 11:54 a.m. UTC | #26
On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote:
> On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > > As you are apparently reluctant for changing the virtio specs, what
> > > > > about
> > > > > introducing those discussed virtio capabalities either as experimental
> > > > > ones
> > > > > without specs changes, or even just as 9p specific device capabilities
> > > > > for
> > > > > now. I mean those could be revoked on both sides at any time anyway.
> > > > 
> > > > I would like to understand the root cause before making changes.
> > > > 
> > > > "It's faster when I do X" is useful information but it doesn't
> > > > necessarily mean doing X is the solution. The "it's faster when I do X
> > > > because Y" part is missing in my mind. Once there is evidence that shows
> > > > Y then it will be clearer if X is a good solution, if there's a more
> > > > general solution, or if it was just a side-effect.
> > > 
> > > I think I made it clear that the root cause of the observed performance
> > > gain with rising transmission size is latency (and also that performance
> > > is not the only reason for addressing this queue size issue).
> > > 
> > > Each request roundtrip has a certain minimum latency, the virtio ring
> > > alone
> > > has its latency, plus latency of the controller portion of the file server
> > > (e.g. permissions, sandbox checks, file IDs) that is executed with *every*
> > > request, plus latency of dispatching the request handling between threads
> > > several times back and forth (also for each request).
> > > 
> > > Therefore when you split large payloads (e.g. reading a large file) into
> > > smaller n amount of chunks, then that individual latency per request
> > > accumulates to n times the individual latency, eventually leading to
> > > degraded transmission speed as those requests are serialized.
> > 
> > It's easy to increase the blocksize in benchmarks, but real applications
> > offer less control over the I/O pattern. If latency in the device
> > implementation (QEMU) is the root cause then reduce the latency to speed
> > up all applications, even those that cannot send huge requests.
> 
> Which I did, still do, and also mentioned before, e.g.:
> 
> 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
> 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency optimization
> 
> Reducing overall latency is a process that is ongoing and will still take a 
> very long development time. Not because of me, but because of lack of 
> reviewers. And even then, it does not make the effort to support higher 
> transmission sizes obsolete.
> 
> > One idea is request merging on the QEMU side. If the application sends
> > 10 sequential read or write requests, coalesce them together before the
> > main part of request processing begins in the device. Process a single
> > large request to spread the cost of the file server over the 10
> > requests. (virtio-blk has request merging to help with the cost of lots
> > of small qcow2 I/O requests.) The cool thing about this is that the
> > guest does not need to change its I/O pattern to benefit from the
> > optimization.
> > 
> > Stefan
> 
> Ok, don't get me wrong: I appreciate that you are suggesting approaches that 
> could improve things. But I could already hand you over a huge list of mine. 
> The limiting factor here is not the lack of ideas of what could be improved, 
> but rather the lack of people helping out actively on 9p side:
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html
> 
> The situation on kernel side is the same. I already have a huge list of what 
> could & should be improved. But there is basically no reviewer for 9p patches 
> on Linux kernel side either.
> 
> The much I appreciate suggestions of what could be improved, I would 
> appreciate much more if there was *anybody* actively assisting as well. In the 
> time being I have to work the list down in small patch chunks, priority based.

I see request merging as an alternative to this patch series, not as an
additional idea.

My thoughts behind this is that request merging is less work than this
patch series and more broadly applicable. It would be easy to merge (no
idea how easy it is to implement though) in QEMU's virtio-9p device
implementation, does not require changes across the stack, and benefits
applications that can't change their I/O pattern to take advantage of
huge requests.

There is a risk that request merging won't pan out, it could have worse
performance than submitting huge requests.

Stefan
Christian Schoenebeck Nov. 15, 2021, 2:32 p.m. UTC | #27
On Montag, 15. November 2021 12:54:32 CET Stefan Hajnoczi wrote:
> On Thu, Nov 11, 2021 at 06:54:03PM +0100, Christian Schoenebeck wrote:
> > On Donnerstag, 11. November 2021 17:31:52 CET Stefan Hajnoczi wrote:
> > > On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> > > > On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > > > > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > > > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > > > > As you are apparently reluctant for changing the virtio specs,
> > > > > > what
> > > > > > about
> > > > > > introducing those discussed virtio capabalities either as
> > > > > > experimental
> > > > > > ones
> > > > > > without specs changes, or even just as 9p specific device
> > > > > > capabilities
> > > > > > for
> > > > > > now. I mean those could be revoked on both sides at any time
> > > > > > anyway.
> > > > > 
> > > > > I would like to understand the root cause before making changes.
> > > > > 
> > > > > "It's faster when I do X" is useful information but it doesn't
> > > > > necessarily mean doing X is the solution. The "it's faster when I do
> > > > > X
> > > > > because Y" part is missing in my mind. Once there is evidence that
> > > > > shows
> > > > > Y then it will be clearer if X is a good solution, if there's a more
> > > > > general solution, or if it was just a side-effect.
> > > > 
> > > > I think I made it clear that the root cause of the observed
> > > > performance
> > > > gain with rising transmission size is latency (and also that
> > > > performance
> > > > is not the only reason for addressing this queue size issue).
> > > > 
> > > > Each request roundtrip has a certain minimum latency, the virtio ring
> > > > alone
> > > > has its latency, plus latency of the controller portion of the file
> > > > server
> > > > (e.g. permissions, sandbox checks, file IDs) that is executed with
> > > > *every*
> > > > request, plus latency of dispatching the request handling between
> > > > threads
> > > > several times back and forth (also for each request).
> > > > 
> > > > Therefore when you split large payloads (e.g. reading a large file)
> > > > into
> > > > smaller n amount of chunks, then that individual latency per request
> > > > accumulates to n times the individual latency, eventually leading to
> > > > degraded transmission speed as those requests are serialized.
> > > 
> > > It's easy to increase the blocksize in benchmarks, but real applications
> > > offer less control over the I/O pattern. If latency in the device
> > > implementation (QEMU) is the root cause then reduce the latency to speed
> > > up all applications, even those that cannot send huge requests.
> > 
> > Which I did, still do, and also mentioned before, e.g.:
> > 
> > 8d6cb100731c4d28535adbf2a3c2d1f29be3fef4 9pfs: reduce latency of Twalk
> > 0c4356ba7dafc8ecb5877a42fc0d68d45ccf5951 9pfs: T_readdir latency
> > optimization
> > 
> > Reducing overall latency is a process that is ongoing and will still take
> > a
> > very long development time. Not because of me, but because of lack of
> > reviewers. And even then, it does not make the effort to support higher
> > transmission sizes obsolete.
> > 
> > > One idea is request merging on the QEMU side. If the application sends
> > > 10 sequential read or write requests, coalesce them together before the
> > > main part of request processing begins in the device. Process a single
> > > large request to spread the cost of the file server over the 10
> > > requests. (virtio-blk has request merging to help with the cost of lots
> > > of small qcow2 I/O requests.) The cool thing about this is that the
> > > guest does not need to change its I/O pattern to benefit from the
> > > optimization.
> > > 
> > > Stefan
> > 
> > Ok, don't get me wrong: I appreciate that you are suggesting approaches
> > that could improve things. But I could already hand you over a huge list
> > of mine. The limiting factor here is not the lack of ideas of what could
> > be improved, but rather the lack of people helping out actively on 9p
> > side:
> > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06452.html
> > 
> > The situation on kernel side is the same. I already have a huge list of
> > what could & should be improved. But there is basically no reviewer for
> > 9p patches on Linux kernel side either.
> > 
> > The much I appreciate suggestions of what could be improved, I would
> > appreciate much more if there was *anybody* actively assisting as well. In
> > the time being I have to work the list down in small patch chunks,
> > priority based.
> I see request merging as an alternative to this patch series, not as an
> additional idea.

It is not an alternative. Like I said before, even if it solved the 
sequential I/O performance issue (without simply moving the problem somewhere 
else), which I doubt, your suggestion would still not resolve the semantic 
conflict in virtio's "maximum queue size" terminology: i.e. max. active/
pending messages vs. max. transmission size per message. Denying that simply 
means postponing a proper solution of this virtio issue.

The legitimate concerns you came up with can easily be addressed by two virtio 
capabilities to make things clean and officially supported by both ends, which 
could also be revoked at any time without breaking things if there were any 
real-life issues actually coming up on virtio level in future. The rest is 
already prepared.

> My thoughts behind this is that request merging is less work than this
> patch series and more broadly applicable.

It is definitely not less work. All I still have to do is add two virtio 
capabilities as you suggested, either as official virtio ones or as 9p device 
specific ones. The other outstanding tasks on my side are independent of this 
overall issue.

> It would be easy to merge (no
> idea how easy it is to implement though) in QEMU's virtio-9p device
> implementation, does not require changes across the stack, and benefits
> applications that can't change their I/O pattern to take advantage of
> huge requests.

And it would mean waiting on every single request to see whether more 
sequential requests might come in at some point, i.e. it would even increase 
latency and worsen the situation for random I/O, increase the probability of 
a full queue with the client having to wait too often, pile up more complex 
code, and what not.

Your idea of merging sequential requests on QEMU side already fails at the 
initial point: a sequential read is typically initiated by sequential calls 
to read() by a guest application thread. However, that read() function call 
must return some data before the guest app thread is able to call read() 
again for subsequent chunks.
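
For illustration, the typical guest application pattern looks like this; the
next chunk cannot even be requested before the previous read() (and hence its
9p round trip) has completed:

#include <unistd.h>

/* Illustrative guest application: strictly serialized sequential read. */
static void read_whole_file(int fd)
{
    char buf[65536];
    ssize_t n;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* consume buf; only now the next read() can be issued */
    }
}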

> There is a risk that request merging won't pan out, it could have worse
> performance than submitting huge requests.

That's not a risk, it is not even feasible. Plus my plan is to improve 
performance for various use case patterns, especially for both sequential and 
random I/O patterns, not just only one of them.

A more appropriate variant of what you suggested would be, e.g., to extend 
the Linux 9p client's already existing, optional fs-cache feature with an 
optional read-ahead feature. Again, that is one of many things on my TODO 
list, but there is also still a bunch of things to do on fs-cache alone 
before being able to start that.

We have discussed this issue for almost 2 months now. I think it is time to 
move on. If you are still convinced about your ideas, please send your patches 
and benchmark results.

I would appreciate if you'd let me know whether I should suggest the discussed 
two virtio capabilities as official virtio ones, or whether I should directly 
go for 9p device specific ones instead.

Thanks!

Best regards,
Christian Schoenebeck
Stefan Hajnoczi Nov. 16, 2021, 11:13 a.m. UTC | #28
On Mon, Nov 15, 2021 at 03:32:46PM +0100, Christian Schoenebeck wrote:
> I would appreciate if you'd let me know whether I should suggest the discussed 
> two virtio capabilities as official virtio ones, or whether I should directly 
> go for 9p device specific ones instead.

Please propose changes for increasing the maximum descriptor chain
length in the common virtqueue section of the spec.

Stefan